I get a lot of questions about this and see quite a few postings about Analysis Services processing in the MSDN forums. So you created an SSAS database solution and have deployed it to a production environment. Now the data has accumulated and you need to look at moving to an incremental processing approach, because you are either loading the data more frequently or the volume of data and processing no longer fits within your maintenance window. So what do you do?
The solution is to move to a more incremental processing approach instead of just running a Process Full on the database. You have a few options here, and I am going to discuss what happens during a dimension update process and also go over an SSIS solution that I put together.
In order to update and refresh the data in your dimensions you can perform a Process Update on the dimension (or a Process Add if you are only adding new members and not updating any records, but that is typically not the case). So what happens to the cube data when you update a dimension? Well, if nothing changed then you are okay, but if you have flexible relationships and the data within those relationships changed, then the aggregations built on those attributes are dropped from the partitions. Here is a walkthrough showing this:
Aggregation Advanced View Display in SSAS 2008
Query the partition information to view the aggregation information
Process Update with no changes – everything will stay the same
Process Update and there were data changes to dimension attributes that had aggregation designs
Aggregation designs are dropped that referenced those dimension attributes (A0 and A2 were dropped). How do you get them back?
Run a Process Default on the affected partitions (you could also choose the option to process affected objects during your dimension processing; this will do the same thing)
Aggregation designs are now back
So that was just a quick overview of what happens when you do a refresh of a dimension in your SSAS database. You can run this, just be aware of what is going on behind the scenes. If there were changes and relationships were updated, then you will also need to rebuild the aggregations and indexes if those attributes were included in any aggregation designs.
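The walkthrough above can be sketched as a toy model in plain Python. This is purely illustrative: real SSAS drops and rebuilds aggregations server-side during Process Update and Process Default, and the aggregation names (A0–A2) and attribute names here just mirror the example.

```python
# Toy model of the Process Update / Process Default interplay shown above.
# A partition tracks its aggregation design (which attributes each
# aggregation references) and which aggregations are currently built.

def process_update(partition, changed_attrs):
    """Drop built aggregations that reference a changed flexible attribute."""
    partition["built"] = {agg for agg in partition["built"]
                          if not partition["design"][agg] & changed_attrs}

def process_default(partition):
    """Rebuild any aggregations in the design that are not currently built."""
    partition["built"] = set(partition["design"])

# A partition whose aggregation design contains three aggregations:
part = {
    "design": {"A0": {"Product"}, "A1": {"Date"}, "A2": {"Product", "City"}},
    "built": {"A0", "A1", "A2"},
}
process_update(part, {"Product"})   # Product members changed
# A0 and A2 are dropped (they reference Product); A1 survives
process_default(part)               # rebuilds the missing aggregations
```

Running Process Update with an empty change set leaves everything built, which matches the "no changes – everything stays the same" step in the walkthrough.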
So if you want to incorporate this logic along with an incremental update of your partitions, what can you do? For this I leveraged SSIS and set up a package that allows for either a full or incremental update process. In my example I actually do this for a specific cube in an SSAS database, so I only process the dimensions that are associated with that particular cube.
I have to give credit to Vidas Matelis here for providing some of the AMO code for the processing, SSIS package that process all partitions/measure groups/cubes in one database. I modified it a bit to limit the scope to a particular cube and added a Try Catch block in the code, but it is what I used as a starting point. Using AMO allows the process to be more flexible and dynamic. As with all things, as you expand on concepts and build them out they can start to get a bit complex, so I will try to explain this as briefly as I can for this posting.
- I added a script task as a starting point so that I could evaluate the ProcessFull variable, and I also use it to set the number of months that will be processed (or this could be years, depending on your volume). I use a naming convention in the partition names so that it is like partition_name YYYY or partition_name YYYYMM. In this process it will be the latter YYYYMM format. So if IncrementalMonthsProcess is set to 2, then at the time of this posting the value would be set to 200909, and this will be used later in the data flow task for evaluation, where the names of the partitions will be parsed.
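The cutoff calculation and partition-name parsing described in this step can be sketched in plain Python (the actual package does this with AMO/.NET inside SSIS script tasks; function names here are illustrative):

```python
from datetime import date

def incremental_cutoff(months, today=None):
    """Return the oldest YYYYMM period to reprocess, counting back from
    the current month (months=2 -> current month plus one prior month)."""
    today = today or date.today()
    y, m = today.year, today.month
    m -= months - 1
    while m <= 0:        # roll back across a year boundary if needed
        m += 12
        y -= 1
    return y * 100 + m

def parse_partition_period(name):
    """Parse a trailing YYYY or YYYYMM from a partition name; 0 if none."""
    suffix = name.rsplit(" ", 1)[-1]
    if suffix.isdigit() and len(suffix) in (4, 6):
        return int(suffix)
    return 0
```

For example, with IncrementalMonthsProcess = 2 in October 2009, `incremental_cutoff(2, date(2009, 10, 15))` yields 200909, matching the value mentioned above.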
- Now at this point you move on to either the Full or Incremental processing. The Full path is fairly self-explanatory: I have it broken into a full dimension process script task and then another script task for the partition processing. I broke it into two tasks so that I could get some additional duration information for logging and history. The dimension processing only processes the dimensions for the specified cube, and the same applies to the incremental approach. So if you have multiple cubes with more dimensions, or some unused dimensions, those would not be touched.
- The incremental process starts with a script task that does a Process Update on all of the dimensions in the specified cube. It is the same script task code as the full version, with the exception of the type of processing that is performed.
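Scoping the dimension processing to one cube can be sketched as follows (a minimal plain-Python model, not AMO; the dictionary layout and cube names are made up for illustration):

```python
# Build the per-dimension processing commands for a single cube, so
# dimensions used only by other cubes in the database are never touched.
def dimension_batch(database, cube_name, full=False):
    """List (dimension, process_type) commands for one cube's dimensions."""
    ptype = "ProcessFull" if full else "ProcessUpdate"
    return [(d, ptype) for d in sorted(set(database["cubes"][cube_name]))]

db = {"cubes": {"Sales": ["Date", "Product", "Customer"],
                "Finance": ["Date", "Account"]}}
# An incremental run scoped to Sales only emits ProcessUpdate commands
# for Customer, Date, and Product; Account is left alone.
batch = dimension_batch(db, "Sales")
```

In the real package the equivalent lookup walks the cube's dimension collection through AMO and issues the processing calls inside a Try Catch block.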
- The data flow task is used to create three recordsets of data that will be used for the partition processing – a full, data, and default processing list. I set up a script component as a source to create a recordset of values and to parse out the date information in the partition names if it exists.
- The first list created is the full list of partitions, and these get placed in the default processing list (object variable) that will be used in the last step of the package.
- The conditional split transform is used to determine which partitions will have a full process performed on them (no date information included in the partition names, so yearinfo was set to zero) along with the partitions that will be part of the incremental update. These partitions are typically part of a large measure group where the data has been partitioned by year or month (in this case month), and we are only going to process the last 2 months based on step 1. We do just a Process Data on these partitions (this reduces the impact on the server and, once it is done, makes the data accessible to end users – the final step in the package will build out the aggregations and indexes). The remaining partitions will be left untouched (at least in regards to processing the data; aggregations might be rebuilt if they were affected).
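The split logic can be sketched like this (plain Python standing in for the SSIS conditional split; it assumes partition names have already been parsed into a numeric period, 0 for no date, YYYY for yearly, YYYYMM for monthly, and it keeps yearly-named partitions out of the full-process list):

```python
# Mimic the conditional split: undated partitions get a full process,
# dated partitions at/after the cutoff get ProcessData, and older dated
# partitions are left alone (ProcessDefault later may still rebuild
# their aggregations if a dimension update affected them).
def split_partitions(partitions, cutoff):
    """partitions: list of (name, period); cutoff: YYYYMM from step 1."""
    full_list, data_list = [], []
    for name, period in partitions:
        if period == 0:                      # no date in the name
            full_list.append(name)
        elif period < 10000:                 # yearly YYYY partition
            if period >= cutoff // 100:
                data_list.append(name)
        elif period >= cutoff:               # monthly YYYYMM partition
            data_list.append(name)
    return full_list, data_list
```

With a cutoff of 200909, a partition named `Sales 200910` lands in the Process Data list, `Sales 200908` is left untouched, and a partition with no date suffix gets a full process.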
- Now back in the control flow, to finish off the process we first process the full partition list, then move on to the Process Data partition list, and finally the default processing list. As shown above in the screenshots, the last step will do nothing to the partitions that were fully processed, but it will build the aggregations and indexes for the Process Data partitions and also for any additional partitions that might have been affected by the dimension updates.
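The ordering of those final control-flow steps can be summarized as a simple plan builder (again a plain-Python sketch; the real package issues these as AMO processing calls):

```python
# Final control-flow order: ProcessFull first, then ProcessData on the
# incremental partitions, then ProcessDefault across every partition so
# missing aggregations/indexes get rebuilt.  ProcessDefault is a no-op
# on partitions that were just fully processed.
def processing_plan(full_list, data_list, all_partitions):
    plan = [("ProcessFull", p) for p in full_list]
    plan += [("ProcessData", p) for p in data_list]
    plan += [("ProcessDefault", p) for p in all_partitions]
    return plan
```

Note that `all_partitions` is the default processing list captured earlier in the data flow, which is why even untouched partitions get their aggregations rebuilt when a dimension update affected them.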
So that was a quick overview of the process. There are definitely other ways you could go about this; this is simply one solution that I came up with, and it has worked well. I have made a few custom modifications as needed, such as doing incremental processing by year or month and limiting it to a particular cube, but these can quickly be modified and put to use in pretty much any environment. Just set the variable values (ASServer, ASDatabase, ASCube, ProcessFull, IncrementalMonthsProcess) and you are off to the races. If you want to turn on SSIS logging, that is up to you. I have used the SQL Server table logging and then leveraged the SSIS report pack for reporting purposes, but that is entirely up to you.
I have uploaded the package to my SkyDrive account if you want to check it out. SSIS 2005 and 2008 versions are both available in the download, and I have annotations included in the packages.
Hope you find this useful
(Click on the image above that contains the link to download. You should be able to right-click and do a save target.)
UPDATE (6/2/2010): Package download has been updated to reflect a fix to handle the partitions with a yearly naming convention so that they do not get included in the full processing. This was discovered during my May 2010 presentation preparation – PASSMN May 2010 Follow-up.