Dan English's BI Blog

Welcome to my random thoughts in regards to Business Intelligence, databases, and other technologies

Archive for the ‘Integration Services’ Category

SSIS Package Copy/Paste New GUID (ID) Fix

Posted by denglishbi on March 25, 2011

When doing development with Integration Services (SSIS) in Business Intelligence Development Studio (BIDS) it is quite common to copy an existing SSIS package to begin work on a new one.  This could be because you have a package template that you are using in your environment, or possibly just to back up your package because you want to make a modification and are not using a source control system (and skipping source control is definitely not a best practice).

The one thing you need to be aware of, at least when performing a copy/paste of an existing package in BIDS 2005 and 2008, is that each package is set up with a unique identifier known as a GUID.

SSIS_BIDS2008_PackageGUID

When you copy and paste the existing package to begin working on a new version, the cloned package keeps all of the same values as the template copy of the package.

image

This can become very problematic when you have SSIS logging enabled and are trying to troubleshoot a package and all of your packages have the same GUID value assigned to them.  So in BIDS 2005 and 2008 the way to get around this problem was to use the drop-down list in the ID property to generate a new ID (GUID) value for the copied (cloned) version of the master package.

image

NOTE: Doing this only assigns a new ID (GUID) value to the package.  All of the other items in the package will still have the same ID values as the template package.

The other option you have in changing the ID values is to leverage the BIDSHelper add-in that is available on CodePlex.  I had forgotten about this option and a special thanks goes out to Bhavik Merchant (Blog | Twitter) for reminding me about this through Twitter land yesterday.

image

The nice thing about doing this with the BIDSHelper add-in is that it will not only assign the package a new ID value, but also all of the other items within the package!

image

This is a great feature because if you have a setup like I did above with a Data Flow Task and are going to use Data Viewers, you might have run into this issue before – SSIS package copy paste data flow task (dft) data viewer issue.  This bug exists in BIDS 2005 and 2008 (it has been fixed in 2008 R2 though).  When you add a Data Viewer into your cloned package and then go back to your template package you will see that the Data Viewer exists in both!  Wow, what a cool feature, and it really freaks you out when you run your cloned package and get all of these popups, especially if you have 10 or 15 cloned packages in a single project.  Really neat, NOT.

Cloned Package 2

image

Template Package

image

But if you are using the BIDSHelper add-in, this is not a problem because it assigns new ID values to all of the items in the package.

Cloned Package 1

image

Let’s fast forward now to BIDS 2008 R2.  If you copy and paste a package in BIDS 2008 R2 the new package automatically gets assigned a new ID (GUID) value.  So the issue of having multiple packages with the same ID values has been resolved out-of-the-box with no additional actions required by the developer.  A nice little feature that I only noticed yesterday – I had still been using the generate new ID option without even realizing the copied package already had a new ID value.

BIDS 2008 R2

Template Package

image

Cloned Package

image

The trick is that you have to actually rename the package and it will assign a new ID value to the copied package.

This doesn’t assign new values to the other items, but the issue that you experience with the Data Flow Task and Data Viewers is fixed in BIDS 2008 R2.

Cloned Package

image

Template Package

image

So with BIDS 2008 R2 you get new ID values for each copied SSIS package, and the issue with using Data Viewers in copied Data Flow Tasks has been fixed as well.  You might still want to leverage the functionality of BIDSHelper, though, to assign all items in the package a new GUID (ID) value.

Pretty cool and I didn’t even realize this was updated in BIDS 2008 R2.

UPDATE (4/11/2011): Actually, I realized that this is fixed in BIDS 2008 as well.  The reason I didn’t notice it before is that you need to actually rename the package file after the copy/paste.  Once you do this it will assign the package a new GUID (ID) value.

Posted in Integration Services | Tagged: | 5 Comments »

SSIS Data Flow vs. Hand Coded ETL

Posted by denglishbi on February 27, 2011


“Why do I need an ETL tool?”

“Why would I want to use Integration Services (SSIS)?”

“Why use a source-to-target setup in ETL process?”

It is questions and statements like these that I hear often.  These questions can generate lots of discussion, and there are many different topics that you could go into regarding the benefits of using an ETL tool versus purely coded stored procedures or T-SQL.  Some of the benefits you can get with ETL tools are:

  • control flow,
  • transformations,
  • logging,
  • visual representation of work,
  • more options than just source-to-target,
  • reusability,
  • audit details,
  • configurations,
  • memory management,
  • etc.

I could go on, but that is not the topic of this blog posting.  What I want to discuss is a scenario comparing a potential consequence of using stored procedure or T-SQL ETL versus what I would consider more standard ETL using source and target components.  The other item that I want to point out in this example is the use of ‘SELECT * FROM’ and its impact.  In all of these examples we will be leveraging a database view for our source; this simply adds an additional layer of abstraction and a reference point instead of using the underlying table directly.

Situation

Let me set the stage for this.  In the data warehouse we need to load Employee information to reference.  For simplicity we are not interested in tracking any sort of changes over time and just want a current view of the Employees.  All we want to do is take the information from one table and completely refresh a data warehouse table that we will use to deliver information to our end-users.

Development

Piece of cake.  Simply delete the data from our destination data warehouse table, select the data from the employee source table, and insert the source contents into the destination.  We have a planning meeting and discuss the different options that we could go with to perform this operation – stored procedure, T-SQL, or SSIS.  Sure, there are other options that we could possibly consider, and for what it is worth, the first two are essentially the same; it just determines where you want the logic to sit.

So we decide to evaluate the different options.  In this example we are going to set all three options up in an SSIS package to evaluate.

Option 1: Stored Procedure

The first one is to call a stored procedure where the logic for the load will reside.  We create the stored procedure:

CREATE PROCEDURE [dbo].[usp_DWRefresh] AS
BEGIN
SET NOCOUNT ON;
--clear out DW employee table
TRUNCATE TABLE DWEmployee;
--reload DW employee table
INSERT INTO DWEmployee
SELECT * FROM v_Employee;
END

And then in the SSIS package we utilize an Execute SQL Task to call the stored procedure:

image

image

Ok, that one is done, now on to the second option using T-SQL.

Option 2: T-SQL

In this example we are simply going to use an Execute SQL Task in SSIS and place the logic that was in the stored procedure directly in the task, so now the logic is embedded in the package.

image

image

Nothing fancy going on here, simply truncate the table and insert into the table with a select * from the view – the statement is shown below.
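For reference, the statement placed in the Execute SQL Task is essentially the same T-SQL that the stored procedure wrapped:

TRUNCATE TABLE DWEmployee;

INSERT INTO DWEmployee
SELECT * FROM v_Employee;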

Now we go on to the final option that we are going to look at, using a Data Flow Task.

Option 3: Data Flow Task (SSIS)

In this final option we are going to continue to use the ‘SELECT * FROM’ logic for the source, but this time we are going to include this logic in a Data Flow Task and use a source-to-target ETL setup leveraging components.

image

In this package we first use an Execute SQL Task to clear out the data in our destination table using the truncate table logic that was included in the T-SQL example.  Next we use a Data Flow Task to transfer the data from our source and load it into our target (destination) table that we will use in our data warehouse.

image

OLE DB Source setup:

imageimage

OLE DB Destination setup:

image image

One thing to point out here is that even though we are using the ‘SELECT * FROM’ setup for the source, the column names are still being set and mapped directly to the destination table.  This is important and you will see why as we continue on in our evaluation.

Evaluation Process

Okay, the stage is set and we are ready to test everything out.  We run the package with the different options:

image

Everything looks great.

Gotcha

But wait, nothing stays this simple or goes unchanged, right?  Right.  Now we have a user who decides that they would like a new column in v_Employee that provides the full name of the employee by combining the employee’s last name and first name.  Not a problem, and it makes complete sense to have this set up in the source to reference and centralize this logic.  The new column is set up in the view and saved with the highlighted logic below added.

image
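As a rough sketch of what that change looks like (the underlying table and base column names here are assumptions, since only the new FullName column matters for this example), the view definition ends up along these lines:

ALTER VIEW dbo.v_Employee
AS
SELECT EmployeeKey,                             -- existing columns carried through as before (names assumed)
       FirstName,
       LastName,
       LastName + ', ' + FirstName AS FullName  -- new column combining last name and first name
FROM dbo.Employee;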

What does this do to the logic that we set up?

Option 1: Stored Procedure

image

Process breaks with message

Error Message:

[Execute SQL Task] Error: Executing the query “usp_DWRefresh” failed with the following error: “Column name or number of supplied values does not match table definition.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly.

Option 2: T-SQL

image

Process breaks with the following message

Error Message:

[Execute SQL Task] Error: Executing the query ”    TRUNCATE TABLE DWEmployee;
  
    INSERT INT…” failed with the following error: “Column name or number of supplied values does not match table definition.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly.

Option 3: Data Flow Task (SSIS)

image

Process continues to work with the following warning/informational message

Warning Message:

[OLEDB_SRC_v_Employee [1]] Warning: The external columns for component “OLEDB_SRC_v_Employee” (1) are out of synchronization with the data source columns. The column “FullName” needs to be added to the external columns.

Interesting…  Ok, so why would I want to use an ETL tool and, even more than that, a standard source-to-target data process?  The data flow setup continues to work because during the setup the individual columns get referenced in the flow and mapped in the destination.  If either the source or the target has additional columns added, that is okay.  We will be alerted of this, but it will not break our process.  Items that could potentially impact our process would be data type changes or column name modifications.
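If you do need to stick with the stored procedure or T-SQL approach, one way to get a similar level of protection is to list the columns explicitly on both the INSERT and the SELECT instead of relying on ‘SELECT * FROM’.  A minimal sketch is below (the column names are assumptions); the new FullName column on the view is simply ignored until you decide to map it:

INSERT INTO DWEmployee (EmployeeKey, FirstName, LastName)
SELECT EmployeeKey, FirstName, LastName
FROM v_Employee;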

Conclusion

Maybe this is not being too fair, because who would ever go with a ‘SELECT * FROM’ setup to load a destination table?  I mean, our sources and targets never change, right?  Well, I have seen it, and when you are in a crunch you just might put something like this together without realizing the impact it might have down the road.  Maybe you set this logic up just to test something out, and we know what happens to test and proof-of-concept work, right…they sometimes get moved to production.

So please be careful and really evaluate not only the logic you use when loading data, but also the option and tool you will use for the process.  Tools like Integration Services (SSIS) have tons of benefits that far outweigh the alternatives.

Some additional postings and articles that you might want to look at are below that further expand on the reason to leverage ETL or help explain reasoning and purposes:

Posted in Integration Services | Tagged: | 4 Comments »

Calculating Cumulative Values in Microsoft BI Tools

Posted by denglishbi on January 24, 2011

When working on a business intelligence project, a common situation you will find yourself in is having to come up with a cumulative value (running value) calculation.  The challenge is that depending on what product you are working with, the calculation that you need is going to be different.  In this post I will compare how this can be done with T-SQL, Excel, SSIS, SSRS, MDX, and DAX.  Along the way I will provide some additional reference links to other options and content, and I will point out some differences in how the totals are being calculated so you will know what to expect with the results.

In these examples I will be working with the Contoso Retail Data Warehouse data sample.  The examples below are not necessarily the optimal approaches; they just show how you can go about getting the desired result.

SQL Server T-SQL

So let’s start out by taking a quick look at how to calculate the sales over each of the years directly against the data in the database through SQL Server Management Studio.

SELECT d.CalendarYear,
       SUM(f.SalesAmount) SalesAmount,
       SUM(f.SalesAmount) + COALESCE((SELECT SUM(SalesAmount)
                        FROM FactSales fs JOIN DimDate dd ON fs.DateKey = dd.DateKey
                        WHERE dd.CalendarYear < d.CalendarYear),0) CumSalesAmount
FROM FactSales f
INNER JOIN DimDate d ON f.DateKey = d.DateKey
GROUP BY d.CalendarYear
ORDER BY d.CalendarYear

And the results for this are:

CalendarYear SalesAmount           CumSalesAmount
———— ——————— ———————
2007         4561940955.0212       4561940955.0212
2008         4111233534.6841       8673174489.7053
2009         3740483119.1823       12413657608.8876

 

Not too bad.  We have an inner query that is being used to return the previous year(s) values and add them onto the current year (if they exist).  If no value is returned then a zero is used in its place.  For additional examples and options take a look at this posting by Garth Wells – Calculating Running Totals.
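As a side note, newer releases of SQL Server (2012 and later) support ordered windowed aggregates, which let you express the same cumulative value more directly.  Here is a sketch against the same tables:

SELECT d.CalendarYear,
       SUM(f.SalesAmount) AS SalesAmount,
       SUM(SUM(f.SalesAmount)) OVER (ORDER BY d.CalendarYear
           ROWS UNBOUNDED PRECEDING) AS CumSalesAmount
FROM FactSales f
INNER JOIN DimDate d ON f.DateKey = d.DateKey
GROUP BY d.CalendarYear
ORDER BY d.CalendarYear;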

Microsoft Excel

So now let’s take a look at how this would be done if we were using Excel.

image

There are a few different options you have here; in this example I enter a formula in the C2 cell of =SUM($B$2:B2) and then copy that formula down into cells C3 and C4.

SQL Server Integration Services (SSIS)

Now we will take a quick look at how this could be done in the ETL process.  As we are pulling data from a source we can evaluate the values in the data flow and accumulate them inside a Script Component.  In the source we will simply get the sales for each year and then append the values in the flow to each other to get the same results displayed above.

image

The cumulative value is calculated in the Script Component with some code.  We have a variable set up that we use to accumulate the sales amount of each row in the data flow, and this value is then sent back out in a new output column called CumulativeSales.

image

Nothing too fancy going on here.  I don’t have a destination in the flow; I just added a derived column to get the data viewer included so that I could run the data flow to show the results.

SQL Server Reporting Services (SSRS)

Now let’s switch over to Reporting Services.  In this example we will not do the cumulative value in the dataset query; we will do this calculation in the report.
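The dataset query itself can then stay simple, something along these lines for DataSet1 (a sketch against the same Contoso tables used earlier):

SELECT d.CalendarYear,
       SUM(f.SalesAmount) AS SalesAmount
FROM FactSales f
INNER JOIN DimDate d ON f.DateKey = d.DateKey
GROUP BY d.CalendarYear
ORDER BY d.CalendarYear;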

image

In this example we leverage the RunningValue aggregate function that is available in Reporting Services.  The expression that we use in the Cumulative Sales column is =RunningValue(Fields!SalesAmount.Value, Sum, "DataSet1")

image

This function returns a running aggregate of the values in our specified dataset using the SUM function which is also specified in the expression.

SQL Server Analysis Services (SSAS) MDX

Now how about taking a look at Analysis Services.  How would we go about doing this calculation using MDX?

WITH MEMBER [Measures].[Cumulative Sales] AS
    SUM({null:[Date].[Calendar Year].CurrentMember},[Measures].[Sales Amount])
SELECT {[Measures].[Sales Amount],[Measures].[Cumulative Sales]} ON 0,
    NON EMPTY [Date].[Calendar Year].[Calendar Year].Members ON 1
FROM [Sales]

And the results of this are:

image

We can apply some filters to the MDX statement to remove the years that currently don’t have a Sales Amount associated with them.

WITH MEMBER [Measures].[Cumulative Sales] AS
    SUM({null:[Date].[Calendar Year].CurrentMember},[Measures].[Sales Amount])
SELECT {[Measures].[Sales Amount],[Measures].[Cumulative Sales]} ON 0,
    NONEMPTY([Date].[Calendar Year].[Calendar Year].Members,[Measures].[Sales Amount]) ON 1
FROM [Sales]

This returns what we have seen in our other results:

image

There are additional ways of going about this calculation, and for more options take a look at this recent posting by Amish Manubhai Shah – Various way to calculate running total from All Period to Currentmember.  Now I want to show how this solution differs a bit from the other ones we have seen.  What happens if I apply a filter to the query to only display 2008 and 2009?  What would you expect?

image

Might not be what you would want, but then again maybe it is.  The Cumulative Sales calculation is summing up the values from the beginning of time (the [All] level).  So the first value that we see in 2008 includes the Sales Amount from 2007 as well (and if there were sales in 2006 it would include those too, and so on).

You can make a few modifications to the calculation and set up a named set to reference the items and come up with something along these lines.

WITH DYNAMIC SET [SelectedYears] AS Existing [Date].[Calendar Year].[Calendar Year].Members
MEMBER MEASURES.[Cumulative Sales] AS
    SUM({EXISTS([Date].[Calendar YQMD].[Calendar Year].Members,
    [SelectedYears]).Item(0):[Date].[Calendar YQMD].CurrentMember},
    [Measures].[Sales Amount])
SELECT {Measures.[Sales Amount], Measures.[Cumulative Sales]} ON 0,
    [Date].[Calendar YQMD].[Calendar Year].Members ON 1
FROM [Sales]
WHERE {[Date].[Calendar Year].&[2008],[Date].[Calendar Year].&[2009]}

With the following results:

image

Here a dynamic set is used to get references to the members that are in the slicer (WHERE clause), and this information is fed into the cumulative sales calculation, which grabs the first item to come up with the reference to the 2008 year.  This becomes the starting point for the cumulative total instead of the beginning of time.

You might just want to look at year-to-date cumulative totals and be able to drill down into a user-defined hierarchy if you have one set up like Year-Quarter-Month-Date.  If we modify our calculation a bit we might end up with something like the following:

WITH MEMBER [Measures].[Cumulative Sales] AS
    SUM(YTD([Date].[Calendar YQMD].CurrentMember),[Measures].[Sales Amount])
SELECT {[Measures].[Sales Amount],[Measures].[Cumulative Sales]} ON 0,
    Hierarchize({[Date].[Calendar YQMD].[Calendar Year].Members,
    [Date].[Calendar YQMD].[Calendar Quarter].Members}) ON 1
FROM [Sales]
WHERE {[Date].[Calendar Year].&[2008],
    [Date].[Calendar Year].&[2009]}

 

With the following results:

image

Here we can see that our cumulative total stops at the year level, and at the lower levels the values are accumulating as expected, so the cumulative total at Q4 is the same as the yearly value.  What we have now is a YTD calculation.

PowerPivot for Excel 2010 DAX

We have one more product that I want to take a look at and that would be PowerPivot.  This is going to be fairly similar to the Analysis Services solution since it actually is Analysis Services behind the scenes; the only difference here is that we will be using DAX instead of MDX.

So for the first example we will set up a cumulative sales calculation just like we did with the first MDX example.  The calculation we use is as follows:

=CALCULATE(SUM(FactSales[SalesAmount]),
    DATESBETWEEN(DimDate[Datekey],
        FIRSTDATE(ALL(DimDate[Datekey])),
        LASTDATE(DimDate[Datekey])))

So we end up with the following:

image

You could go ahead and add some additional logic to evaluate the Sales Amount to determine if a value exists, but you get the general idea here.

Once again this is calculating the cumulative total from the beginning of all time, which is done by using the ALL reference in the calculation for the FIRSTDATE.  If we filter the years and only display 2008 and 2009 we will see results similar to what we saw with MDX, where the cumulative sales amount for 2008 includes the 2007 sales.

image

The one nice thing about PowerPivot is that this is in Excel, so if you want to do anything a little more custom or make any references to cells you can do that and use Excel functions as well.  As far as setting up a cumulative sales amount like the second MDX example, I can’t determine an equivalent in DAX.  Maybe someone else has an idea if this can be done so that it only accumulates based on the Date values being evaluated.  If you do, please leave a comment.

You can do the YTD calculation in DAX and here is the formula for that:

=TOTALYTD(SUM(FactSales[SalesAmount]),DimDate[Datekey])

And the results look like this if we break out the years by the quarters:

image

Once again we can see that the yearly total is equivalent to the Q4 for the same year.  For some additional information and explanation of the PowerPivot calculations take a look at a great post by Kasper de Jonge here Accumulate values (running value) over multiple years in PowerPivot using DAX.

Conclusion

Cumulative total (running value) calculations are handled differently within each of the Microsoft products.  We took a look at some examples of how this can be done in T-SQL, Excel, SSIS, SSRS, MDX, and DAX.  These are not the only options, so if you want to see more take a look at the additional posts that I included as reference points in the different sections.  It is important to understand how the formulas need to be configured and also what the results are once the calculations and logic are put into place.  I think that one of the easiest ones, besides the Excel formula, would have to be SSRS with the RunningValue aggregate function.  I like how this is set up and how it evaluates everything based on the context that is being referenced.  So if we go back to the SSRS report and apply a filter to the tablix to remove 2007 from what is displayed we would end up with the following:

image

Very simple and easy to understand, but that is just my personal opinion.

Posted in Analysis Services, Integration Services, PowerPivot, Reporting Services, SQL Server | Tagged: , , , , | 10 Comments »

Informatica PowerCenter vs. Integration Services (SSIS) Comparison

Posted by denglishbi on June 12, 2010

This past year I got an opportunity to finally take the great Informatica product for a test drive.  I guess it wasn’t really a test drive since this was the ETL tool that was being used for the project, but it was my first time using the product (other than a brief hands-on with it during a vendor selection process when I worked at the casino).  As far as I know, this product hasn’t really changed since I previously saw it, so I was a bit surprised, especially with the rave reviews and with it being a leader according to Gartner – Magic Quadrant for Data Integration Tools November 2009.

Magic Quadrant for Data Integration Tools

Source: Gartner (November 2009)

The version of the product that I used was PowerCenter 8.6 (with a hotfix that provides the Mapping Analyst for Excel feature).  I know there is a new version out now that was recently released, but that is not the version that we were using, and I am almost certain that the client will not be upgrading to the new version going forward (they will most likely be moving to Integration Services).

So let’s get back to my experience with the product.  Just a warning, this by no means is an extensive critique or in-depth review of the entire product, just my usage of it.  In the past I used DataStage and DTS, but I have been using SSIS for the past four years.

What I liked:

  • Versioning – has to be enabled out of the gate, but once turned on you have check-in/check-out versioning capabilities with the product, which is a must in any development environment.  One thing to note is that the file and context menu options were not always available when I wanted them, so the user experience was a bit unpleasant.
  • Metadata repository – a huge plus, I will have to admit.  Being able to search for items and access them from a central tool is very powerful and a nice feature – if you need this.
  • Reusable components – for example, items like sources and targets were centralized and reusable throughout the environment, so if you imported a new table definition of an existing table this would be reflected in any mappings it was being referenced in.
  • Mapping Architect for Visio – a built-in feature even with Standard Edition that allows you to generate a Visio diagram of a mapping.
  • Mapping Analyst for Excel – another built-in feature (once the hotfix is applied) with Standard Edition that allows you to create source-to-target Excel files for mappings, and this is bidirectional for creating mappings as well.
  • Built-in mapping documentation – has to be configured on the server along with security permissions, but a Data Analyzer report gets created with tabular report output of all the components in a mapping.  Not extremely useful, but it is something – no visuals like with the DataStage documentation generator, but you can get that with the Visio generator component.
  • Traceability – you have the ability to select a column (port) in the mapping and trace it back or forward within the mapping to see where it was generated from or where it ends up in the mapping.

What I didn’t like:

  • Too clunky – not a good user experience or development environment.  Menu options were too random or not where expected, and it seemed really buggy at times with context menus not being where I would expect.
  • ODBC data source setup – had to create DSNs for reference and utilize third party providers from DataDirect for connectivity
  • Not intuitive UI at all – even out of the gate when you open the Designer tool, if you want to edit a mapping you first need to make sure you have opened the Mapping Designer tool (you can’t just double-click on the mapping).  Even renaming items was buried at times and confusing – sometimes you could do this if you edited the item through the context menu, other times you had to access it from the file menu.  Even the warning and error icons used were confusing and didn’t provide tooltips – messages were displayed in separate windows.
    • Here is an invalid mapping – informatica mapping invalid
    • Here is a warning message for a mapping – informatica mapping warning
    • Validation message in output window – validate window
  • Confusing layout – within a mapping you can create multiple flows, but then I wondered which source-to-target flow would occur first and in what order they would happen.  Well, needless to say, the mapping execution sequence was buried in the file menu under ‘Mappings’ –> ‘Target Load Plan…’.
  • Too many steps – in order to create an ETL mapping you need to do at least the minimum steps using two separate components: create repository folder, create ODBC DSN (local workstation and eventually on the server for a data source), import a source definition (source analyzer designer), import a target definition (target designer), create a mapping (mapping designer – source, source qualifier (required to define source query and columns – called ports), transforms, router, update, destination) – now if you want to run the mapping you need to open up more components to create a session, then a workflow, and finally you can execute the workflow and monitor the progress.  Wow, just wanted to do a simple dump from source to target and I just ate up half a day.
  • Can’t add annotations – on the mapping design surface you don’t have the ability to simply add a text box to provide any comments or documentation about the flow.  This would all have to be buried in each item in the designer in the description area.  I find this feature in SSIS extremely useful and helpful for development and documentation purposes.
  • Single repository folder support – can only be connected to a single repository folder at a time. Unable to work or view two different mappings located in separate repository folders.
  • No ability to watch the data flow during runtime – you can’t watch the records and progress of a mapping while it is being executed.  There is an ability to monitor stats, but you can’t see what is going on during each stage of the mapping visually like in DataStage or SSIS.  In the Monitor stats window you can only see what is going on for the source and target, not each step of the mapping like the expressions, lookups, routers, etc.
  • Extensibility / Flexibility – just wasn’t there for me. We had to implement some script files in order to create a looping mechanism to re-run a single mapping in a batch mode for historical load purposes – seemed like a basic operation that I could quickly setup in SSIS, but in Informatica I need to place two script files on the server and create two workflows to setup a looping process.  With SSIS I can easily extend the functionality with custom components, script components (.NET code), and open source components (CodePlex.com or SQLIS.com).
  • Trust but verify – as one of my colleagues always told me (I think he knows who he is), you always need to double-check everything.  If you modify a session in a workflow it might not be reflected in the task editor.  I had this happen to me a few times.  Make sure you modify in the task editor and it should reflect properly in the workflow.
  • Viewing source and target data – now that I think about it I always went into an external tool to view my data. I don’t believe this functionality is available in the tool and if it is, I simply overlooked it or it wasn’t pointed out to me. You could validate your source qualifier queries, but that was about it.

Hopping between the different products within PowerCenter seemed really clunky to me.  I guess with SSIS at the most I might have two tools open with Business Intelligence Development Studio and SQL Server Management Studio, but with PowerCenter you would have at least three, if not four if you need to access the Repository.  I also had a really hard time with the flexibility of the tool.  For the price of it I wanted to do some fairly basic things like adding annotations, dynamically setting source file names, looping, passing variable values between mappings, watching a mapping during processing, adding some custom code, etc., and it just wasn’t there for me, or I just wasn’t aware of it.  Granted, I did not go through any formal training and learned the product basically on my own in two days, but I was really blown away by how even the DataStage product that I used over four or five years ago was light years ahead of this product in my opinion, and SSIS is a dream.

Now with SSIS you don’t get a metadata repository or documentation out of the box, but you can get there with external add-ins (most of which are free and open source – surprising that I bring up open source when mentioning Microsoft, but it is free and made available by the community).  Even the Metadata Repository Toolkit was upgraded for SQL 2008, and there are just a ton of SSIS components available on CodePlex along with numerous blog postings, online forum help, video tutorials, virtual labs, and much more – all free, with tons of samples to get you started.

The fact that you get the SQL Server database engine, Integration Services, Reporting Services, and Analysis Services (along with Data Mining) in the SQL Server 2008 product for a fraction of the price of just Informatica PowerCenter just blows me away; I just don’t get the comparison or logic as to why people would spend that kind of money to implement the product.  I can easily overlook some of the advantages of the Informatica product for the flexibility, extensibility, and additional features and functionality that I gain with SSIS any day of the week.  Clearly I was confused as far as the Magic Quadrant goes and couldn’t understand why SSIS was not higher up and pushing behind Informatica and IBM.

Additional references to check out:

Posted in Integration Services | Tagged: | 24 Comments »

PASSMN May 19 SSIS Presentation Follow-up

Posted by denglishbi on May 20, 2009

I have posted this content on the PASSMN user site here – May 19, 2009 – Dan English: SSIS Team Development, Deployment and Configuration, but if you do not have access I have also uploaded the presentation and SSIS demo project files to my SkyDrive area and those are available below.  Here is a link to the PASSMN May speaker content – PASSMN May 19, 2009 Monthly Meeting.

Thanks to everyone that showed up; I really enjoyed the great questions everyone asked and making this interactive.  Don’t forget to download and install the latest release of BIDS Helper if you don’t have it already.  As I stated at the start of the presentation, this content was from a presentation that Dave Pendleton and I did back in January; if you want to check out the complete presentation I have that available here – SQL Server Integration Services – Enterprise Manageability Follow-up.

I really would like to see these meetings turn into more of a collaboration where we can discuss each other’s issues and work on solutions.  I think it would be very valuable to our community for us to interact more and bounce ideas off each other.  After all, we are all dealing with the same products and most likely have either experienced or read about the same types of issues.  If nothing else we might be able to share some insight that might lead to a solution.

http://cid-cd3e77e793df6178.skydrive.live.com/embedrowdetail.aspx/Public/Presentations/PASSMN/200905/SSIS|_PASSMN|_20090519.pdf

http://cid-cd3e77e793df6178.skydrive.live.com/embedrowdetail.aspx/Public/Presentations/PASSMN/200905/SSISDemos.zip

Here are a couple of other links I referenced – the SSIS samples on CodePlex and also the Microsoft Data Warehouse Toolkit Book Website:

All of the resource links are included in the presentation download.  Plus there are a couple of SSIS videos you can watch from last year’s PASS National conference here:

http://www.sqlpass.org/LearningCenter/SummitOnDemand.aspx

Enjoy and let me know if you have any additional questions.  Thanks.

Don’t forget to check out the SQL Server 2008 R2 site and sign up to be notified when the CTP becomes available – http://www.microsoft.com/sqlserver/2008/en/us/r2.aspx.

 

  

Posted in Integration Services | Tagged: , | 1 Comment »

SSIS 2008 ADO NET Source issue with SSAS data source – error code 0x80004002

Posted by denglishbi on March 29, 2009

A few weeks back I upgraded an SSIS 2005 package to SSIS 2008 and ran into an issue with my Data Flow Task source.  In the original package I was using a DataReader source to query information from Analysis Services (SSAS), and during the upgrade this gets converted to an ADO NET source in SSIS 2008.  The reason I was using a DataReader with an ADO.NET connection manager was that you could get around the annoying warning about being unable to map the external data properly and that it will use the DT_WSTR data type when you try to preview the results of your query for the data source.

OLE DB SSAS Warning

I did a blog posting last year Using Reporting Services (SSRS) with SSAS data where I went through some different options of getting SSAS data into a format to easily use with SSRS.  This is when I basically decided that I would use the DataReader over the OLE DB source when using SSAS.  Also, if you use the OLE DB data source and have the Data Flow Task embedded in a Loop container this could cause some issues depending on the connection manager setting (more on this later).

So back to the ADO NET source issue… after I did my upgrade and ran my package I noticed a couple of things with the ADO NET source.  First, you now get a warning message in regards to the data type that I did not get with the DataReader source in SSIS 2005:

Warning    3    Validation warning. ADO Source: {568FF45A-6A41-43F8-A122-4BBFF40DD7BF}: The data type "System.Object" found on column "[Measures].[Reseller Sales Amount]" is not supported for the component "ADO NET Source" (1). This column will be converted to  DT_NTEXT.      Package2.dtsx    0    0   

And now when the package is executed the ADO NET source fails to extract the SSAS data in the Data Flow Task, even though you can preview the data just fine:

image 

image

And the Error messages that I got were not what I would consider helpful in resolving the issue:

[ADO NET Source [43]] Error: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR. The "component "ADO NET Source" (43)" failed because error code 0x80004002 occurred, and the error row disposition on "output column "[Measures].[Reseller Sales Amount]" (67)" specifies failure on error. An error occurred on the specified object of the specified component. There may be error messages posted before this with more information about the failure.


[ADO NET Source [43]] Error: The component "ADO NET Source" (43) was unable to process the data. Pipeline component has returned HRESULT error code 0xC0209029 from a method call.

[SSIS.Pipeline] Error: SSIS Error Code DTS_E_PRIMEOUTPUTFAILED. The PrimeOutput method on component "ADO NET Source" (43) returned error code 0xC02090F5. The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.

I was able to find a KB article that talks about a similar issue when using Oracle (http://support.microsoft.com/kb/959794), so I thought I just needed to install SQL Server CU2 or higher.  I went with the SQL Server 2008 SP1 CTP just to make sure I was using the latest and greatest, and unfortunately this did not resolve my issue.  So what do I do now… post the issue on Microsoft Connect.  I went ahead and logged the bug on Connect here – SSIS 2008 ADO NET Source issue with SSAS data source – error code 0x80004002 – and have already gone back and forth a couple of times with Microsoft about the issue.  You might sense some of my frustration in the bug posting.  I did decide to move on with the development and switch the data source over to an OLE DB source.  This brings up what I consider to be a bug with that data source: when the Data Flow Task is embedded in a Foreach Loop container you need to make sure that you put a Script Task after (or this could be before, I guess) the Data Flow Task that puts the thread to sleep prior to performing the next loop.  If you don’t you will get the following error:

[SSAS OLEDB Cube Source [928]] Error: SSIS Error Code DTS_E_OLEDBERROR.  An OLE DB error has occurred. Error code: 0x80040E05.
An OLE DB record is available.  Source: "Microsoft OLE DB Provider for Analysis Services 2008."  Hresult: 0x00000001  Description: "Error Code = 0x80040E05, External Code = 0x00000000:.".

[SSIS.Pipeline] Error: component "SSAS OLEDB Cube Source" (928) failed the pre-execute phase and returned error code 0xC0202009.

So what do you need to do to get around this issue?  You need to make sure that the ‘RetainSameConnection’ property on the OLE DB connection manager is set to ‘False’.  I also had some initial issues after changing that property value, so I added a Script Task to the Control Flow within the Foreach Loop Container after the Data Flow Task with a single line of code that puts the thread to sleep (yes, I am a fan of Visual Basic, and the C# code is no different for this except that you need to include semicolons):

    Public Sub Main()

        ' Pause the thread for 15 seconds before the next loop iteration
        System.Threading.Thread.Sleep(15000)
        Dts.TaskResult = ScriptResults.Success

    End Sub

Once you have verified the connection manager property and have this added to the Control Flow, the package will run fine.  You still get the annoying warning messages when trying to preview and work with the OLE DB source, but at least the package runs.

Here is the final layout of my SSIS 2008 package:

image

If you would like this issue with the ADO NET source resolved so that you can use it with SSAS, then please vote for the bug here – SSIS 2008 ADO NET Source issue with SSAS data source – error code 0x80004002.

Also, if you want to set up the SQL command for the ADO NET source so that it is dynamic, you can do this within the Control Flow by using an expression on the Data Flow Task for the ADO NET source SqlCommand property (same thing for SSIS 2005 and the DataReader source).  It would be nice if you could just reference a variable within the ADO NET data source editor like you can with the OLE DB source, but oh well.

Posted in Integration Services | Tagged: , | 7 Comments »

Integration Services (SSIS) Package Designing Tips

Posted by denglishbi on February 1, 2009

I recently co-presented at a local event in Bloomington, MN talking about SQL Server Integration Services – Enterprise Manageability.  There was a lot of content to discuss and along the way I mentioned a few items in regards to designing SSIS packages, but I forgot to mention one item in particular – layout.  By package layout I am talking about how the items are placed on the control and data flow along with how they are joined together.

Let’s take a look at an example from the Microsoft Data Warehouse Toolkit Book website.  The package that I will use is DimProduct.dtsx.  Here are a couple of screenshots of the initial control and data flows:

Control Flow

Data Flow

image

image

Look at the item and precedence constraint alignment

Once again, look at the item and data flow alignment

What we see here is that the items that have been placed in the package are not neatly aligned and the precedence constraints and data flow paths are not nice and straight (when they could be).  If I were creating some documentation or handing this off to a customer I would probably want to take a little time to polish it up and make it look a little cleaner.  Doing this by clicking and dragging the items can be extremely difficult (just like it is when creating a Database Diagram in SQL Server Management Studio).  So how can you do this more effectively?  Easy, enable the ‘Layout’ toolbar within Business Intelligence Development Studio (BIDS).  You can do this by right-clicking on the toolbar area and selecting ‘Layout’, or through the menu by selecting View—>Toolbars—>Layout.

image

Now by selecting items within either of the flows I can utilize the different types of options in the ‘Layout’ toolbar to nicely center, top, and middle align the items and make the package layout look like the following:

Control Flow

Data Flow

image

image

Here the items have been center aligned, the annotation is top aligned with the top item, the precedence constraints are nice and straight, and I have enabled the precedence constraint labels.

Once again you can see that certain items have been center aligned, middle aligned, and top aligned, the data flow lines are nice and straight (when they can be), and items have been properly sized to display all of the text (source and annotation items).

This is kind of a nitpicky thing, but now that the items have been nicely aligned the package looks really clean.  Make sure that you properly size items so that you can read the names and annotations properly too.  This makes it much easier to read without having to dig into the properties.  You can also see that in the control flow I have enabled the option to display the precedence constraint labels.  This makes it very easy to see what is going on and is helpful for people that have color blindness and have difficulty distinguishing between the green and red lines.  To enable the labels you need to go into the options in BIDS (Tools—>Options—>Business Intelligence Designers—>Integration Services Designers) and check the box ‘Show precedence constraint labels’ in the Accessibility section.

image 

In review, here are the items that I would suggest when designing packages:

  1. Enable ‘Layout’ Toolbar
  2. Align items in the control and data flows
  3. Enable ‘Show precedence constraint labels’ (check out the update below in the posting for additional property setting)
  4. Properly size items so that all text is readable

On a side note, effectively naming the items and including annotations are other best practices that should always be followed when creating SSIS packages.  I would have included some additional annotations in the package above to add more explanation of what the package is actually doing.  To add annotations you right-click within the control or data flow and select ‘Add Annotation’.  To add additional lines in the annotation press CTRL+Enter.  If you want to watch a quick video on documenting SSIS packages take a look at this one by the Midnight DBA – Commenting Tribal Knowledge in SSIS Packages.

Additional design best practices to add to the list above would be:

  1. Effectively name control and data flow items
  2. Include annotations

One last thing…hopefully in the future snap lines will be implemented in the SSIS designer.  They have been part of the Visual Studio development environment since Visual Studio 2005 and were implemented in SSRS 2008, so why not across the entire toolset?  Hmmm…

UPDATE (10/17/2009): Jamie Thomson did a recent posting about setting the ShowAnnotation property of the precedence constraints to ‘ConstraintOptions’.  This will show the label and also any expressions that are being used, which is very valuable information and improves the self-documenting capabilities of SSIS so that you don’t have to manually enter this information into an annotation.  Check out the posting here: Things you might not know about Precedence Constraints : SSIS Nugget.

Posted in Integration Services | Tagged: | 6 Comments »

SQL Server Integration Services – Enterprise Manageability Follow-up

Posted by denglishbi on January 26, 2009

Last Thursday David Pendleton, Magenic Senior Consultant, and I got a unique opportunity to do a presentation at our local Microsoft office in Bloomington, MN to approximately 65 attendees.  I was really happy with the great turnout, and we had some really good questions asked during and after each presentation.  I want to thank everyone that attended and also provide everyone access to our materials.  I have uploaded them all to my SkyDrive account and they are available here to download – SSIS Enterprise Manageability.  You will find a copy of the presentation material and the demos for review.

I also wanted to include some additional source information for reference here if anyone is interested in doing some additional reading:

Managing and Deploying SQL Server Integration Services

What’s new in SQL Server 2008 for SSIS

dtexec.exe what are you doing? (PowerShell monitoring example)

Loading and Running a Remote Package Programmatically

How to: Implement a Signing Policy by Setting a Registry Value

If you are interested in taking a look at some videos I have a couple that I have made in regards to the Data Profiling Task and also the Cache Transform.  I do have more videos available for viewing on SoapBox and YouTube.

If you have any additional questions about the material that was presented feel free to drop me a line through my blog and I will follow up with you.  Don’t forget a few of the tips that I pointed out in regards to including –nosplash in your shortcuts, setting the show precedence constraint labels in the Tools Options, and also that the default scripting language (C# or Visual Basic) is set in the Tools Options under the Integration Services Designers (under Business Intelligence Designers).  Don’t forget to download and install BIDS Helper too – a must-have if you are doing SSIS or SSAS development.

image    image

Another great source for getting information in regards to SSIS questions is to take a look at the MSDN forums here – SQL Server Integration Services.  And you never know, I might be the one responding.

I also want to thank everyone that made this possible, and once again a special thank you to everyone that attended and made this a great event!

Forgot to mention two other resources that I mentioned during the presentation:

The Microsoft Data Warehouse Toolkit Book Website

The Project REAL download (Project REAL main site: Project REAL – Business Intelligence in Practice)

Enjoy!

Posted in Integration Services | Tagged: , | 1 Comment »

Did you know? SQL Server 2008 includes Cache Transform screencast

Posted by denglishbi on April 15, 2008

SQL Server 2008 includes Cache Transform:

The Cache Transform transformation writes data from a connected data source in the data flow to a Cache connection manager.  The cache can be populated from text files, XML files, Excel files, etc.; you are not limited to only populating the cache from an OLE DB source.  The Lookup transformation in a package performs lookups on the data using the Cache connection manager.

You can configure the Cache connection manager to save the data to a cache file (.caw).  The cache file can be populated once and reused throughout the ETL process within multiple Lookup transformations for performing surrogate key lookups and populating the data warehouse.

Read more about this in the SQL Server 2008 BOL ‘Cache Transform’.

Check out a screencast about the new Cache Transform feature in SQL Server 2008 here.

Did you know? SQL Server 2008 includes Cache Transform from Dan English on Vimeo.

If you are interested in additional information in regards to this topic check out these other blog postings and sites: 

You can also check out a video on YouTube done by Jamie Thomson in regards to this new transformation here.  The last time I viewed it there wasn’t any audio, just video.

You can download the SSIS samples from CodePlex and take a look at a sample utilizing the cache transform here.

Posted in Integration Services | Tagged: , , | 2 Comments »

Did you know? SQL Server 2008 includes Data Profiling Task screencast

Posted by denglishbi on April 6, 2008

SQL Server 2008 includes Data Profiling Task:

The Data Profiling task is a new task in the Integration Services toolbox. You can use this task inside an Integration Services package to profile data that is stored in SQL Server. The information provided by the profile helps you identify potential problems with data quality. The Data Profiling task provides profiles that help identify data quality problems within individual columns and with column relationships:

Profiles that help identify problems within individual columns:

  • The distribution of lengths in the column values.
  • The percentage of null values.
  • The distribution of values in the column.
  • Column statistics for numeric columns.
  • Regular expressions that match string columns.

Profiles that help identify problems with column relationships:

  • Candidate key columns.
  • Functional dependencies between columns.
  • The inclusion of the set of values in one column in the set of values in another column.

Read more about this in the SQL Server 2008 BOL ‘Data Profiling Task’.

Check out a screencast about the new Data Profiling Task feature in SQL Server 2008 here.

Did you know? SQL Server 2008 includes Data Profiling Task from Dan English on Vimeo.

If you are interested in additional information in regards to this topic check out these other blog postings:

http://www.sqlskills.com/blogs/stacia/2007/11/27/SQLServer2008DataProfilingTaskTheQuickVersion.aspx

OR

SSIS: Data Profiling Task: Part 1 – Introduction
Data Profiling Task: Part 2 – Column Length Distribution
Data Profiling Task: Part 3 – Column Null Ratio
Data Profiling Task: Part 4 – Column Pattern
Data Profiling Task: Part 5 – Column Statistics

Data Profiling Task: Part 6 – Column Value Distribution
Data Profiling Task: Part 7 – Functional Dependency
Data Profiling Task: Part 8 – Candidate Key
Data Profiling Task: Part 9 – Value Inclusion

SSIS: Data Profiling Task: Part 10 – Parsing the output

and if you want to know how to dynamically profile tables check out this blog posting which includes the code here.

Posted in Integration Services | Tagged: , , | 1 Comment »