Dan English's BI Blog

Welcome to my random thoughts in regards to Business Intelligence, databases, and other technologies

SSIS Data Flow vs. Hand Coded ETL

Posted by denglishbi on February 27, 2011

[tweetmeme source=”denglishbi” only_single=”false”]

“Why do I need an ETL tool?”

“Why would I want to use Integration Services (SSIS)?”

“Why use a source-to-target setup in ETL process?”

It is questions and statements like this that I hear often.  These questions can generate lots of discussion and there are many different topics that you could go into in regards to the benefits of using an ETL tool or just pure coded stored procedures or T-SQL.  Some of the benefits you can get with ETL tools are:

  • control flow,
  • transformations,
  • logging,
  • visual representation of work,
  • more options than just source-to-target,
  • reusability,
  • audit details,
  • configurations,
  • memory management,
  • etc.

I could go on, but this is not the topic for this blog posting.  What I want to discuss is a scenario comparing a potential consequence of using stored procedure or T-SQL ETL over what I would consider more standard ETL using source and target components.  The other item that I want to include and point out in this example is the use of ‘SELECT * FROM’ and its impact.  In all of these examples we will be leveraging a database view for our source and this is simply used to add an additional layer of abstraction and reference point instead of using the underlying table directly.

Situation

Let me set the stage for this.  In the data warehouse we need to load Employee information to reference.  For simplicity we are not interested in tracking any sort of changes over time and just want a current view of the Employees.  All we want to do is take the information from one table and completely refresh a data warehouse table that we will use to deliver information to our end-users.

Development

Piece of cake.  Simply delete the data from our destination data warehouse table, select data from the employee source table, and insert the source contents into the destination.  We have a planning meeting and discuss the different options that could go with to perform this operation – stored procedure, T-SQL, or SSIS.  Sure there are other options that we could possibly consider and for what it is worth, the first two are the same, just determines where you want the logic to sit.

So we decide to evaluate the different options.  In this example we are going to set all three options up in an SSIS package to evaluate.

Option 1: Stored Procedure

The first one is to call a stored procedure where the logic for the load will reside.  We create the stored procedure:

<br>CREATE PROCEDURE [dbo].[usp_DWRefresh] AS<br>BEGIN<br>SET NOCOUNT ON;<br>--clear out DW employee table<br>TRUNCATE TABLE DWEmployee;<br>--reload DW employee table<br>INSERT INTO DWEmployee<br>SELECT * FROM v_Employee;<br>END<br>

And then in the SSIS package we utilize a Execute SQL Task to call the stored procedure:

image

image

Ok, that one is done, now on to the second option using T-SQL.

Option 2: T-SQL

In this example we are simply going to use a Execute SQL Task in SSIS and place the logic that was in the stored procedure directly in the task, so now the logic is embedded in the package.

image

image

Nothing fancy going on here, simply truncate table and insert into table select * from…

Now we go on to the final option that we are going to look at, using a Data Flow Task.

Option 3: Data Flow Task (SSIS)

In this final option we are going to compare we are going to continue to use the ‘SELECT * FROM’ logic for the source, but this time we are going to include this logic in a Data Flow Task and use a source-to-target ETL setup leveraging components.

image

In this package we first use an Execute SQL Task to clear out the data in our destination table using the truncate table logic that was included in the T-SQL example.  Next we use a Data Flow Task to transfer the data from our source and load it into our target (destination) table that we will use in our data warehouse.

image

OLE DB Source setup:

imageimage

OLE DB Destination setup:

image image

One thing to point out here is that even though we are using the ‘SELECT * FROM’ setup for the source the column names are still being set and mapped directly to the destination table.  This is important and you will see why as we continue on in our evaluation.

Evaluation Process

Okay, the stage is set and we are ready to test everything out.  We run the package with the different options:

image

Everything looks great.

Gotcha

But wait, nothing stays this simple or never changes right?  Right.  Now we have a user that decides that they would like to include a new column in v_Employee that provides them the full name of the employee combining the employee’s last name and first name together.  Not a problem and makes complete sense to have this setup in the source to reference and centralize this logic.  The new column is setup in the view and saved with the following highlighted logic below added.

image

What does this do to our logic that we setup?

Option 1: Stored Procedure

image

Process breaks with message

Error Message:

[Execute SQL Task] Error: Executing the query “usp_DWRefresh” failed with the following error: “Column name or number of supplied values does not match table definition.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly.

Option 2: T-SQL

image

Process breaks with the following message

Error Message:

[Execute SQL Task] Error: Executing the query ”    TRUNCATE TABLE DWEmployee;
  
    INSERT INT…” failed with the following error: “Column name or number of supplied values does not match table definition.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly.

Option 3: Data Flow Task (SSIS)

image

Process continues to work with the following warning/informational message

Warning Message:

[OLEDB_SRC_v_Employee [1]] Warning: The external columns for component “OLEDB_SRC_v_Employee” (1) are out of synchronization with the data source columns. The column “FullName” needs to be added to the external columns.

Interesting…  Ok, so why would I want to use an ETL tool and even more than that, a standard source-to-target data process?  The data flow setup continues to work because during the setup the individual columns get referenced in the flow and mapped in the destination.  If either the source or target have additional columns added that is okay.  We will be alerted of this, but it will not break our process.  Items that could potentially impact our process would be if data types change or column names are modified.

Conclusion

Maybe this is not being too fair, because who would ever go with a ‘SELECT * FROM’ setup to load a destination table.  I mean our sources and targets never change, right?  Well I have seen it and when you are in a crunch you just might put something like this together without realizing the impact it might have down the road.  Maybe you set this logic up just to test something out and we know what happens to test and proof of concept work right…they sometimes get moved to production.

So please be careful and really evaluation not only the logic you use when loading data, but also the option and tool you will use for the process.  Tools like Integrations Services (SSIS) have tons of benefits that far out weigh the alternatives.

Some additional postings and articles that you might want to look at are below that further expand on the reason to leverage ETL or help explain reasoning and purposes:

Advertisements

4 Responses to “SSIS Data Flow vs. Hand Coded ETL”

  1. Andy P said

    You’re right… this isn’t being fair. What you describe is not really an advantage of SSIS over T-SQL: it’s just another example of why you should avoid SELECT * in production code.

    If you remove or rename a column on the view you’re selecting from – even if that column isn’t actually being mapped to anything in the destination table – then all three of the example solutions will break. An explicit list of columns will protect all three solutions from this problem.

    There may be advantages to using SSIS – but in my opinion, this is not really one of them.

    • denglishbi said

      Completely agree, but it does happen. Explicit list of columns is definitely the way to go, no doubt about that. Was simply going over a situation I encountered to show how SSIS would have prevented the issue.

  2. Chris said

    This technique seems to work well for a Truncate-Reload process. I tend to write Change-Data-Capture as seperate Delete, Update, and Insert steps, which doesn’t seem to have direct support in SSIS – in other words, a sproc still has to be used for the Delete and Update steps… so why not use a sproc for the Insert step as well? Other than that, agreed that SSIS is better laid out logically.

    • denglishbi said

      Typically you wouldn’t be doing a delete in a data warehouse scenario, but for updates you would want to use a stored procedure or at least a SQL Execute Task in SSIS to perform a set based update instead of using the OLE DB Command task which does the row-by-row update.

      On the insert side of the world the reason I would use SSIS is so that it pulls in the row metadata and maps the data accordingly. Using SSIS for additional items like transformations makes it more logically to follow instead of burying the logic in a stored procedure for someone to try and reverse engineer. I have seen some very complex stored procedures which are just awful to try and follow the logic and of course have no comments included.

      The other thing is that in the example I displayed if someone uses a SELECT * FROM, which is a big no-no, you can protect yourself to ensure that only the columns that you intended to insert are being loaded into your destination table.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: