February 2011
S	M	T	W	T	F	S
	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28

Archive for February, 2011

SSIS Data Flow vs. Hand Coded ETL

Posted by denglishbi on February 27, 2011

[tweetmeme source=”denglishbi” only_single=”false”]

“Why do I need an ETL tool?”

“Why would I want to use Integration Services (SSIS)?”

“Why use a source-to-target setup in ETL process?”

It is questions and statements like this that I hear often. These questions can generate lots of discussion and there are many different topics that you could go into in regards to the benefits of using an ETL tool or just pure coded stored procedures or T-SQL. Some of the benefits you can get with ETL tools are:

control flow,
transformations,
logging,
visual representation of work,
more options than just source-to-target,
reusability,
audit details,
configurations,
memory management,
etc.

I could go on, but this is not the topic for this blog posting. What I want to discuss is a scenario comparing a potential consequence of using stored procedure or T-SQL ETL over what I would consider more standard ETL using source and target components. The other item that I want to include and point out in this example is the use of ‘SELECT * FROM’ and its impact. In all of these examples we will be leveraging a database view for our source and this is simply used to add an additional layer of abstraction and reference point instead of using the underlying table directly.

Situation

Let me set the stage for this. In the data warehouse we need to load Employee information to reference. For simplicity we are not interested in tracking any sort of changes over time and just want a current view of the Employees. All we want to do is take the information from one table and completely refresh a data warehouse table that we will use to deliver information to our end-users.

Development

Piece of cake. Simply delete the data from our destination data warehouse table, select data from the employee source table, and insert the source contents into the destination. We have a planning meeting and discuss the different options that could go with to perform this operation – stored procedure, T-SQL, or SSIS. Sure there are other options that we could possibly consider and for what it is worth, the first two are the same, just determines where you want the logic to sit.

So we decide to evaluate the different options. In this example we are going to set all three options up in an SSIS package to evaluate.

Option 1: Stored Procedure

The first one is to call a stored procedure where the logic for the load will reside. We create the stored procedure:

<br>CREATE PROCEDURE [dbo].[usp_DWRefresh] AS<br>BEGIN<br>SET NOCOUNT ON;<br>--clear out DW employee table<br>TRUNCATE TABLE DWEmployee;<br>--reload DW employee table<br>INSERT INTO DWEmployee<br>SELECT * FROM v_Employee;<br>END<br>

And then in the SSIS package we utilize a Execute SQL Task to call the stored procedure:

Ok, that one is done, now on to the second option using T-SQL.

Option 2: T-SQL

In this example we are simply going to use a Execute SQL Task in SSIS and place the logic that was in the stored procedure directly in the task, so now the logic is embedded in the package.

Nothing fancy going on here, simply truncate table and insert into table select * from…

Now we go on to the final option that we are going to look at, using a Data Flow Task.

Option 3: Data Flow Task (SSIS)

In this final option we are going to compare we are going to continue to use the ‘SELECT * FROM’ logic for the source, but this time we are going to include this logic in a Data Flow Task and use a source-to-target ETL setup leveraging components.

In this package we first use an Execute SQL Task to clear out the data in our destination table using the truncate table logic that was included in the T-SQL example. Next we use a Data Flow Task to transfer the data from our source and load it into our target (destination) table that we will use in our data warehouse.

OLE DB Source setup:

OLE DB Destination setup:

One thing to point out here is that even though we are using the ‘SELECT * FROM’ setup for the source the column names are still being set and mapped directly to the destination table. This is important and you will see why as we continue on in our evaluation.

Evaluation Process

Okay, the stage is set and we are ready to test everything out. We run the package with the different options:

Everything looks great.

Gotcha

But wait, nothing stays this simple or never changes right? Right. Now we have a user that decides that they would like to include a new column in v_Employee that provides them the full name of the employee combining the employee’s last name and first name together. Not a problem and makes complete sense to have this setup in the source to reference and centralize this logic. The new column is setup in the view and saved with the following highlighted logic below added.

What does this do to our logic that we setup?