SSIS Data Flow vs. Hand Coded ETL
Posted by denglishbi on February 27, 2011
“Why do I need an ETL tool?”
“Why would I want to use Integration Services (SSIS)?”
“Why use a source-to-target setup in ETL process?”
I hear questions like these often. They can generate a lot of discussion, and there are many topics you could get into regarding the benefits of using an ETL tool versus pure stored procedures or T-SQL. Some of the benefits you get with ETL tools are:
- control flow
- visual representation of the work
- more options than just source-to-target
- audit details
- memory management
I could go on, but that is not the topic for this blog posting. What I want to discuss is a scenario comparing a potential consequence of using stored procedure or T-SQL ETL versus what I would consider a more standard ETL setup using source and target components. The other item I want to point out in this example is the use of ‘SELECT * FROM’ and its impact. In all of these examples we will be leveraging a database view for our source; this simply adds an additional layer of abstraction and a reference point instead of using the underlying table directly.
Let me set the stage for this. In the data warehouse we need to load Employee information to reference. For simplicity we are not interested in tracking any sort of changes over time and just want a current view of the Employees. All we want to do is take the information from one table and completely refresh a data warehouse table that we will use to deliver information to our end-users.
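As a point of reference, here is a minimal sketch of the objects involved. The column names and types are illustrative assumptions, not the actual schema:

```sql
-- Source table (column names are illustrative).
CREATE TABLE Employee (
    EmployeeID  int          NOT NULL PRIMARY KEY,
    FirstName   nvarchar(50) NOT NULL,
    LastName    nvarchar(50) NOT NULL
);
GO

-- Abstraction layer over the underlying table.
CREATE VIEW v_Employee AS
SELECT EmployeeID, FirstName, LastName
FROM Employee;
GO

-- Destination table in the data warehouse, mirroring the view.
CREATE TABLE DWEmployee (
    EmployeeID  int          NOT NULL,
    FirstName   nvarchar(50) NOT NULL,
    LastName    nvarchar(50) NOT NULL
);
```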
Piece of cake. Simply delete the data from our destination data warehouse table, select the data from the employee source table, and insert the source contents into the destination. We have a planning meeting and discuss the different options we could go with to perform this operation – stored procedure, T-SQL, or SSIS. Sure, there are other options we could consider, and for what it is worth, the first two use the same logic; the choice just determines where you want that logic to sit.
So we decide to evaluate the different options. In this example we are going to set all three options up in an SSIS package to evaluate.
Option 1: Stored Procedure
The first one is to call a stored procedure where the logic for the load will reside. We create the stored procedure:
```sql
CREATE PROCEDURE [dbo].[usp_DWRefresh] AS
BEGIN
    SET NOCOUNT ON;

    -- clear out DW employee table
    TRUNCATE TABLE DWEmployee;

    -- reload DW employee table
    INSERT INTO DWEmployee
    SELECT * FROM v_Employee;
END
```
And then in the SSIS package we utilize an Execute SQL Task to call the stored procedure:
Ok, that one is done, now on to the second option using T-SQL.
Option 2: T-SQL
In this example we are simply going to use an Execute SQL Task in SSIS and place the logic that was in the stored procedure directly in the task, so now the logic is embedded in the package.
Nothing fancy going on here – simply truncate the table and insert into table select * from…
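In T-SQL terms, the statement embedded in the Execute SQL Task is the body of the stored procedure, nothing more:

```sql
-- Same logic as the stored procedure, now embedded in the package.
TRUNCATE TABLE DWEmployee;

INSERT INTO DWEmployee
SELECT * FROM v_Employee;
```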
Now we go on to the final option that we are going to look at, using a Data Flow Task.
Option 3: Data Flow Task (SSIS)
In this final option we are going to continue to use the ‘SELECT * FROM’ logic for the source, but this time we are going to include it in a Data Flow Task and use a source-to-target ETL setup leveraging components.
In this package we first use an Execute SQL Task to clear out the data in our destination table using the truncate table logic that was included in the T-SQL example. Next we use a Data Flow Task to transfer the data from our source and load it into our target (destination) table that we will use in our data warehouse.
OLE DB Source setup:
OLE DB Destination setup:
One thing to point out here is that even though we are using the ‘SELECT * FROM’ setup for the source the column names are still being set and mapped directly to the destination table. This is important and you will see why as we continue on in our evaluation.
Okay, the stage is set and we are ready to test everything out. We run the package with the different options:
Everything looks great.
But wait, nothing stays this simple or never changes, right? Right. Now a user decides they would like to include a new column in v_Employee that provides the full name of the employee, combining the employee’s last name and first name. Not a problem, and it makes complete sense to set this up in the source to reference and centralize this logic. The new column is set up in the view and saved with the highlighted logic below added.
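A hypothetical version of that change in T-SQL (the exact concatenation format is an assumption):

```sql
-- The view now exposes a computed FullName column
-- in addition to the original columns.
ALTER VIEW v_Employee AS
SELECT EmployeeID, FirstName, LastName,
       LastName + ', ' + FirstName AS FullName
FROM Employee;
```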
What does this do to our logic that we setup?
Option 1: Stored Procedure
Process breaks with the following message
[Execute SQL Task] Error: Executing the query “usp_DWRefresh” failed with the following error: “Column name or number of supplied values does not match table definition.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly.
Option 2: T-SQL
Process breaks with the following message
[Execute SQL Task] Error: Executing the query ” TRUNCATE TABLE DWEmployee;
INSERT INT…” failed with the following error: “Column name or number of supplied values does not match table definition.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly.
Option 3: Data Flow Task (SSIS)
Process continues to work with the following warning/informational message
[OLEDB_SRC_v_Employee ] Warning: The external columns for component “OLEDB_SRC_v_Employee” (1) are out of synchronization with the data source columns. The column “FullName” needs to be added to the external columns.
Interesting… Ok, so why would I want to use an ETL tool and, even more than that, a standard source-to-target data process? The data flow setup continues to work because during setup the individual columns get referenced in the flow and mapped in the destination. If either the source or the target has additional columns added, that is okay. We will be alerted to this, but it will not break our process. Items that could potentially impact our process would be data type changes or modified column names.
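In T-SQL terms, the data flow behaves roughly like an insert with an explicit column list, which is why the extra view column is harmless (column names here are the illustrative ones used earlier):

```sql
-- Equivalent of the data flow's column-by-column mapping:
-- only the columns mapped in the destination are referenced,
-- so a new column on v_Employee does not break the load.
INSERT INTO DWEmployee (EmployeeID, FirstName, LastName)
SELECT EmployeeID, FirstName, LastName
FROM v_Employee;
```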
Maybe this is not entirely fair, because who would ever go with a ‘SELECT * FROM’ setup to load a destination table? I mean, our sources and targets never change, right? Well, I have seen it, and when you are in a crunch you just might put something like this together without realizing the impact it might have down the road. Maybe you set this logic up just to test something out, and we know what happens to test and proof-of-concept work, right… it sometimes gets moved to production.
So please be careful and really evaluate not only the logic you use when loading data, but also the option and tool you will use for the process. Tools like Integration Services (SSIS) have tons of benefits that far outweigh the alternatives.
Some additional postings and articles that further expand on the reasons to leverage ETL, or help explain the reasoning and purposes, are below:
- SSIS: Data flows or T-SQL (Jamie Thomson)
- Advantages of using SSIS packages over stored procedures?
- Kimball University: Should You Use An ETL Tool? (Joy Mundy)