Friday, March 12, 2010

SSIS - Lookup Cache Modes - Full, Partial, None

There are many, many resources out on the ‘Net regarding SSIS and the Lookup component and what each of its cache modes are and how to implement them in your own package. This is going to be a technical post, for those of you interested in what each cache mode does behind the scenes.


For this post, use the following schema and data:
create table fact_sales
(id int identity(1,1),
sales_rep_id int,
sales_dollars decimal(18,2)
)

create table dim_sales_rep
( id int identity(1,1),
first_name varchar(30),
last_name varchar(50)
)

insert into fact_sales (sales_rep_id, sales_dollars) values (1,120.99);
insert into fact_sales (sales_rep_id, sales_dollars) values (2,24.87);
insert into fact_sales (sales_rep_id, sales_dollars) values (3,98.11);
insert into fact_sales (sales_rep_id, sales_dollars) values (4,70.64);
insert into fact_sales (sales_rep_id, sales_dollars) values (4,114.19);
insert into fact_sales (sales_rep_id, sales_dollars) values (4,37.00);
insert into fact_sales (sales_rep_id, sales_dollars) values (5,401.50);

insert into dim_sales_rep (first_name, last_name) values ('K','S');
insert into dim_sales_rep (first_name, last_name) values ('S','D');
insert into dim_sales_rep (first_name, last_name) values ('Larry','White');
insert into dim_sales_rep (first_name, last_name) values ('Carrie','Green');
insert into dim_sales_rep (first_name, last_name) values ('Adam','Smith');


FULL Cache Mode
First, it is always advisable to build a query for the lookup, instead of choosing a table in the Table/View drop-down. The primary reason is so that you can limit the resultset to only the columns needed to perform the lookup as well as return any columns needed downstream, and to have the ability to add a WHERE clause if needed.

The full cache mode will run the specified query (or its own depending on how you assigned the lookup table) and attempt to cache all of the results. It will execute this query very early on in the package execution to ensure that the first set of rows coming out of the source(s) are cached. If SSIS runs out of memory on the machine though, the data flow will fail as the lookup component will not spool its memory overflow to disk. Be cautious of this fact. Once the data is cached, the lookup component will not go back to the database to retrieve its records, so long as the data flow is not restarted. (In SQL Server 2008, you can now reuse lookup caches.)

Using SQL Profiler, you can see that only one database call is made:
declare @p1 int
set @p1=1
exec sp_prepare @p1 output,NULL,N'select sales_rep_id, sales_dollars
from fact_sales',1
select @p1
go
exec sp_execute 1
go
SET NO_BROWSETABLE ON
go
declare @p1 int
set @p1=1
exec sp_prepare @p1 output,NULL,N'select id, first_name, last_name
from dim_sales_rep',1
select @p1
go
exec sp_execute 1
go
exec sp_unprepare 1
go
exec sp_unprepare 1
go


PARTIAL Cache Mode
Partial cache mode will not execute a query immediately at package execution. Instead, it will wait until its first input row arrives. Once the row arrives, whatever lookup value (in this case, sales_rep_id) is being passed in, will get substituted for a parameter, and then SSIS will send the query to the database for retrieval. At this point, all of the data returned will be cached for future lookups. If a new sales_rep_id is encountered, then the query will have to be re-executed, and the new resultset will get added to the lookup cache.

In other words, in the above data, if my source is “select sales_rep_id, sales_dollars from fact_sales”, we should have five database calls made by the lookup component. Even though for sales_rep_id = 4 we have three entries, in partial cache mode the first time we retrieve the lookup records for sales_rep_id = 4, the results will be cached, allowing future occurrences of sales_rep_id = 4 to be retrieved from cache.

This is illustrated in the SQL Profiler data:
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',1
go
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',2
go
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',3
go
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',4
go
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',5
go
exec sp_unprepare 1
go

In the above data, you can see at the end each sales_rep_id being passed in. Note that we only have one line for sales_rep_id = 4. That’s because the remaining two records were bounced against the lookup cache, avoiding a trip to the database.
NO Cache Mode
Using the NO Cache Mode will essentially tell SSIS that you want each incoming row (from fact_sales in this case) to be bounced against the database. Since we have seven fact_sales rows, we will see seven calls to the database - MOST of the time. It is important to note that even though we are telling the lookup component to avoid caching rows, it will keep the last match in memory and use it for the next comparison. If the next comparison’s key value matches the value still in memory, a database call is avoided, and the value is carried forward.

In our example data above, if we sort by sales_rep_id, we will still only have five calls to the database because after we lookup our first value of sales_rep_id = 4, it will be reused for the subsequent lookups for sales_rep_id = 4. If we sort our data by sales_dollars, we will have six database calls, because only two sales_rep_id = 4 records are together and hence the first lookup is only used once.

Here is a simple table illustrating each no cache example mentioned above:
SALES_REP_ID, SALES_DOLLARS, LOOKUP DATABASE CALL Y or N
1 120.99 Y
2 24.87 Y
3 98.11 Y
4 70.64 Y
4 114.19 N
4 37.00 N
5 401.50 Y

SALES_REP_ID, SALES_DOLLARS, LOOKUP DATABASE CALL Y or N
2 24.87 Y
4 37.00 Y
4 70.64 N
3 98.11 Y
4 114.19 Y
1 120.99 Y
5 401.50 Y

The SQL Profiler data for the second example above is here:
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',2
go
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',4
go
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',3
go
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',4
go
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',1
go
exec sp_executesql N'select * from (select id, first_name, last_name
from dim_sales_rep) [refTable]
where [refTable].[id] = @P1',N'@P1 int',5
go

No comments:

Post a Comment