Tracking History with Slowly Changing Dimensions

Posted on September 24, 2011




In this post I’d like to show a few of the different ways to maintain history. As anyone who has been in data warehousing for a while can attest, the two most common scenarios business users want to see are the data as it was (Kimball Slowly Changing Dimension Type 2), or as it is now (Kimball Slowly Changing Dimension Type 1). For example, how did the sales figures look under the sales team hierarchy as they occurred, versus under the current hierarchy?

Tracking History

Date Ranges

The common way to track changes in an Inmon or 3rd Normal Form data warehouse is to use the Natural/Business Key combined with a start and end date.

So you may end up with a structure looking like this:

Fig 1

Any changes are tracked by joining on the Business Key, in this case the CustomerLabel, where the DateKey in the fact table is between the Start and End dates of the Customer Dimension.
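To make that concrete, here is a sketch of such a range join. The table and column names (CustomerLabel, StartDt, EndDt) are assumed from the structures used later in this post:

```sql
-- Sketch of a date-range join: match each fact row to the version of the
-- customer that was in effect on the day of the sale.
SELECT c.Location, SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactOnlineSales_Range f
INNER JOIN dbo.DimCustomer_Range c
    ON f.CustomerLabel = c.CustomerLabel
   AND f.DateKey BETWEEN c.StartDt AND c.EndDt
GROUP BY c.Location;
```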

The problem with joins based on ranges is that they are slower than a simple one-to-one join. If you have snowflaked your schema, each additional table adds complexity (you must ensure the correct start date falls between the start and end dates of the snowflaked dimension) and decreases performance further.

As I noted in a previous post https://jsimonbi.wordpress.com/2011/05/16/dimensional-modeling-worst-practices/, having natural keys in your fact table may also incur a performance penalty, because fewer rows are returned per page.

Surrogate Keys

Advocated by Ralph Kimball and his group, surrogate keys bring many benefits (see here http://www.kimballgroup.com/html/articles_search/articles1998/9805d05.html ), of particular importance being improved query performance. Surrogate keys should be an integer type, with the size dependent on your data volumes. The bigger your fact table volumes, the more you want to ensure that your surrogate keys are only as big as required. So if you have a dimension with only ten values and little chance of new ones, and many millions or billions of rows in your fact table, use a TinyInt.
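As a sketch of what that sizing decision looks like in DDL (the table and column names here are purely illustrative):

```sql
-- A tiny dimension with a handful of stable members: TINYINT (1 byte,
-- values 0-255) is ample for the surrogate key, and every fact row
-- carrying it is a byte narrower than with an INT.
CREATE TABLE dbo.DimOrderStatus
(
    OrderStatusKey   TINYINT     NOT NULL PRIMARY KEY,
    OrderStatusLabel VARCHAR(20) NOT NULL
);

-- A large, frequently changing dimension such as customer would use
-- INT (4 bytes) instead, or BIGINT if row counts demand it.
CREATE TABLE dbo.DimCustomer_SK
(
    CustomerKey   INT         NOT NULL PRIMARY KEY,
    CustomerLabel VARCHAR(20) NOT NULL,
    Location      VARCHAR(50) NOT NULL
);
```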

The below diagrams explain how a surrogate key works:

Fig 2

In Fig 2 you can see that Clark Kent has a surrogate key of 3 and that sales records for Clark are associated with him via the CustomerKey.

Fig 3

In Fig 3 you can see that Clark has moved from VIC to NSW, so a new record has been created, and any sales from that point forward will use the latest surrogate key.
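The ETL step behind a change like Clark's move can be sketched as the classic Type 2 pair of statements below. The specific key values, dates, and the 9999-12-31 open-ended EndDt convention are assumptions for illustration:

```sql
-- Step 1: expire the current version of the member.
UPDATE dbo.DimCustomer_SK
SET EndDt = '2011-09-23',
    IsCurrent = 0
WHERE CustomerLabel = 'CK001'   -- hypothetical natural key for Clark
  AND IsCurrent = 1;

-- Step 2: insert a new row with a new surrogate key and the changed
-- attribute; subsequent fact rows will carry CustomerKey = 4.
INSERT INTO dbo.DimCustomer_SK
    (CustomerKey, CustomerLabel, Name, Location, StartDt, EndDt, IsCurrent)
VALUES
    (4, 'CK001', 'Clark Kent', 'NSW', '2011-09-24', '9999-12-31', 1);
```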

Fig 4 is an example of the structure that will be used to provide comparative performance numbers later in this post.

Fig 4

 

Sequences

I’m currently working in a large Teradata environment, which uses a 3rd Normal Form data model. To improve performance we have created conformed dimensions, and used sequences in place of the date range to retain history and improve performance when joining to the fact table.
The table structure is illustrated in Fig 5:

Fig 5

To better illustrate how the sequence table would be populated, see Fig 6:

Fig 6

To generate the sequence number we can use a simple ROW_NUMBER() windowing function as illustrated here:


SELECT CustomerLabel
     , ROW_NUMBER() OVER (PARTITION BY CustomerLabel ORDER BY StartDt ASC) AS SeqNo
FROM dbo.DimCustomer

So why would you use a sequence number? It makes sense if you have an existing 3rd Normal Form data warehouse and want to improve the performance of queries that use date ranges. You can refresh your dimension table each day, and you get an easy mechanism for returning the latest view of a dimension member: remove the sequence number from the join and qualify on the Current indicator = 1.
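A T-SQL sketch of that daily refresh, assuming the column names shown in Fig 5 and Fig 6, could recompute both the sequence numbers and the current indicator in one pass:

```sql
-- Recompute each member's sequence numbers from its version history and
-- flag the most recent version as current. Fact rows then join on
-- (CustomerLabel, SeqNo) instead of a date range.
UPDATE c
SET c.SeqNo     = s.SeqNo,
    c.IsCurrent = CASE WHEN s.RevSeqNo = 1 THEN 1 ELSE 0 END
FROM dbo.DimCustomer_Seq c
INNER JOIN (
    SELECT CustomerLabel, StartDt,
           ROW_NUMBER() OVER (PARTITION BY CustomerLabel ORDER BY StartDt ASC)  AS SeqNo,
           ROW_NUMBER() OVER (PARTITION BY CustomerLabel ORDER BY StartDt DESC) AS RevSeqNo
    FROM dbo.DimCustomer_Seq
) s
    ON c.CustomerLabel = s.CustomerLabel
   AND c.StartDt = s.StartDt;
```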

Performance Considerations

I used the ContosoRetailDW database available from www.codeplex.com. I created new fact tables using only the columns required for each scenario to join to the customer dimension. In addition, I wanted to test against a large dimension with many changes, so I iterated through the customer dimension table and changed the location every two days.

I ran the following queries:


-- Surrogate Key
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SELECT d.CalendarMonth, c.Location, SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactOnlineSales_SK f
INNER JOIN dbo.DimDate d
    ON f.DateKey = d.DateKey
INNER JOIN dbo.DimCustomer_SK c
    ON f.CustomerKey = c.CustomerKey
WHERE d.CalendarMonth = 200708
GROUP BY d.CalendarMonth, c.Location
GO 3

-- Date Range
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SELECT d.CalendarMonth, c.Location, SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactOnlineSales_Range f
INNER JOIN dbo.DimDate d
    ON f.DateKey = d.DateKey
INNER JOIN dbo.DimCustomer_Range c
    ON f.CustomerLabel = c.CustomerLabel
   AND f.DateKey BETWEEN c.StartDt AND c.EndDt
WHERE d.CalendarMonth = 200708
GROUP BY d.CalendarMonth, c.Location
GO 3

-- Seq No
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SELECT d.CalendarMonth, c.Location, SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactOnlineSales_Seq f
INNER JOIN dbo.DimDate d
    ON f.DateKey = d.DateKey
INNER JOIN dbo.DimCustomer_Seq c
    ON f.CustomerLabel = c.CustomerLabel
   AND f.CustomerSeqNo = c.SeqNo
WHERE d.CalendarMonth = 200708
GROUP BY d.CalendarMonth, c.Location
GO 3

Before looking at the results, it’s worth noting that the design of the dimension table will have a direct impact on the width of the fact table and the number of logical reads required to fulfil a query.

Here is a table showing the width of each table, the number of logical reads and the query duration results:

Fig 7

Surrogate keys clearly outperform the sequence join, and both far outperform the date range, which, as expected, performed poorly.

 

Current Version

With a typical SCD Type 2 solution as proposed by Ralph Kimball, you decide which attributes will have their history tracked and build your ETL accordingly. So what if we want to see the most recent values for ALL dimension attributes?

There are two general ways to resolve this issue:

  1. Create a dimension view that does a self-join to get the latest version of the dimension
  2. Place the Natural Key in the fact table and join to the dimension table on the natural key where IsCurrent=1
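Option 1 can be sketched as a view along the following lines. The view name is my own, and the column list is trimmed for illustration:

```sql
-- Self-join on the natural key: every historical row keeps its surrogate
-- key (so existing fact joins still work), but the attributes come from
-- the member's current version.
CREATE VIEW dbo.DimCustomer_Current AS
SELECT c.CustomerKey,      -- historical key, as stored in the fact table
       c.CustomerLabel,
       c1.Location         -- always the latest value
FROM dbo.DimCustomer_SK c
INNER JOIN dbo.DimCustomer_SK c1
    ON c.CustomerLabel = c1.CustomerLabel
   AND c1.IsCurrent = 1;
```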

So how do these compare for performance?

I kept to the same table structures as above, and tested each:


-- Check for latest version

-- Surrogate Key with self-join
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SELECT d.CalendarMonth, c1.Location, SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactOnlineSales_SK f
INNER JOIN dbo.DimDate d
    ON f.DateKey = d.DateKey
INNER JOIN dbo.DimCustomer_SK c
    ON f.CustomerKey = c.CustomerKey
INNER JOIN dbo.DimCustomer_SK c1
    ON c.CustomerLabel = c1.CustomerLabel
   AND c1.IsCurrent = 1
WHERE d.CalendarMonth = 200708
GROUP BY d.CalendarMonth, c1.Location
GO 3

-- Range
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SELECT d.CalendarMonth, c.Location, SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactOnlineSales_Range f
INNER JOIN dbo.DimDate d
    ON f.DateKey = d.DateKey
INNER JOIN dbo.DimCustomer_Range c
    ON f.CustomerLabel = c.CustomerLabel
   AND c.IsCurrent = 1
WHERE d.CalendarMonth = 200708
GROUP BY d.CalendarMonth, c.Location
GO 3

-- Seq No
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SELECT d.CalendarMonth, c.Location, SUM(f.SalesAmount) AS SalesAmount
FROM dbo.FactOnlineSales_Seq f
INNER JOIN dbo.DimDate d
    ON f.DateKey = d.DateKey
INNER JOIN dbo.DimCustomer_Seq c
    ON f.CustomerLabel = c.CustomerLabel
   AND c.IsCurrent = 1
WHERE d.CalendarMonth = 200708
GROUP BY d.CalendarMonth, c.Location
GO 3

Fig 8 below illustrates the performance differences:

Fig 8

The self-join using a surrogate key performs the worst, although not a great deal worse than the other two methods.

Conclusions

There is no doubt that using a surrogate key provides superior performance for regular reporting, as well as providing a level of abstraction from the source system and all the other benefits of surrogate keys. Even though the self-join is not as performant as joining on the natural key, its performance is adequate. Should the business regularly require the most recent version of all attributes, it might be worth including the natural key in your fact tables.

If you are working on a 3rd Normal Form data warehouse that uses date ranges as part of the join, you may want to investigate using a sequence number and materializing that in your fact table to improve performance.

Posted in: Data Modeling, SQL