Dimensional Modeling Worst Practices

Posted on May 16, 2011

2



As a consultant I see a lot of BI/DW solutions with various clients, and am often called in to fix performance issues. Often times developers/architects will ignore the Ralph Kimball methodology, or try to “improve it” by breaking a few rules. So I thought it might be worthwhile for me to post some of the Worst Practices for Dimensional Modeling.

The most common issue I see with dimensional models is that the architect or developer will include the natural key or a description in the fact table in order to improve performance. The logic is that since snow-flaking is bad, because of the extra joins (this is not the reason for avoiding snow-flaking btw), then taking out as many joins as possible has to be good, right?

Let’s look at how this actually performs compared to the Kimball methodology.

I’m using the Contoso Retail data set available from www.codeplex.com

I’ve got the original table, a table with the natural keys included and a table with the natural keys and a comments column (another worst practice) included. In the wide fact tables I’ve also included currency conversion values for the most common conversions, as this is something I’ve also seen done to “improve performance”.

Below are the three fact tables with their column sizes:

If we run the following code and check the Estimated Execution plan we can see the width of each table:

SELECT TOP (1) *
FROM dbo.FactOnlineSales_Orig
--148 B
SELECT TOP (1) *
FROM dbo.FactOnlineSales_Wide
--740 B
SELECT TOP (1) *
FROM dbo.FactOnlineSales_WideComments
-- 786

We can see that the tables with the natural keys and with the comments are substantially wider than the original. What this means is that when we do a scan, we get less records per page, pushing up the I/O for the wider tables.

So how do these wider tables perform compared to a more normalised fact table? I ran four queries simulating normal data warehouse queries, clearing the cache before each query and then running the query three times. See below for the queries:


/********************************
Narrow
*********************************/

DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
go
SELECT c.CustomerLabel, SUM(SalesAmount) AS SaleAmount
FROM FactOnlineSales_Orig f (NOLOCK)
JOIN dbo.DimCustomer c (NOLOCK)
ON f.CustomerKey = c.CustomerKey
GROUP BY c.CustomerLabel
GO 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT c.CustomerLabel, SUM(SalesAmount) AS SaleAmount
FROM FactOnlineSales_Orig f (NOLOCK)
JOIN dbo.DimCustomer c (NOLOCK)
ON f.CustomerKey = c.CustomerKey
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
WHERE d.CalendarYear = 2008
GROUP BY c.CustomerLabel
GO 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT d.CalendarYear, c.CustomerLabel, SUM(SalesAmount) AS SaleAmount
FROM FactOnlineSales_Orig f (NOLOCK)
JOIN dbo.DimCustomer c (NOLOCK)
ON f.CustomerKey = c.CustomerKey
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
GROUP BY d.CalendarYear,c.CustomerLabel
GO 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT p.ProductName, SUM(SalesAmount*fer.EndOfDayRate) AS SaleAmount
FROM dbo.FactOnlineSales_Orig f (NOLOCK)
JOIN dbo.DimProduct p (NOLOCK)
ON f.ProductKey = p.ProductKey
INNER JOIN dbo.FactExchangeRate fer (NOLOCK)
ON fer.DateKey = f.DateKey
AND fer.CurrencyKey = 3 -- AUD
GROUP BY p.ProductName
GO 3
/********************************
Wide
*********************************/
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT f.CustomerLabel, SUM(SalesAmount) AS SaleAmount
FROM FactOnlineSales_Wide f (NOLOCK)
GROUP BY f.CustomerLabel
GO 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT f.CustomerLabel, SUM(SalesAmount) AS SaleAmount
FROM FactOnlineSales_Wide f (NOLOCK)
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
WHERE d.CalendarYear = 2008
GROUP BY f.CustomerLabel
GO 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT d.CalendarYear,f.CustomerLabel, SUM(SalesAmount) AS SaleAmount
FROM FactOnlineSales_Wide f (NOLOCK)
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
GROUP BY d.CalendarYear,f.CustomerLabel
GO 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT p.ProductName, SUM(f.SalesAmount*f.AUDEndOfDayRate) AS SaleAmount
FROM dbo.FactOnlineSales_Wide f (NOLOCK)
JOIN dbo.DimProduct p (NOLOCK)
ON f.ProductKey = p.ProductKey
GROUP BY p.ProductName
GO 3
/****************************
 Wide Comments
 ****************************/
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT f.CustomerLabel, SUM(SalesAmount) AS SaleAmount
FROM FactOnlineSales_WideComments f (NOLOCK)
GROUP BY f.CustomerLabel
GO 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT f.CustomerLabel, SUM(SalesAmount) AS SaleAmount
FROM FactOnlineSales_WideComments f (NOLOCK)
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
WHERE d.CalendarYear = 2008
GROUP BY f.CustomerLabel
GO 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT d.CalendarYear,f.CustomerLabel, SUM(SalesAmount) AS SaleAmount
FROM FactOnlineSales_WideComments f (NOLOCK)
JOIN dbo.DimDate d (NOLOCK)
ON f.DateKey = d.Datekey
GROUP BY d.CalendarYear,f.CustomerLabel
GO 3
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
SELECT p.ProductName, SUM(f.SalesAmount*f.AUDEndOfDayRate) AS SaleAmount
FROM dbo.FactOnlineSales_WideComments f (NOLOCK)
JOIN dbo.DimProduct p (NOLOCK)
ON f.ProductKey = p.ProductKey
GROUP BY p.ProductName
GO 3

 

Here are the results of the queries.

We can see from the results that even though the CPU may be lower on occassion for the wide version of the fact table and the wide version with comments, the duration is lower for the narrow fact table in all cases except the Product Currency conversion, where it is slightly higher than the denormalised versions. If you look at the number of reads for each query you can see why – the number of reads is substantially higher for the wider fact tables.

There has been a number of times where I have gone to a client site to improve query performance of their data warehouse or data mart. The most significant impact I have made has been to make the fact tables narrow (i.e. to normalise the fact table), and to move any extranneous attributes to dimensions, including comments.

There are valid reasons why you may want to include a natural key in the fact table – if you have to do a lot of historic versus current version queries (i.e. Type 6 SCD) then it may be more performant to include the natural key in the fact table. However this is something that should be rare and carefully considered before you include the natural key into the fact table.

So think carefully before denormalising your fact tables.

Posted in: Data Modeling, SQL