This is a question I see a lot on forums (and also get asked) and it’s generally based on confusion around events and processes (such as an incident, a bank application process etc).
Let’s remember what the difference is between a fact and a dimension.
A fact table stores a business process. A fact is therefore an event or an occurence of something in that business process – whether that fact is a sales transaction, a daily snapshot amount, or a process step. A dimension provides the context around that fact e.g. location, product etc.
When it comes to modeling processes such as a mortgage application, what we are measuring is an occurence of a process step. The process step itself is a dimension as it provides the context to that business process.
A common modeling mistake is to consider the process we are measuring as the dimension e.g. an Order number, or an application number, and then to model these as a Slowly Changing Dimensions Type 2 Dimension. These are degenerate dimensions but should not have their own table, as by itself the OrderNumber or Application Number does not give me a lot of value.
I’m going to use the sales order process here as an example, because I can use existing AdventureWorks database tables to illustrate my points.
What I’ve done is create a lifecycle status dimension table with values of Ordered, Due and Shipped. I’ve also created a trimmed down version of FactInternetSales with dimensions of Date, Customer and Status, and a degenerate dimension of the SalesOrderNumber and the SalesOrderLineNumber.
Figure 1 below is the correct way to model a process:
Here is the code I used to generate DimStatus and to create and populate the Fact table:
CREATE TABLE DimStatus(StatusKey INT NOT NULL PRIMARY KEY, StatusName VARCHAR(50)) GO INSERT INTO DimStatus VALUES (1, 'Ordered') , (2,'Due') , (3,'Shipped') GO CREATE TABLE dbo.FactProcessModel( DateKey INT , CustomerKey INT , SalesOrderNumber nvarchar(20) , SalesOrderLineNumber tinyint , StatusKey INT , Cnt INT) GO ;WITH CTE AS ( SELECT CustomerKey, SalesOrderNumber, SalesOrderLineNumber, OrderDateKey AS [Ordered], DueDateKey as [Due], ShipDateKey as [Shipped] FROM dbo.FactInternetSales ) , Transposed AS ( SELECT SalesOrderNumber, SalesOrderLineNumber, CustomerKey, DateKey, StatusName FROM cte c UNPIVOT(DateKey FOR StatusName IN ([Ordered],[Due],[Shipped]) ) u ) INSERT INTO dbo.FactProcessModel SELECT DateKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber, StatusKey, 1 AS Cnt FROM Transposed t JOIN dbo.DimStatus s ON t.StatusName = s.StatusName GO
Figure 2 shows the alternative method that I often see proposed, which is to make the order a Slowly Changing Dimension Type 2, to capture the changes.
CREATE TABLE DimOrder( OrderKey INT NOT NULL IDENTITY(1,1) , SalesOrderNumber nvarchar(20) NOT NULL , SalesOrderLineNumber tinyint NOT NULL , LifecycleStatus VARCHAR(50) NOT NULL , StartDate DATE NOT NULL , EndDate DATE NOT NULL, CONSTRAINT PK_DimOrder PRIMARY KEY (OrderKey) ) GO CREATE INDEX UIX_DimOrder ON DimOrder (SalesOrderNumber, SalesOrderLineNumber, StartDate) GO ;WITH CTE AS ( SELECT SalesOrderNumber , SalesOrderLineNumber , 'Ordered' AS LifecycleStatus , OrderDate AS StartDate , DATEADD(d,-1, ShipDate) AS EndDate , ROW_NUMBER() OVER (PARTITION BY SalesOrderNumber, SalesOrderLineNumber, OrderDate ORDER BY (select(0))) AS rn FROM dbo.FactInternetSales) INSERT INTO DimOrder SELECT SalesOrderNumber, SalesOrderLineNumber, LifecycleStatus, StartDate, EndDate FROM CTE WHERE rn = 1 GO ;WITH CTE AS ( SELECT SalesOrderNumber , SalesOrderLineNumber , 'Shipped' AS LifecycleStatus , ShipDate AS StartDate , DATEADD(d,-1, DueDate) AS EndDate , ROW_NUMBER() OVER (PARTITION BY SalesOrderNumber, SalesOrderLineNumber, ShipDate ORDER BY (select(0))) AS rn FROM dbo.FactInternetSales) INSERT INTO DimOrder SELECT SalesOrderNumber, SalesOrderLineNumber, LifecycleStatus, StartDate, EndDate FROM CTE WHERE rn = 1 GO ;WITH CTE AS ( SELECT SalesOrderNumber , SalesOrderLineNumber , 'Due' AS LifecycleStatus , DueDate AS StartDate , '2999-12-31' AS EndDate , ROW_NUMBER() OVER (PARTITION BY SalesOrderNumber, SalesOrderLineNumber, DueDate ORDER BY (select(0))) AS rn FROM dbo.FactInternetSales) INSERT INTO DimOrder SELECT SalesOrderNumber, SalesOrderLineNumber, LifecycleStatus, StartDate, EndDate FROM CTE WHERE rn = 1 GO CREATE TABLE dbo.FactProcessModelSCD( DateKey INT , CustomerKey INT , OrderKey INT , Cnt INT) GO ;WITH CTE AS ( SELECT CustomerKey , SalesOrderNumber , SalesOrderLineNumber , OrderDateKey AS [Ordered] , DueDateKey as [Due] , ShipDateKey as [Shipped] FROM dbo.FactInternetSales ) , Transposed AS ( SELECT SalesOrderNumber, SalesOrderLineNumber, CustomerKey, DateKey, StatusName FROM cte c UNPIVOT(DateKey FOR StatusName IN ([Ordered],[Due],[Shipped]) ) u ) INSERT INTO dbo.FactProcessModelSCD SELECT d.DateKey, CustomerKey, orderkey, 1 AS Cnt FROM Transposed t JOIN dbo.DimDate d ON d.datekey = t.dateKey JOIN dbo.DimOrder o ON t.SalesOrderNumber = o.SalesOrderNumber AND t.SalesOrderLineNumber= o.SalesOrderLineNumber AND d.FullDateAlternateKey BETWEEN o.StartDate AND o.EndDate GO
It’s important to note what we’ve done in Figure 2: we’ve created a dimension table with the same number of rows as our fact table !
Let’s take an example query that we might like to ask: “Show me those customers with Orders that are due on June 1st, 2004, and the count of those orders for each customer”
Here are the queries:
SELECT c.LastName, c.FirstName, SUM(Cnt) FROM dbo.FactProcessModel f JOIN dbo.DimStatus s ON s.StatusKey = f.StatusKey JOIN dbo.DimCustomer c ON c.CustomerKey = f.CustomerKey WHERE StatusName = 'Due' AND DateKey = 20040601 GROUP BY c.LastName, c.FirstName GO SELECT c.LastName, c.FirstName, SUM(Cnt) FROM dbo.FactProcessModelSCD f JOIN dbo.DimOrder o ON f.OrderKey = o.OrderKey JOIN dbo.DimCustomer c ON c.CustomerKey = f.CustomerKey WHERE o.LifecycleStatus = 'Due' AND DateKey = 20040601 GROUP BY c.LastName, c.FirstName
The second query (on the example using a SCD Type2) has double the estimated cost, and is twice as slow on a small dataset. On a large fact table with 50 million records, if we joined to a dimension of the same row count, the difference in queries would be far more significant. In an environment like where I am working where each dimension lookup is a join on the natural key and where the fact transaction date is between the start and end dates of the dimension table, expect the performance to be abysmal.
So listed below are the disadvantages of using a Slowly Changing Dimension to capture the changes in status of a process:
- Poor query performance
- Increased storage requirements
- More ETL development and testing
- Increased ETL processing times
In my next post I will examine the advantages and disadvantages of using the above transactional model to capture process changes, as compared with using an accumulating snapshot fact table, including when using SSAS.
Jagjit Thind
November 23, 2011
Hi,
You mentioned the 2nd query is slower, but it is based on the correct way to model a process (i.e. the alternative approach to having order as scd type 2). Is it really slower?
Thanks for this modelling post, keep them coming.
Jagjit
jsimonbi
November 23, 2011
Hi Jagjit,
Good pickup! That was a cut and paste error. I’ve updated the post.
Flavio
December 4, 2011
Hi John, nice post! Thx for your explanation. If I plan a BI environment to primary school, do I should use the grades (exams) as a fact? Thx
jsimonbi
December 5, 2011
Sure. Use a regular transaction fact table with dimensions like student, date, class, teacher etc and you could use the exam results as a measure. You may also want to use a factless fact table to store the intersection of student, class, teacher, year etc.
Jaime
February 20, 2012
Hi John
I’m developing my thesis and had the same question some time ago: exist an status dimension? or is correct to define an status dimension?
My first solution of natural way was define an status dimension but I couldn’t find literature about this scenario.
Thank for your explanation.
Brian
March 15, 2012
Great explanation. Thanks for the great post.
Phil
May 29, 2012
I’m guessing with a Status Start Date and Status End Date added to the fact in the first method, I can then achieve all I would have gained in the second method – with no extra rows and no second equi-rowed SCD. But would the two date columns in the fact be a significant overhead? If I need to know how long orders spent in a given status, I can’t do it with the first method as provided, can I? Do I need separate fact builds for each different purpose (eg. order counts vs. status durations) or can we build a general purpose fact for a business process without too much baggage?
jsimonbi
May 29, 2012
Hi Phil.
You could include a start and end date, but you’d probably be better off pivoting the data into an Accumulating snapshot to provide lag times, the most common scenario for process based reporting.