Process Modeling – Is this a fact or a dimension?

Posted on November 23, 2011

This is a question I see a lot on forums (and also get asked) and it’s generally based on confusion around events and processes (such as an incident, a bank application process etc).

Let’s remember what the difference is between a fact and a dimension.

A fact table stores a business process. A fact is therefore an event or an occurence of something in that business process – whether that fact is a sales transaction, a daily snapshot amount, or a process step. A dimension provides the context around that fact e.g. location, product etc.

When it comes to modeling processes such as a mortgage application, what we are measuring is an occurence of a process step. The process step itself is a dimension as it provides the context to that business process.

A common modeling mistake is to consider the process we are measuring as the dimension e.g. an Order number, or an application number, and then to model these as a Slowly Changing Dimensions Type 2 Dimension. These are degenerate dimensions but should not have their own table, as by itself the OrderNumber or Application Number does not give me a lot of value.

I’m going to use the sales order process here as an example, because I can use existing AdventureWorks database tables to illustrate my points.

What I’ve done is create a lifecycle status dimension table with values of Ordered, Due and Shipped. I’ve also created a trimmed down version of FactInternetSales with dimensions of Date, Customer and Status, and a degenerate dimension of the SalesOrderNumber and the SalesOrderLineNumber.

Figure 1 below is the correct way to model a process:

Figure 1

Here is the code I used to generate DimStatus and to create and populate the Fact table:

CREATE TABLE DimStatus(StatusKey INT NOT NULL PRIMARY KEY, StatusName VARCHAR(50))
GO
INSERT INTO DimStatus
VALUES (1, 'Ordered')
, (2,'Due')
, (3,'Shipped')
GO

CREATE TABLE dbo.FactProcessModel(
  DateKey INT
, CustomerKey INT
, SalesOrderNumber nvarchar(20)
, SalesOrderLineNumber tinyint
, StatusKey INT
, Cnt INT)
GO

;WITH CTE AS (
	SELECT CustomerKey, SalesOrderNumber, SalesOrderLineNumber, OrderDateKey AS [Ordered], DueDateKey as [Due], ShipDateKey as [Shipped]
	FROM dbo.FactInternetSales
	)
, Transposed AS (
SELECT SalesOrderNumber, SalesOrderLineNumber, CustomerKey, DateKey, StatusName
FROM cte c
UNPIVOT(DateKey FOR StatusName IN ([Ordered],[Due],[Shipped])
 ) u
 )
INSERT INTO dbo.FactProcessModel
SELECT DateKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber, StatusKey, 1 AS Cnt
FROM Transposed t
JOIN dbo.DimStatus s
ON t.StatusName = s.StatusName

GO

Figure 2 shows the alternative method that I often see proposed, which is to make the order a Slowly Changing Dimension Type 2, to capture the changes.

Figure 2

Here is the code I used to create and populate DimOrder and FactProcessModelSCD:

CREATE TABLE DimOrder( OrderKey INT NOT NULL IDENTITY(1,1)
, SalesOrderNumber nvarchar(20) NOT NULL
, SalesOrderLineNumber tinyint NOT NULL
, LifecycleStatus VARCHAR(50) NOT NULL
, StartDate DATE NOT NULL
, EndDate DATE NOT NULL,
CONSTRAINT PK_DimOrder PRIMARY KEY (OrderKey)
)
GO
CREATE INDEX UIX_DimOrder ON DimOrder (SalesOrderNumber, SalesOrderLineNumber, StartDate)
GO

;WITH CTE AS (
	SELECT
	  SalesOrderNumber
	, SalesOrderLineNumber
	, 'Ordered' AS LifecycleStatus
	, OrderDate AS StartDate
	, DATEADD(d,-1, ShipDate) AS EndDate
	, ROW_NUMBER() OVER (PARTITION BY SalesOrderNumber, SalesOrderLineNumber, OrderDate ORDER BY (select(0))) AS rn
	FROM dbo.FactInternetSales)
INSERT INTO DimOrder
SELECT SalesOrderNumber, SalesOrderLineNumber, LifecycleStatus, StartDate, EndDate
FROM CTE
WHERE rn = 1
GO

;WITH CTE AS (
	SELECT
	  SalesOrderNumber
	, SalesOrderLineNumber
	, 'Shipped' AS LifecycleStatus
	, ShipDate AS StartDate
	, DATEADD(d,-1, DueDate) AS EndDate
	, ROW_NUMBER() OVER (PARTITION BY SalesOrderNumber, SalesOrderLineNumber, ShipDate ORDER BY (select(0))) AS rn
	FROM dbo.FactInternetSales)
INSERT INTO DimOrder
SELECT SalesOrderNumber, SalesOrderLineNumber, LifecycleStatus, StartDate, EndDate
FROM CTE
WHERE rn = 1
GO
;WITH CTE AS (
	SELECT
      SalesOrderNumber
	, SalesOrderLineNumber
	, 'Due' AS LifecycleStatus
	, DueDate AS StartDate
	, '2999-12-31' AS EndDate
	, ROW_NUMBER() OVER (PARTITION BY SalesOrderNumber, SalesOrderLineNumber, DueDate ORDER BY (select(0))) AS rn
	FROM dbo.FactInternetSales)
INSERT INTO DimOrder
SELECT SalesOrderNumber, SalesOrderLineNumber, LifecycleStatus, StartDate, EndDate
FROM CTE
WHERE rn = 1
GO

CREATE TABLE dbo.FactProcessModelSCD(
  DateKey INT
, CustomerKey INT
, OrderKey INT
, Cnt INT)
GO

;WITH CTE AS (
	SELECT CustomerKey
		, SalesOrderNumber
		, SalesOrderLineNumber
		, OrderDateKey AS [Ordered]
		, DueDateKey as [Due]
		, ShipDateKey as [Shipped]
	FROM dbo.FactInternetSales
	)
, Transposed AS (
SELECT SalesOrderNumber, SalesOrderLineNumber, CustomerKey, DateKey, StatusName
FROM cte c
UNPIVOT(DateKey FOR StatusName IN ([Ordered],[Due],[Shipped])
 ) u
 )
INSERT INTO dbo.FactProcessModelSCD
SELECT d.DateKey, CustomerKey, orderkey, 1 AS Cnt
FROM Transposed t
JOIN dbo.DimDate d
ON d.datekey = t.dateKey
JOIN dbo.DimOrder o
ON t.SalesOrderNumber = o.SalesOrderNumber
AND t.SalesOrderLineNumber= o.SalesOrderLineNumber
AND d.FullDateAlternateKey BETWEEN o.StartDate AND o.EndDate
GO

It’s important to note what we’ve done in Figure 2: we’ve created a dimension table with the same number of rows as our fact table !

Let’s take an example query that we might like to ask: “Show me those customers with Orders that are due on June 1st, 2004, and the count of those orders for each customer”

Here are the queries:

SELECT c.LastName, c.FirstName, SUM(Cnt)
FROM dbo.FactProcessModel f
JOIN dbo.DimStatus s
ON s.StatusKey = f.StatusKey
JOIN dbo.DimCustomer c
ON c.CustomerKey = f.CustomerKey
WHERE StatusName = 'Due'
AND DateKey = 20040601
GROUP BY c.LastName, c.FirstName
GO
SELECT c.LastName, c.FirstName, SUM(Cnt)
FROM dbo.FactProcessModelSCD f
JOIN dbo.DimOrder o
ON f.OrderKey = o.OrderKey
JOIN dbo.DimCustomer c
ON c.CustomerKey = f.CustomerKey
WHERE o.LifecycleStatus = 'Due'
AND DateKey = 20040601
GROUP BY c.LastName, c.FirstName

The second query (on the example using a SCD Type2) has double the estimated cost, and is twice as slow on a small dataset. On a large fact table with 50 million records, if we joined to a dimension of the same row count, the difference in queries would be far more significant. In an environment like where I am working where each dimension lookup is a join on the natural key and where the fact transaction date is between the start and end dates of the dimension table, expect the performance to be abysmal.

So listed below are the disadvantages of using a Slowly Changing Dimension to capture the changes in status of a process:

Poor query performance
Increased storage requirements
More ETL development and testing
Increased ETL processing times

In my next post I will examine the advantages and disadvantages of using the above transactional model to capture process changes, as compared with using an accumulating snapshot fact table, including when using SSAS.

Tagged: Dimensional Modeling, SCD Type 2, Slowly Changing Dimension, T-SQL

Posted in: Data Modeling, SQL

8 Responses “Process Modeling – Is this a fact or a dimension?” →

Jagjit Thind

November 23, 2011

Hi,
You mentioned the 2nd query is slower, but it is based on the correct way to model a process (i.e. the alternative approach to having order as scd type 2). Is it really slower?

Thanks for this modelling post, keep them coming.

Jagjit

Reply

jsimonbi

November 23, 2011

Hi Jagjit,
Good pickup! That was a cut and paste error. I’ve updated the post.

Reply

Flavio

December 4, 2011

Hi John, nice post! Thx for your explanation. If I plan a BI environment to primary school, do I should use the grades (exams) as a fact? Thx

Reply

jsimonbi

December 5, 2011

Sure. Use a regular transaction fact table with dimensions like student, date, class, teacher etc and you could use the exam results as a measure. You may also want to use a factless fact table to store the intersection of student, class, teacher, year etc.

Reply

Jaime

February 20, 2012

Hi John
I’m developing my thesis and had the same question some time ago: exist an status dimension? or is correct to define an status dimension?
My first solution of natural way was define an status dimension but I couldn’t find literature about this scenario.
Thank for your explanation.

Reply
Brian

March 15, 2012

Great explanation. Thanks for the great post.

Reply
Phil

May 29, 2012

I’m guessing with a Status Start Date and Status End Date added to the fact in the first method, I can then achieve all I would have gained in the second method – with no extra rows and no second equi-rowed SCD. But would the two date columns in the fact be a significant overhead? If I need to know how long orders spent in a given status, I can’t do it with the first method as provided, can I? Do I need separate fact builds for each different purpose (eg. order counts vs. status durations) or can we build a general purpose fact for a business process without too much baggage?

Reply

jsimonbi

May 29, 2012

Hi Phil.
You could include a start and end date, but you’d probably be better off pivoting the data into an Accumulating snapshot to provide lag times, the most common scenario for process based reporting.

Reply

Process Modeling – Is this a fact or a dimension?

Leave a comment Cancel reply

Archives

Categories

Process Modeling – Is this a fact or a dimension?

Share this:

Related

Leave a comment Cancel reply

Archives

Categories