Process Modeling – Is this a fact or a dimension?

Posted on November 23, 2011

8



This is a question I see a lot on forums (and also get asked) and it’s generally based on confusion around events and processes (such as an incident, a bank application process etc).

Let’s remember what the difference is between a fact and a dimension.

A fact table stores a business process.  A fact is therefore an event or an occurence of something  in that business process – whether that fact is a sales transaction, a daily snapshot amount,  or a process step. A dimension provides the context around that fact e.g. location, product etc.

When it comes to modeling processes such as a mortgage application, what we are measuring is an occurence of a process step. The process step itself is a dimension as it provides the context to that business process.

A common modeling mistake is to consider the process we are measuring as the dimension e.g. an Order number, or an application number, and then to model these as a Slowly Changing Dimensions Type 2 Dimension. These are degenerate dimensions but should not have their own table, as by itself the OrderNumber or Application Number does not give me a lot of value.

I’m going to use the sales order process here as an example, because I can use existing AdventureWorks database tables to illustrate my points.

What I’ve done is create a lifecycle status dimension table with values of Ordered, Due and Shipped. I’ve also created a trimmed down version of FactInternetSales with dimensions of Date, Customer and Status, and a degenerate dimension of the SalesOrderNumber and the SalesOrderLineNumber. 

Figure 1 below is the correct way to model a process:

Figure 1

Here is the code I used to generate DimStatus and to create and populate the Fact table:

CREATE TABLE DimStatus(StatusKey INT NOT NULL PRIMARY KEY, StatusName VARCHAR(50))
GO
INSERT INTO DimStatus
VALUES (1, 'Ordered')
, (2,'Due')
, (3,'Shipped')
GO

CREATE TABLE dbo.FactProcessModel(
  DateKey INT
, CustomerKey INT
, SalesOrderNumber nvarchar(20)
, SalesOrderLineNumber tinyint
, StatusKey INT
, Cnt INT)
GO

;WITH CTE AS (
	SELECT CustomerKey, SalesOrderNumber, SalesOrderLineNumber, OrderDateKey AS [Ordered], DueDateKey as [Due], ShipDateKey as [Shipped]
	FROM dbo.FactInternetSales
	)
, Transposed AS (
SELECT SalesOrderNumber, SalesOrderLineNumber, CustomerKey, DateKey, StatusName
FROM cte c
UNPIVOT(DateKey FOR StatusName IN ([Ordered],[Due],[Shipped])
 ) u
 )
INSERT INTO dbo.FactProcessModel
SELECT DateKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber, StatusKey, 1 AS Cnt
FROM Transposed t
JOIN dbo.DimStatus s
ON t.StatusName = s.StatusName

GO
 

Figure 2 shows the alternative method that I often see proposed, which is to make the order a Slowly Changing Dimension Type 2, to capture the changes.

Figure 2

Here is the code I used to create and populate DimOrder and FactProcessModelSCD:
 
CREATE TABLE DimOrder( OrderKey INT NOT NULL IDENTITY(1,1)
, SalesOrderNumber nvarchar(20) NOT NULL
, SalesOrderLineNumber tinyint NOT NULL
, LifecycleStatus VARCHAR(50) NOT NULL
, StartDate DATE NOT NULL
, EndDate DATE NOT NULL,
CONSTRAINT PK_DimOrder PRIMARY KEY (OrderKey)
)
GO
CREATE INDEX UIX_DimOrder ON DimOrder (SalesOrderNumber, SalesOrderLineNumber, StartDate)
GO

;WITH CTE AS (
	SELECT
	  SalesOrderNumber
	, SalesOrderLineNumber
	, 'Ordered' AS LifecycleStatus
	, OrderDate AS StartDate
	, DATEADD(d,-1, ShipDate) AS EndDate
	, ROW_NUMBER() OVER (PARTITION BY SalesOrderNumber, SalesOrderLineNumber, OrderDate ORDER BY (select(0))) AS rn
	FROM dbo.FactInternetSales)
INSERT INTO DimOrder
SELECT SalesOrderNumber, SalesOrderLineNumber, LifecycleStatus, StartDate, EndDate
FROM CTE
WHERE rn = 1
GO

;WITH CTE AS (
	SELECT
	  SalesOrderNumber
	, SalesOrderLineNumber
	, 'Shipped' AS LifecycleStatus
	, ShipDate AS StartDate
	, DATEADD(d,-1, DueDate) AS EndDate
	, ROW_NUMBER() OVER (PARTITION BY SalesOrderNumber, SalesOrderLineNumber, ShipDate ORDER BY (select(0))) AS rn
	FROM dbo.FactInternetSales)
INSERT INTO DimOrder
SELECT SalesOrderNumber, SalesOrderLineNumber, LifecycleStatus, StartDate, EndDate
FROM CTE
WHERE rn = 1
GO
;WITH CTE AS (
	SELECT
      SalesOrderNumber
	, SalesOrderLineNumber
	, 'Due' AS LifecycleStatus
	, DueDate AS StartDate
	, '2999-12-31' AS EndDate
	, ROW_NUMBER() OVER (PARTITION BY SalesOrderNumber, SalesOrderLineNumber, DueDate ORDER BY (select(0))) AS rn
	FROM dbo.FactInternetSales)
INSERT INTO DimOrder
SELECT SalesOrderNumber, SalesOrderLineNumber, LifecycleStatus, StartDate, EndDate
FROM CTE
WHERE rn = 1
GO

CREATE TABLE dbo.FactProcessModelSCD(
  DateKey INT
, CustomerKey INT
, OrderKey INT
, Cnt INT)
GO

;WITH CTE AS (
	SELECT CustomerKey
		, SalesOrderNumber
		, SalesOrderLineNumber
		, OrderDateKey AS [Ordered]
		, DueDateKey as [Due]
		, ShipDateKey as [Shipped]
	FROM dbo.FactInternetSales
	)
, Transposed AS (
SELECT SalesOrderNumber, SalesOrderLineNumber, CustomerKey, DateKey, StatusName
FROM cte c
UNPIVOT(DateKey FOR StatusName IN ([Ordered],[Due],[Shipped])
 ) u
 )
INSERT INTO dbo.FactProcessModelSCD
SELECT d.DateKey, CustomerKey, orderkey, 1 AS Cnt
FROM Transposed t
JOIN dbo.DimDate d
ON d.datekey = t.dateKey
JOIN dbo.DimOrder o
ON t.SalesOrderNumber = o.SalesOrderNumber
AND t.SalesOrderLineNumber= o.SalesOrderLineNumber
AND d.FullDateAlternateKey BETWEEN o.StartDate AND o.EndDate
GO
 

It’s important to note what we’ve done in Figure 2: we’ve created a dimension table with the same number of rows as our fact table !

Let’s take an example query that we might like to ask: “Show me those customers with Orders that are due on June 1st, 2004, and the count of those orders for each customer”

Here are the queries:

SELECT c.LastName, c.FirstName, SUM(Cnt)
FROM dbo.FactProcessModel f
JOIN dbo.DimStatus s
ON s.StatusKey = f.StatusKey
JOIN dbo.DimCustomer c
ON c.CustomerKey = f.CustomerKey
WHERE StatusName = 'Due'
AND DateKey = 20040601
GROUP BY c.LastName, c.FirstName
GO
SELECT c.LastName, c.FirstName, SUM(Cnt)
FROM dbo.FactProcessModelSCD f
JOIN dbo.DimOrder o
ON f.OrderKey = o.OrderKey
JOIN dbo.DimCustomer c
ON c.CustomerKey = f.CustomerKey
WHERE o.LifecycleStatus = 'Due'
AND DateKey = 20040601
GROUP BY c.LastName, c.FirstName
 

The second query (on the example using a SCD Type2) has double the estimated cost, and is twice as slow on a small dataset. On a large fact table with 50 million records, if we joined to a dimension of the same row count, the difference in queries would be far more significant. In an environment like where I am working where each dimension lookup is a join on the natural key and where the fact transaction date is between the start and end dates of the dimension table, expect the performance to be abysmal.

So listed below are the disadvantages of using a Slowly Changing Dimension to capture the changes in status of a process:

  1. Poor query performance
  2. Increased storage requirements
  3. More ETL development and testing
  4. Increased ETL processing times

In my next post I will examine the advantages and disadvantages of using the above transactional model to capture process changes, as compared with using an accumulating snapshot fact table, including when using SSAS.

Advertisements
Posted in: Data Modeling, SQL