Populating a many-to-many bridge table pt2

Posted on February 1, 2011

1



I have been testing the previous example code (see https://jsimonbi.wordpress.com/2011/01/25/populating-a-many-to-many-bridge-table/) to populate a many to many bridge table with a large dataset and wasn’t happy with the performance. So I’ve  gone back to the drawing board on how to do this. I wanted a set based solution, and something that was both simple and efficient.

To recap, if we are recording sales, and many salespeople can be associated with a sale, a bridge table would sit between the person dimension table and the sale fact table such as:

many-to-many bridge table

But how do we populate this bridge table?

Let’s set up the test table again:

IF OBJECT_ID('dbo.Test') IS NOT NULL DROP TABLE dbo.Test
GO
CREATE TABLE Test(SaleID INT, PersonID VARCHAR(10))
INSERT INTO Test
VALUES (1,1)
, (1,2)
, (2,1)
, (3,2)
, (4,1)
, (4,2)
, (5,3)
, (5,4)
GO

I’ve made a change to this table from my previous post and changed PersonID to a varchar to make the following code simpler.

Now, rather than using a recursive CTE and enumerated path with a cursor, I’m going to use the STUFF command to create an XML concatenated string.

;WITH CTE AS (
SELECT SaleID
, PersonID
, STUFF(
 (SELECT '.' + PersonID
 FROM Test a
 WHERE a.saleID = b.SaleID
 ORDER BY PersonID
 FOR XML PATH('')
 ),1,1,'') as TeamPath
FROM Test b)
SELECT DISTINCT
   DENSE_RANK() OVER (ORDER BY TeamPath ) AS SalesTeamID
, PersonID
FROM CTE
ORDER BY SalesTeamID

And that’s it! Simple, elegant and performant!

To get the allocation as in my previous post, we simply use another common table expression and window function e.g.:

WITH CTE AS (
SELECT SaleID
   , PersonID
   , STUFF(
    (SELECT '.' + PersonID
    FROM Test a
    WHERE a.SaleID = b.SaleID
    ORDER BY PersonID
    FOR XML PATH('')
    ),1,1,'') as TeamPath
FROM Test b)
, DistinctCount AS
(SELECT TeamPath
	, COUNT(DISTINCT PersonID) AS PersonCnt
FROM CTE
GROUP BY TeamPath)
, AllocationCTE AS
(
 SELECT
     PersonID
   , TeamPath
   , DENSE_RANK() OVER (ORDER BY TeamPath ) AS SalesTeamID
 FROM CTE )
SELECT DISTINCT
   a.SalesTeamID
 , a.PersonID
 , 1/CAST(b.PersonCnt AS DECIMAL(6,2)) AS Allocation
FROM AllocationCTE a
JOIN DistinctCount b
ON a.TeamPath = b.TeamPath
ORDER BY SalesTeamID

This returns:many-to-many bridge table contents

When I tested my previous code that uses the cursor and recursive CTE – without the allocation – and based on a dataset of around 680 thousand records, the time to populate the bridge table was 39 minutes.

The above example code takes 1 minute and 17 seconds on the same data set. When I looked at the query plan I noticed a Lazy Spool. With the size of the dataset this made sense so I loaded the initial CTE into a temp table, put an index on TeamPath and ran the DENSE_RANK function on that. This reduced the total time to 33 seconds. A vast improvement over my original code. See below for the sample code:

SELECT SaleID
 , PersonID
 , CAST(STUFF(
  (SELECT '.' + PersonID
  FROM Test a
  WHERE a.SaleID = b.SaleID
  ORDER BY PersonID
  FOR XML PATH('')
  ),1,1,'') AS VARCHAR(200)) AS TeamPath
INTO #Initial
FROM Test b
GO
CREATE INDEX IX ON #Initial(TeamPath)
GO
SELECT DENSE_RANK() OVER (ORDER BY TeamPath ) AS SalesTeamID
, PersonID
FROM #Initial
ORDER BY SalesTeamID

We need to cast the XML string as a varchar so that we can place an index on it.

The lesson learned here is to not get too focused on a particular method, and to always look for more performant alternatives.

Advertisements
Posted in: Data Modeling, SQL