Data Modeling Conundrum Follow up

Posted on March 17, 2011


As a follow up to my previous post on a Data Modeling Conundrum in January (see I left it open as to what the appropriate dimensional model should be.

To recap from my first post, let’s say that I’m trying to track Marine Incidents. Zero or many vessels may be involved in an incident, and zero or many persons may be involved in an incident. A vessel will have attributes relevant to the incident, such as it’s role in the incident and any damage sustained during the incident. The same applies to a person e.g. Valid Licence, Injury status etc.

In my earlier post, I presented 4 options:

  1. All rows in one fact table
  2. Bridge table
  3. Fact tables joining to fact tables
  4. Hybrid fact table

Back in 2006 I used Option 3:
Multiple Fact Tables joined on Degenerate Dimension

Notice I have my primary fact table FactIncident joining to FactIncidentPerson and FactIncidentVessel on the degenerate dimension IncidentID. Now in Kimball methodology this is traditionally seen as a “No-No”.

It’s worth noting that Options 2, 3 and 4 are all deficient in being able to query which person was on which vessel during an incident. To do this, we need a factless fact table as seen in the above diagram with the table IncidentPersonVessel.

I had thought while writing the first post that Option 4 would be the optimal solution because the number of times a vessel would be involved with other vessels in an incident would be reduced using the bridge table as shown below:

However I was wrong.

Firstly, it’s highly unlikely in this scenario that a vessel may be involved in an accident with the same vessels on more than one occasion.

Secondly, to properly join the Fact table to the bridge table, the Bridge Group key needs to be a combination not just of the vessels, but of their roles and damage sustained in an incident. Otherwise it’s logically possible to have two vessels involved in two separate incidents but with different role and damage values and if the group key is based only on the vessels you get duplicates. Having the extended attributes of the bridge tables as part of the group key causes the cardinality to be high enough to essentially render the bridge table redundant and to be the same as the FactIncidentVessel table in Option 3.

The only other way to get around this is to include the attributes VesselRoleKey, VesselDamageKey and PersonRoleKey in the factless fact table. However you’re then left with essentially Option 1 but with even more fact tables!

This leaves us with Options 3 as shown above and Option 1.

I created some dummy data and tested a variety of queries to test their performance. While I could have done a variety of things such as adding covering indexes and using query hints to improve the performance for each Option, I decided not to and just let the Query Optimizer do its thing with a simple but appropriate indexing strategy.

The tests I ran were:

  1. Incident count over months
  2. Person Count over months by IncidentType
  3. Person City Count over months by IncidentType
  4. Display all records for a month by an incident type and vessel role
  5. Show vessel and person details of incidents for a particular month
  6. Count of Incidents involving 3 or more vessels over months
  7. Count of vessels and incidents by Weather condition by month

The results can be seen in the below chart:

From the chart we can see that the multiple-fact-table solution (Option 3) is nearly 24% faster than the all-in-one fact table solution for the above queries.  However for queries where we need to display details for both persons and vessels, such as Tests 4 and 5, the all-in-one approach performs best.
Note, that I am definitely not advocating to not include bridge tables in your solutions. Bridge tables provide tremendous performance benefits but they simply don’t apply to this situation.

Posted in: Data Modeling, SQL