Archive for the 'Dimensional modelling' Category

Apr 30 2008

Handling many to many relationships (corrections made)

(There were issues with the design in my original post. Those have been corrected. I have let the errors stay and have highlighted where I have made corrections so that you can understand better)

Consider this situation.

2 new guys Ravi and Raj go to our bank and open new savings accounts 101 and 102. And Ravi deposits 100 and Raj deposits 144. This can be represented by the following.

CLICK ON THE IMAGES BELOW TO SEE A LARGER PICTURE

Now suddenly, Ramu and Ramya come to the bank and open a joint account 103. And they deposit 1000. How will you represent this? What will you enter for CUST_KEY ? Will you enter 3 or 4? It was the same transaction of Rs 1000. But which customer to indicate? What will you fill in the red ? position below?

Continue Reading »

No responses yet

Feb 26 2008

Denormalization.. normalization.. confusion

I am a big fan of Vincent McBurney. Today I was reading this post.

I agree completely with his view on Denormalization. (Scroll to the bottom of the article).

At times, I have had difficulties in recalling if normalization is splitting or joining. And it does make a lot more sense to just call them ’split’ and ‘merge’ instead of ‘normalize’ and ‘denormalize’.

Just another instance where plain common sense could have been helped everyone.

One response so far

Jul 11 2007

The maximum size of a fact table.

Published by BI Explorer under Dimensional modelling

Well, this is just a workout. It may or may not make much sense!

Here I have detailed how we can calculate the maximum size of a fact table.

Assume our datamart has 5 dimensions: Time_dim, customer_dim, product_dim, product_warehouse_dim, supplier_dim. And our fact table has 5 foreign keys, 15 measures, 2 degenerate columns

Time_dim has 20 years of rows on daily grain: so 365 X 20 = 7300 rows
There are 4000 customers in customer_dim.
200 products.
10 warehouses.
12 suppliers.

So the maximum possible fact table rows is
7300 x 4000 x 200 x 10 x 12 = 7008000000000 (7008000 million or 7008 billion)

Now this is the maximum no of rows possible in the fact table.(Assuming that every customer brought every product from every store supplied by every supplier on every day…..etc… it goes on … permutations and combinations)

Assuming that the 22 columns have an average size of 5 bytes, we need 110 bytes per row.

So now, 110 x 7008 billion = 71794 GB

The maximum possible size for this fact table (assuming that dimension count stays same) is 71794 GB.

The maximum possible increase per year could be 3590 GB.

No responses yet

Jul 04 2007

Why surrogate key?

Hi,
Why should we use surrogate keys in a data warehouse?

Here is why.

1. To separate the DWH from the operational environment
Think of this. You have a dimension loading from transactional tables. All of a sudden, the OLTP people decide to re-use all closed / inactive account id’s. For them, it wont matter a bit. For the DWH team, it will.
The incoming record will have an account id that is already there in DWH. It will overwrite the old record. This is unacceptable.

Consider another case where you have a customer dimension loaded from OLTP tables. And your company purchases another company. And now, the acquired company has similar type of primary key logic for its customers. It clashes with your own primary keys! Oops, how do we integrate them?

Surrogate keys!! Use surrogate keys!! This will shield the DWH from operational changes.

2. Performance
The fact table will have a composite key. If surrogate keys are used, then in the fact table, we will have integers for keys.
In most cases, all we need is only 4 bytes for each surrogate key column. So, if you have 5 dimensions, all we need is 20 bytes!
If you have a 10 million rows, you only need 20 MB space for it.
If you had primary keys instead, you might have need several bytes for each key. Remember, in this case, each byte increases space by 1 MB. So, if you have a billion rows, it will increase size by 1 GB. For every single byte increase!!

Also, the queries will run faster when you join on integers rather than varchar. Using surrogate keys will enable faster querying.

And add to it, the option of being able to partition as per time_key or any other suitable key. Since it is a sequence, you can easily partition it.

3. History preservation
Consider this. Today, in July, I make a purchase. My current customer record will say

Prem 101 Bangalore Male

Next month, I move to Hyderabad. Make a purchase there.

Prem 101 Hyderabad Male

Now, you want to query for all purchases made by customers in Bangalore in July. How can I get it? My record says I am in Hyderabad! The way to handle this is by using SCD’s. So I need to use surrogate keys.

1 Prem 101 Bangalore Male record “Status_close”
2 Prem 101 Hyderabad Male record “Status_current”

This will enable you to preserve history. Note the additional column to indicate the current record.

4. To enable special situations that cant be done by OLTP primary keys
How can you have a OLTP primary key for a ‘not found’ record in a dimension? My fact table has a record that doesnt have a match yet in the dimension table. (Late arriving dimension.. assume) How do I handle it? I need to say it maps to a ‘not found’ record in the dimension table.

Surrogate key usage will enable such option!

No responses yet

May 31 2007

Semi-additive facts

Hi,
What are semi-additive facts?
Well, facts that can be aggregated across a few dimensions, but not all.

Take the case of an inventory fact table.

We track inventory every day for every product at every warehouse.

To get the inventory on a given day for all products, I aggregate the inventory fact measures across the product dimension filtered for that particular day. The following query will give the inventory for every product for today.

Select product_dim.name, sum(Inv_fact.inventory_qty)
from
product_dim, inv_fact, date_dim
where
date_dim.date_key= inv_fact.date_key and
date_dim.calendar_dt=sysdate and
product_dim.product_key = inv_fact.product_key
group by product_dim

However, I cannot add all the inventory quantities every day for a product, say for a month and get anything meaningful. Aggregation across time is meaningless. (of course, you can use it to calculate the average inventory per day)

Such facts that can be aggregated across certain dimensions but not across all dimensions are called Semi-additive or partial facts.

No responses yet