Archive for the 'DWH concepts' Category

Apr 30 2008

Handling many to many relationships (corrections made)

(There were issues with the design in my original post. Those have been corrected. I have let the errors stay and have highlighted where I have made corrections so that you can understand better)

Consider this situation.

2 new guys Ravi and Raj go to our bank and open new savings accounts 101 and 102. And Ravi deposits 100 and Raj deposits 144. This can be represented by the following.

CLICK ON THE IMAGES BELOW TO SEE A LARGER PICTURE

Now suddenly, Ramu and Ramya come to the bank and open a joint account 103. And they deposit 1000. How will you represent this? What will you enter for CUST_KEY ? Will you enter 3 or 4? It was the same transaction of Rs 1000. But which customer to indicate? What will you fill in the red ? position below?

Continue Reading »

No responses yet

Jul 04 2007

Why surrogate key?

Hi,
Why should we use surrogate keys in a data warehouse?

Here is why.

1. To separate the DWH from the operational environment
Think of this. You have a dimension loading from transactional tables. All of a sudden, the OLTP people decide to re-use all closed / inactive account id’s. For them, it wont matter a bit. For the DWH team, it will.
The incoming record will have an account id that is already there in DWH. It will overwrite the old record. This is unacceptable.

Consider another case where you have a customer dimension loaded from OLTP tables. And your company purchases another company. And now, the acquired company has similar type of primary key logic for its customers. It clashes with your own primary keys! Oops, how do we integrate them?

Surrogate keys!! Use surrogate keys!! This will shield the DWH from operational changes.

2. Performance
The fact table will have a composite key. If surrogate keys are used, then in the fact table, we will have integers for keys.
In most cases, all we need is only 4 bytes for each surrogate key column. So, if you have 5 dimensions, all we need is 20 bytes!
If you have a 10 million rows, you only need 20 MB space for it.
If you had primary keys instead, you might have need several bytes for each key. Remember, in this case, each byte increases space by 1 MB. So, if you have a billion rows, it will increase size by 1 GB. For every single byte increase!!

Also, the queries will run faster when you join on integers rather than varchar. Using surrogate keys will enable faster querying.

And add to it, the option of being able to partition as per time_key or any other suitable key. Since it is a sequence, you can easily partition it.

3. History preservation
Consider this. Today, in July, I make a purchase. My current customer record will say

Prem 101 Bangalore Male

Next month, I move to Hyderabad. Make a purchase there.

Prem 101 Hyderabad Male

Now, you want to query for all purchases made by customers in Bangalore in July. How can I get it? My record says I am in Hyderabad! The way to handle this is by using SCD’s. So I need to use surrogate keys.

1 Prem 101 Bangalore Male record “Status_close”
2 Prem 101 Hyderabad Male record “Status_current”

This will enable you to preserve history. Note the additional column to indicate the current record.

4. To enable special situations that cant be done by OLTP primary keys
How can you have a OLTP primary key for a ‘not found’ record in a dimension? My fact table has a record that doesnt have a match yet in the dimension table. (Late arriving dimension.. assume) How do I handle it? I need to say it maps to a ‘not found’ record in the dimension table.

Surrogate key usage will enable such option!

No responses yet

May 31 2007

Semi-additive facts

Hi,
What are semi-additive facts?
Well, facts that can be aggregated across a few dimensions, but not all.

Take the case of an inventory fact table.

We track inventory every day for every product at every warehouse.

To get the inventory on a given day for all products, I aggregate the inventory fact measures across the product dimension filtered for that particular day. The following query will give the inventory for every product for today.

Select product_dim.name, sum(Inv_fact.inventory_qty)
from
product_dim, inv_fact, date_dim
where
date_dim.date_key= inv_fact.date_key and
date_dim.calendar_dt=sysdate and
product_dim.product_key = inv_fact.product_key
group by product_dim

However, I cannot add all the inventory quantities every day for a product, say for a month and get anything meaningful. Aggregation across time is meaningless. (of course, you can use it to calculate the average inventory per day)

Such facts that can be aggregated across certain dimensions but not across all dimensions are called Semi-additive or partial facts.

No responses yet

May 25 2007

Rapidly changing dimensions

Published by BI Explorer under DWH concepts, ETL

Dimensions can be loaded in different ways. Slowly changing dimension loading is one. Even in this there are
1. Type 1 - any changes in existing data is updated. Used for correction handling.

2. Type 2 - change is not updated, but a new record is inserted with change and the old data is kept as such and flagged as old version. For example, Prem is a record in Customer dimension. Today Prem lives in Hyderabad. So the current record in customer dim says
#1 — Prem — Hyderabad — Male

Then I move to Bangalore. Now how do I update this? So I say

# no — Name — Place — Sex — Current version
#1 — Prem — Hyd — M — No
#2 — Prem — Blr — M — Yes

Why should I have the old record?
There might be a case where you want to know all the sales that was done by customers in Hyderabad last year. And Prem had a purchase in Hyderabad last year. I want this to be tracked. I dont want the sale to point to Prem living in Bangalore, but in Hyderabad.

This way, if Prem moves 10 times, we will have 10 records in customer each pointing to a different Prem as per his location. This will keep the data accurate.

3. Type 3
Here instead of having all the history, we keep selected history data. Like 3 versions. The last 3 location moves of Prem will be tracked, not more!

Okay, what are rapidly changing dimensions?
Any dimension whose attributes change too rapidly. For example, credit rating of Prem. Assume that your rating agency rates people every 3 months (assume for example purpose), then every 3 months Prem can have a possibly different rating!

# no — Name — Place — Sex — Cr rate –Current version
#1 — Prem — Hyd — M — AAA –No
#2 — Prem — Blr — M — AA –No
#2 — Prem — Blr — M — A –No
#2 — Prem — Blr — M — B –Yes

So if you have around 3000000 customer records and if they keep changing every quarter as they do here, customer dimension will grow exponentially. That defeats the purpose of a warehouse, effective querying.

What do we do?
Remove the rapidly changing attributes and make a new dimension out of it. Pull the Credit rating and make a dimension out of it and call it credit rating behaviour dimension.

So the customer will not grow exponentially. The rating dimension will. But this will be used only by queries that need rating info. Any query that doesnt need rating info and only other customer info will go to the original customer dimension.

No responses yet

May 25 2007

What is a data warehouse?

Published by BI Explorer under DWH concepts, ETL, OLAP

What is a datawarehouse?

In one line, it is a repository of several years of data using which reports can be created for business purposes (to take decisions).

It is a database specifically modeled and fine-tuned for analysis and decision making.

Continue Reading »

One response so far