Hi,
Why should we use surrogate keys in a data warehouse?
Here is why.
1. To separate the DWH from the operational environment
Think of this. You have a dimension loading from transactional tables. All of a sudden, the OLTP people decide to re-use all closed / inactive account id’s. For them, it wont matter a bit. For the DWH team, it will.
The incoming record will have an account id that is already there in DWH. It will overwrite the old record. This is unacceptable.
Consider another case where you have a customer dimension loaded from OLTP tables. And your company purchases another company. And now, the acquired company has similar type of primary key logic for its customers. It clashes with your own primary keys! Oops, how do we integrate them?
Surrogate keys!! Use surrogate keys!! This will shield the DWH from operational changes.
2. Performance
The fact table will have a composite key. If surrogate keys are used, then in the fact table, we will have integers for keys.
In most cases, all we need is only 4 bytes for each surrogate key column. So, if you have 5 dimensions, all we need is 20 bytes!
If you have a 10 million rows, you only need 20 MB space for it.
If you had primary keys instead, you might have need several bytes for each key. Remember, in this case, each byte increases space by 1 MB. So, if you have a billion rows, it will increase size by 1 GB. For every single byte increase!!
Also, the queries will run faster when you join on integers rather than varchar. Using surrogate keys will enable faster querying.
And add to it, the option of being able to partition as per time_key or any other suitable key. Since it is a sequence, you can easily partition it.
3. History preservation
Consider this. Today, in July, I make a purchase. My current customer record will say
Prem 101 Bangalore Male
Next month, I move to Hyderabad. Make a purchase there.
Prem 101 Hyderabad Male
Now, you want to query for all purchases made by customers in Bangalore in July. How can I get it? My record says I am in Hyderabad! The way to handle this is by using SCD’s. So I need to use surrogate keys.
1 Prem 101 Bangalore Male record “Status_close”
2 Prem 101 Hyderabad Male record “Status_current”
This will enable you to preserve history. Note the additional column to indicate the current record.
4. To enable special situations that cant be done by OLTP primary keys
How can you have a OLTP primary key for a ‘not found’ record in a dimension? My fact table has a record that doesnt have a match yet in the dimension table. (Late arriving dimension.. assume) How do I handle it? I need to say it maps to a ‘not found’ record in the dimension table.
Surrogate key usage will enable such option!