To everything there a season, and a time to every purpose, and let's keep track of those changes with a slowly changing dimension.
What is the slowly changing dimension? It is a table that changes slowly over time. Suppose we have the following set of US Senators loaded into our database.
It's a perfect slowly changing dimension. According to our constitution Article I, Section 3, clause 2, one-third of these senators will be subject to change.
Immediately after [the Senate of the United States] shall be assembled in Consequence of the first Election, they shall be divided as equally as may be into three classes. The Seats of the Senators of the first Class shall be vacated at the Expiration of the second Year, of the second Class at the Expiration of the fourth Year, and of the third Class at the Expiration of the sixth Year, so that one third may be chosen every second Year
There are many ways to implement a slowly changing dimension. Keeping a simple pattern can simplify the process of deciding how you will change.
Determine the columns that identify a unique row and the columns you want to track.
For example, I want to have a unique senator for each state and class. State senators do not overlap by class so that only one in a state can be up for election in an election cycle. Right now, Georgia is an exception, but we won't go into that.
Say we have staged the raw data for the above CSV file into a table. To process the staged data into a table that captures the current senate at the current time, we insert rows into a dimension table. (I've abbreviated a number of columns.)
Note that I've identified the unique rows as my MATCH_GUID and the CHANGE_HASH as has values. I use the hash to minimize CPU processing time on SQL.
The MATCH_GUID matches the same senator to any changes in their contact information, party, and we insert new rows if there is no existing row or if some attribute of the senator has changed.
insert into OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT(CONGRESS_SEAT_GUID,
LAST_NAME,
FIRST_NAME,
FULL_NAME,
BIRTHDAY,
GENDER,
TYPE,
STATE,
DISTRICT,
SENATE_CLASS,
PARTY,
ADDRESS,
PHONE,
CONTACT_FORM,
DW_MATCH_HASH)
with source as (
SELECT hash(STATE,
SENATE_CLASS) MATCH_GUID,
LAST_NAME,
FIRST_NAME,
FULL_NAME,
BIRTHDAY,
GENDER,
TYPE,
STATE,
DISTRICT,
SENATE_CLASS,
PARTY,
ADDRESS,
PHONE,
CONTACT_FORM,
hash(
LAST_NAME,
FIRST_NAME,
FULL_NAME,
BIRTHDAY,
GENDER,
TYPE,
DISTRICT,
PARTY,
ADDRESS,
PHONE,
CONTACT_FORM) CHANGE_HASH
from OWLMTN.STAGE.CONGRESS_MEMBERS sor
where type = 'sen'
),
target as (
select CONGRESS_SEAT_GUID,
DW_MATCH_HASH match_hash
from OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT tar
where DW_ACTIVE = 'T'
)
select source.MATCH_GUID,
source.LAST_NAME,
source.FIRST_NAME,
source.FULL_NAME,
source.BIRTHDAY,
source.GENDER,
source.TYPE,
source.STATE,
source.DISTRICT,
source.SENATE_CLASS,
source.PARTY,
source.ADDRESS,
source.PHONE,
source.CONTACT_FORM,
source.CHANGE_HASH
from source
left outer join target
on target.CONGRESS_SEAT_GUID = source.MATCH_GUID
where target.match_hash is null
or target.match_hash <> source.CHANGE_HASH;
The update query will find any new rows and mark them as active.
-- Arrange the dimensions on the SCD.
update OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT new_data
set new_data.DW_ACTIVE = update_logic.DW_ACTIVE,
new_data.DW_FROM_DATE = update_logic.new_from_dw_date,
new_data.DW_TO_DATE = update_logic.new_to_dw_date,
DW_UPDATE_DATE = current_timestamp()
from (
with updated_gui as (
select DIM_CONGRESS_SEAT_KEY, CONGRESS_SEAT_GUID
from OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT
where dw_from_date is null
)
select current_row.DIM_CONGRESS_SEAT_KEY,
current_row.dw_active as old_active,
case when current_row.DW_FROM_DATE is null
then TRUE
else FALSE end as DW_ACTIVE,
current_row.DW_FROM_DATE,
case when current_row.DW_FROM_DATE is null
then CURRENT_TIMESTAMP()
else current_row.DW_FROM_DATE end as new_from_dw_date,
current_row.DW_TO_DATE,
case when current_row.DW_FROM_DATE is null
then to_timestamp_ltz('2099-12-31 00:00:00')
else CURRENT_TIMESTAMP() end as new_to_dw_date,
current_row.DW_MATCH_HASH
from updated_gui
inner join OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT current_row
on updated_gui.CONGRESS_SEAT_GUID = current_row.CONGRESS_SEAT_GUID
and (DW_FROM_DATE is NULL or current_row.DW_ACTIVE=TRUE)
left outer join OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT old
on current_row.CONGRESS_SEAT_GUID = old.CONGRESS_SEAT_GUID
and old.dw_ACTIVE
) update_logic
where new_data.DIM_CONGRESS_SEAT_KEY = update_logic.DIM_CONGRESS_SEAT_KEY;
So what happens after November 3 and there is a change in the seat status? Let's track the election from my favorite state, Colorado to hypothetically project a seat change.
According to the data projection on Azri.us, Hickenlooper has a 10 point advantage over Gardner, who currently holds the seat.
Viewing the hash comparison in the previous insert query, we see that the MATCH_GUID stays the same (same seat) but the CHANGE_HASH changes (new senator).
So after the insert detects a changed record, the new record is not active, and we need to activate that record in a transactional manner that sets the old record inactive as it activates the new method.
So then running the query that shifts the date on the currently active row will move the inactive to active.
In this way, we can track the evolution of an item described as "unique" in time.
So if I want to know who was the senator of class 2 for Colorado at a specific time, I can specify the time and it will never change.
Note that the DW_FROM_DATE is less than the date and the DW_TO_DATE is greater than or equal to the date. This will give a unique result at a single point in time.
select CONGRESS_SEAT_GUID, FULL_NAME, DW_MATCH_HASH, DW_ACTIVE, DW_FROM_DATE, DW_TO_DATE
from owlmtn.WAREHOUSE.DIM_CONGRESS_SEAT
where state = 'CO' and SENATE_CLASS = 2
and DW_FROM_DATE < '2020-10-30 21:51:58.166000000 -07:00'
and DW_TO_DATE >= '2020-10-30 21:51:58.166000000 -07:00';
Comments