Search
  • Tim Burns

It's Time to Change Dimensions

To everything there a season, and a time to every purpose, and let's keep track of those changes with a slowly changing dimension.


What is the slowly changing dimension? It is a table that changes slowly over time. Suppose we have the following set of US Senators loaded into our database.


https://theunitedstates.io/congress-legislators/legislators-current.csv


It's a perfect slowly changing dimension. According to our constitution Article I, Section 3, clause 2, one-third of these senators will be subject to change.


Immediately after [the Senate of the United States] shall be assembled in Consequence of the first Election, they shall be divided as equally as may be into three classes. The Seats of the Senators of the first Class shall be vacated at the Expiration of the second Year, of the second Class at the Expiration of the fourth Year, and of the third Class at the Expiration of the sixth Year, so that one third may be chosen every second Year

There are many ways to implement a slowly changing dimension. Keeping a simple pattern can simplify the process of deciding how you will change.


Determine the columns that identify a unique row and the columns you want to track.


For example, I want to have a unique senator for each state and class. State senators do not overlap by class so that only one in a state can be up for election in an election cycle. Right now, Georgia is an exception, but we won't go into that.


Say we have staged the raw data for the above CSV file into a table. To process the staged data into a table that captures the current senate at the current time, we insert rows into a dimension table. (I've abbreviated a number of columns.)


Note that I've identified the unique rows as my MATCH_GUID and the CHANGE_HASH as has values. I use the hash to minimize CPU processing time on SQL.


The MATCH_GUID matches the same senator to any changes in their contact information, party, and we insert new rows if there is no existing row or if some attribute of the senator has changed.


insert into OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT(CONGRESS_SEAT_GUID,
                                               LAST_NAME,
                                               FIRST_NAME,
                                               FULL_NAME,
                                               BIRTHDAY,
                                               GENDER,
                                               TYPE,
                                               STATE,
                                               DISTRICT,
                                               SENATE_CLASS,
                                               PARTY,
                                               ADDRESS,
                                               PHONE,
                                               CONTACT_FORM,
                                               DW_MATCH_HASH)
with source as (
    SELECT hash(STATE,
                SENATE_CLASS)    MATCH_GUID,
           LAST_NAME,
           FIRST_NAME,
           FULL_NAME,
           BIRTHDAY,
           GENDER,
           TYPE,
           STATE,
           DISTRICT,
           SENATE_CLASS,
           PARTY,
           ADDRESS,
           PHONE,
           CONTACT_FORM,

           hash(
                   LAST_NAME,
                   FIRST_NAME,
                   FULL_NAME,
                   BIRTHDAY,
                   GENDER,
                   TYPE,
                   DISTRICT,
                   PARTY,
                   ADDRESS,
                   PHONE,
                   CONTACT_FORM) CHANGE_HASH
    from OWLMTN.STAGE.CONGRESS_MEMBERS sor
    where type = 'sen'
),
     target as (
         select CONGRESS_SEAT_GUID,
                DW_MATCH_HASH match_hash
         from OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT tar
         where DW_ACTIVE = 'T'
     )
select source.MATCH_GUID,
       source.LAST_NAME,
       source.FIRST_NAME,
       source.FULL_NAME,
       source.BIRTHDAY,
       source.GENDER,
       source.TYPE,
       source.STATE,
       source.DISTRICT,
       source.SENATE_CLASS,
       source.PARTY,
       source.ADDRESS,
       source.PHONE,
       source.CONTACT_FORM,
       source.CHANGE_HASH
from source
         left outer join target
                         on target.CONGRESS_SEAT_GUID = source.MATCH_GUID
where target.match_hash is null
   or target.match_hash <> source.CHANGE_HASH;


The update query will find any new rows and mark them as active.



-- Arrange the dimensions on the SCD.
update OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT new_data
    set new_data.DW_ACTIVE = update_logic.DW_ACTIVE,
        new_data.DW_FROM_DATE = update_logic.new_from_dw_date,
        new_data.DW_TO_DATE = update_logic.new_to_dw_date,
        DW_UPDATE_DATE = current_timestamp()
from (

    with updated_gui as (
        select DIM_CONGRESS_SEAT_KEY, CONGRESS_SEAT_GUID
        from OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT
        where dw_from_date is null
        )
    select current_row.DIM_CONGRESS_SEAT_KEY,
           current_row.dw_active as old_active,
           case when current_row.DW_FROM_DATE is null
               then TRUE
               else FALSE end as DW_ACTIVE,

           current_row.DW_FROM_DATE,
            case when current_row.DW_FROM_DATE is null
               then CURRENT_TIMESTAMP()
               else current_row.DW_FROM_DATE end as new_from_dw_date,

           current_row.DW_TO_DATE,

           case when current_row.DW_FROM_DATE is null
                 then to_timestamp_ltz('2099-12-31 00:00:00')
               else CURRENT_TIMESTAMP() end as new_to_dw_date,
           current_row.DW_MATCH_HASH
        from updated_gui
        inner join OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT current_row
            on updated_gui.CONGRESS_SEAT_GUID = current_row.CONGRESS_SEAT_GUID
                   and (DW_FROM_DATE is NULL or current_row.DW_ACTIVE=TRUE)
        left outer join OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT old
            on current_row.CONGRESS_SEAT_GUID = old.CONGRESS_SEAT_GUID
                   and old.dw_ACTIVE

      ) update_logic
where new_data.DIM_CONGRESS_SEAT_KEY = update_logic.DIM_CONGRESS_SEAT_KEY;


So what happens after November 3 and there is a change in the seat status? Let's track the election from my favorite state, Colorado to hypothetically project a seat change.


According to the data projection on Azri.us, Hickenlooper has a 10 point advantage over Gardner, who currently holds the seat.


Viewing the hash comparison in the previous insert query, we see that the MATCH_GUID stays the same (same seat) but the CHANGE_HASH changes (new senator).


So after the insert detects a changed record, the new record is not active, and we need to activate that record in a transactional manner that sets the old record inactive as it activates the new method.

So then running the query that shifts the date on the currently active row will move the inactive to active.



In this way, we can track the evolution of an item described as "unique" in time.


So if I want to know who was the senator of class 2 for Colorado at a specific time, I can specify the time and it will never change.


Note that the DW_FROM_DATE is less than the date and the DW_TO_DATE is greater than or equal to the date. This will give a unique result at a single point in time.

select CONGRESS_SEAT_GUID, FULL_NAME, DW_MATCH_HASH, DW_ACTIVE, DW_FROM_DATE, DW_TO_DATE
from owlmtn.WAREHOUSE.DIM_CONGRESS_SEAT
where state = 'CO' and SENATE_CLASS = 2
  and DW_FROM_DATE < '2020-10-30 21:51:58.166000000 -07:00'
  and DW_TO_DATE >= '2020-10-30 21:51:58.166000000 -07:00';


2 views0 comments