Snowflake Data Ingestion Design Choices
Updated: Oct 1, 2021
As I was playing around with the KEXP Playlist data, I realized that we have a design choice to make when running the COPY statement for ingestion. The data starts in our S3 stage bucket, and the goal is a shaped FACT table we can use to build analytics in a BI tool.
Exploring the data by querying with SELECT directly on the STAGE works well.
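For example, an exploratory query directly against the stage might look like the following (a sketch, not the exact query from the post; the stage name and path are taken from the COPY statement later in this article, and `my_json_format` is an assumed named file format):

```sql
-- Query the staged JSON directly; $1 refers to the raw record in the file.
-- Stage name reused from the COPY example; file format name is an assumption.
select stg.$1:results[0]:artist::string as first_artist,
       stg.$1:results[0]:song::string   as first_song
from @owlmtn.stage.AZRIUS_STAGE_TEST (
         file_format => 'my_json_format',
         pattern     => 'stage/kexp/.*') stg
limit 10;
```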
However, when we try to reuse that same select statement to copy the data into the FACT_KEXP_PLAYLIST table, we get an error.
[0A000] SQL compilation error: COPY statement only supports simple SELECT from stage statements for import.
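The kind of statement that triggers this error looks something like the sketch below (a reconstruction, not the author's exact SQL): COPY transformations only allow simple column expressions on the staged file, so FLATTEN, joins, and subqueries are rejected.

```sql
-- This fails with the 0A000 error above: COPY's SELECT cannot use
-- FLATTEN (or any join) -- only simple projections from the stage.
copy into warehouse.fact_kexp_playlist (play_type, airdate, album, artist, song)
from (
    select tab.value:play_type::string,
           tab.value:airdate::timestamp_ltz,
           tab.value:album::string,
           tab.value:artist::string,
           tab.value:song::string
    from @owlmtn.stage.AZRIUS_STAGE_TEST stg,
         table (flatten(stg.$1:results)) tab
)
file_format = (type = json);
```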
The most straightforward way to transform the data is to load it into a VARIANT column in Snowflake. It is a flexible, reusable design pattern that works well for ingesting API data; the main limitation is the 16 MB compressed-size limit on VARIANT contents. Since API endpoints generally return at most 10,000 records per call, exceeding that limit is unlikely.
A Simple Data Architecture Pattern to Ingest Fact Data
The COPY statement will now load the API export data verbatim into a VARIANT column.
copy into stage.raw_kexp_playlist (filename, file_row_number, value)
from (
    select metadata$filename,
           metadata$file_row_number,
           stg.$1
    from @owlmtn.stage.AZRIUS_STAGE_TEST stg
)
pattern = 'stage/kexp/.*'
file_format = (type = json);
The TRANSFORM will slice the JSON data up into columns.
insert into WAREHOUSE.fact_kexp_playlist (LOAD_ID, PLAY_TYPE, AIRDATE, ALBUM, ARTIST, SONG)
select stg.load_id,
       tab.value:play_type::String,
       tab.value:airdate::TIMESTAMP_LTZ,
       tab.value:album::String,
       tab.value:artist::String,
       tab.value:song::String
from stage.raw_kexp_playlist stg,
     table (flatten(stg.value:results)) tab
where stg.load_id > (select coalesce(max(load_id), 0)
                     from WAREHOUSE.FACT_KEXP_PLAYLIST);
The DDL scripts are here for reference.
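As a rough guide, the two tables could be sketched as follows (this is an assumption about the shape of the DDL, not the author's exact scripts; the IDENTITY columns and DW_CREATE_DATE default are illustrative):

```sql
-- Raw landing table: one VARIANT row per staged file record.
create table if not exists stage.raw_kexp_playlist (
    load_id         number identity,  -- assumed monotonic load key
    filename        varchar,
    file_row_number number,
    value           variant,
    dw_create_date  timestamp_ltz default current_timestamp()
);

-- Shaped fact table for BI queries.
create table if not exists warehouse.fact_kexp_playlist (
    playlist_id    number identity,
    load_id        number,
    play_type      varchar,
    airdate        timestamp_ltz,
    album          varchar,
    artist         varchar,
    song           varchar,
    dw_create_date timestamp_ltz default current_timestamp()
);
```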
Loading API output like the JSON from the KEXP Playlist export is an extremely common data warehousing task. Although this is a simple example, any serious project should strive for the same simplicity. SQL transformations are powerful tools in a software engineer's toolbox for processing large amounts of data.
Here I've shown a technique that can process huge amounts of data using the Snowflake COPY command and a transform within Snowflake.
I've also included some standard default-populated columns like DW_CREATE_DATE, LOAD_ID, and PLAYLIST_ID. Using standard defaults from the start lets you build maintainable, agile data warehouses that evolve with new requirements.