Tim Burns

Snowflake Clustering and Micro-Partitions


The Author Hiking in the Rainy White Mountains


Snowflake Clustering and Micro-Partitions can improve performance and reduce costs in your Data Warehouse. In this blog post, I explore the clustering and micro-partition statistics of my KEXP Snowflake data to see how Snowflake's defaults configured my simple data warehouse.


First, I query for tables in the warehouse that have a clustering key.

SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, CLUSTERING_KEY
  FROM INFORMATION_SCHEMA.TABLES
 WHERE CLUSTERING_KEY IS NOT NULL;

The result set is empty, so I need to configure a clustering key for my main table, FACT_KEXP_PLAYLIST.


What are some candidates for the clustering key? I want to choose a clustering key that minimizes the number of micro-partitions scanned when performing my central table join. The view DDL for the analysis export points to the logical choice for a clustering key.

create or replace view WAREHOUSE.VIEW_KEXP_PLAYLIST_SHOW
as
with show as (
    select show_id,
                  ...
           max(DW_UPDATE_DATE) LAST_UPDATED
    from WAREHOUSE.DIM_KEXP_SHOW
    group by show_id,
                  ...
             start_time
)
select PLAYLIST_ID,
       show.SHOW_ID,
         ...
       max(DW_UPDATE_DATE)                     playlist_last_updated
from WAREHOUSE.FACT_KEXP_PLAYLIST plays
         left outer join show
                         on plays.SHOW_ID = show.SHOW_ID
group by PLAYLIST_ID,
         show.SHOW_ID,
         ...
order by PLAYLIST_ID desc;

Here are the two suggestions Snowflake gives for selecting a clustering key:

  1. First, cluster on the most actively used columns. Often these are date columns used in filters such as paid_date between date_1 and date_2.

  2. If there is room for additional cluster keys, consider columns frequently used in join predicates.

In the KEXP Data Warehouse, SHOW_ID is the unique ID for the show a DJ plays, and each show contains a set of songs. It satisfies both criteria well and is an excellent selection.

alter table warehouse.FACT_KEXP_PLAYLIST
cluster by (SHOW_ID);

Why not include AIRDATE? AIRDATE is effectively a unique key, which would create unnecessary overhead for the clustering. Per the Snowflake documentation:

  1. Maintaining clustering on a column is more expensive if that column (or expression) has higher cardinality.
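As a quick sanity check on that advice, a query along the following lines compares the cardinality of the two candidate columns. This is only a sketch; it assumes both AIRDATE and SHOW_ID are columns on FACT_KEXP_PLAYLIST.

-- Hypothetical cardinality check for candidate clustering keys.
-- A much higher distinct count for AIRDATE would make it a poor key.
select approx_count_distinct(SHOW_ID) as show_id_cardinality,
       approx_count_distinct(AIRDATE) as airdate_cardinality,
       count(*)                       as total_rows
  from warehouse.FACT_KEXP_PLAYLIST;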

Now, when I rerun the query for clustered tables, the result shows the new clustering key.

+-------------+------------+------------------+---------------+
|TABLE_CATALOG|TABLE_SCHEMA|TABLE_NAME        |CLUSTERING_KEY |
+-------------+------------+------------------+---------------+
|OWLMTN       |WAREHOUSE   |FACT_KEXP_PLAYLIST|LINEAR(SHOW_ID)|
+-------------+------------+------------------+---------------+

Clustering depth is the average depth of the overlapping micro-partitions for a table. It is always at least one for a populated table, and smaller is better.

select system$clustering_depth('WAREHOUSE.FACT_KEXP_PLAYLIST');

2

The clustering information query gives a detailed JSON output of the clustering data.


with sys_cluster_info as
         (
             SELECT system$clustering_information('WAREHOUSE.FACT_KEXP_PLAYLIST')
         )
select PARSE_JSON($1):cluster_by_keys                as cluster_by_keys,
       PARSE_JSON($1):total_partition_count          as total_partition_count,
       PARSE_JSON($1):total_constant_partition_count as total_constant_partition_count,
       PARSE_JSON($1):average_overlaps               as average_overlaps,
       PARSE_JSON($1):average_depth                  as average_depth,
       PARSE_JSON($1):partition_depth_histogram      as partition_depth_histogram
from sys_cluster_info;


+-----------------+---------------------+------------------------------+-------------+----------------+
|CLUSTER_BY_KEYS  |TOTAL_PARTITION_COUNT|TOTAL_CONSTANT_PARTITION_COUNT|AVERAGE_DEPTH|AVERAGE_OVERLAPS|
+-----------------+---------------------+------------------------------+-------------+----------------+
|"LINEAR(SHOW_ID)"|13                   |4                             |4.3077       |4.4615          |
+-----------------+---------------------+------------------------------+-------------+----------------+
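SYSTEM$CLUSTERING_INFORMATION also accepts an explicit column list, so you can see how a candidate key would behave without altering the table. A minimal sketch, assuming AIRDATE exists on the fact table:

-- Report clustering statistics as if AIRDATE were the clustering key
-- (hypothetical comparison; the table itself is not changed).
select system$clustering_information('WAREHOUSE.FACT_KEXP_PLAYLIST', '(AIRDATE)');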




Running the Data Warehouse Pipeline refreshes the existing data, and then we can query. See my source for the KEXP Snowflake Pipeline. The pipeline is a simple AWS/Snowflake integration that runs on the command line with a Makefile using SnowSQL.


The pipeline performs the following sequence:

  1. Download the latest playlist data from KEXP

  2. Push the local data to an AWS S3 Bucket

  3. Use SnowSQL to COPY the AWS data into Snowflake

  4. Use SnowSQL to COPY the Snowflake data out into an export folder (a sketch of the COPY statements for steps 3 and 4 follows this list)

  5. Copy the AWS data locally and use sed to concatenate the data into a CSV file
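The COPY steps might look roughly like the following. This is only a sketch: the stage name @KEXP_STAGE, the staging table, the paths, and the file formats are assumptions, not the actual pipeline code.

-- Step 3 (assumed stage and table names): load staged S3 data into Snowflake.
copy into warehouse.STAGE_KEXP_PLAYLIST
  from @KEXP_STAGE/playlist/
  file_format = (type = 'JSON');

-- Step 4 (assumed names): unload the view results to an export area on the stage.
copy into @KEXP_STAGE/export/playlist_show
  from (select * from WAREHOUSE.VIEW_KEXP_PLAYLIST_SHOW)
  file_format = (type = 'CSV' field_optionally_enclosed_by = '"')
  header = true;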

After running the pipeline, we should incur some clustering overhead on Snowflake, reported under the AUTOMATIC_CLUSTERING warehouse entry.
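One way to check that overhead is the AUTOMATIC_CLUSTERING_HISTORY table function in the information schema. A sketch for the fact table over the last seven days:

-- Credits used and rows reclustered by automatic clustering on the fact table.
select start_time,
       end_time,
       credits_used,
       num_rows_reclustered
  from table(information_schema.automatic_clustering_history(
         date_range_start => dateadd('day', -7, current_timestamp()),
         table_name       => 'OWLMTN.WAREHOUSE.FACT_KEXP_PLAYLIST'));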



