Studying Snowflake - Using Stages for Data Discovery

Tim Burns
Feb 12, 2023
2 min read

The Author Chopping Wood on a Winters Day

Set up a Secure Storage Integration and Create Stages

The foundation of data discovery is a working pipeline to load the data into Snowflake, and the STAGE object is the foundation of the pipeline.

To query the data, I assume you have set up a storage integration. If you are using AWS, here is an article that describes how to set up a secure storage integration between S3 and Snowflake.

Securing the Snowflake Storage Integration on AWS

Identifying the Data Sources - The Stage Query using Variant

Multiple sources can be extracted into the storage integration with API endpoint data. A generic query on a stage defined across the API endpoint will allow you to view all the data extracted from the endpoint.

The variant query on the storage-level objects is a simple query to begin data analysis. On an existing system, start by looking at STAGE objects that exist.

--- Check to see the stages you can query
-- If none, then see this article: https://medium.com/snowflake/securing-the-snowflake-storage-integration-on-aws-21046672f1a8
show stages;

--- Show what is in the storage location
list @STAGE.KEXP_PUBLIC;

--- Narrow down to specific files you would like to view, for example Feb 7, 2023, International Clash Day
--- This shows logs, playlists and shows for international Clash Data
list @STAGE.KEXP_PUBLIC
    pattern = '.*20230207.*';



--- View the structure of the data object using a variant structure
select
       $1,
       metadata$filename                filename
from @owlmtn.stage.KEXP_PUBLIC (
         pattern =>'stage/kexp/logs/20230207/api20230207.*',
         file_format => stage.json_file_format) stg
order by metadata$filename desc;

The query will show the Variant (JSON) format of the data in the file.

{
  "airdate_after_date": "2023-02-06T09:27:22-0800",
  "airdate_before_date": "2023-02-07T09:27:22-0800",
  "number_songs": 392,
  "playlist_key": "stage/kexp/playlists/20230207092543/playlist20230207092543.json",
  "run_date_key": "20230207",
  "run_datetime_key": "20230207092722",
  "shows_key": "stage/kexp/shows/20230207092543/show20230207092543.json"
}

Using Directory Tables to Dig Deeper

Directory tables allow you to dig deeper to learn aspects of individual data sets within the storage container.

Snowflake Guide to Directory Tables

--- Specifying "directory" on a stage allows you to utilize STAGE_DIRECTORY_FILE_REGISTRATION_HISTORY
--- to do data discovery on a set of data.
create stage if not exists KEXP_SHOWS
  url='s3://owlmtn-stage-data/stage/kexp/shows/'
  storage_integration = OWLMTN_S3_DATA
  directory = (
    enable = true
    auto_refresh = true
  );

Here we see that the data set is 196 MB and contains 17242 files, a good starting point for filling out a worksheet on the data.

+---------------+------------+
|DATASET_SIZE_MB|NUMBER_FILES|
+---------------+------------+
|196.041545     |17242       |
+---------------+------------+

Data Ingestion and Stages - A Deeper Dive

Data Ingestion and stages merit deeper consideration. A set of documents relevant to stages are here.

Studying Snowflake - Using Stages for Data Discovery

Set up a Secure Storage Integration and Create Stages

Identifying the Data Sources - The Stage Query using Variant

Using Directory Tables to Dig Deeper

Data Ingestion and Stages - A Deeper Dive

Recent Posts

Comments