top of page
  • Writer's pictureTim Burns

Studying Snowflake - Using Stages for Data Discovery

The Author Chopping Wood on a Winters Day

Set up a Secure Storage Integration and Create Stages

The foundation of data discovery is a working pipeline to load the data into Snowflake, and the STAGE object is the foundation of the pipeline.

To query the data, I assume you have set up a storage integration. If you are using AWS, here is an article that describes how to set up a secure storage integration between S3 and Snowflake.

Identifying the Data Sources - The Stage Query using Variant

Multiple sources can be extracted into the storage integration with API endpoint data. A generic query on a stage defined across the API endpoint will allow you to view all the data extracted from the endpoint.

The variant query on the storage-level objects is a simple query to begin data analysis. On an existing system, start by looking at STAGE objects that exist.

--- Check to see the stages you can query
-- If none, then see this article:
show stages;

--- Show what is in the storage location

--- Narrow down to specific files you would like to view, for example Feb 7, 2023, International Clash Day
--- This shows logs, playlists and shows for international Clash Data
    pattern = '.*20230207.*';

--- View the structure of the data object using a variant structure
       metadata$filename                filename
from @owlmtn.stage.KEXP_PUBLIC (
         pattern =>'stage/kexp/logs/20230207/api20230207.*',
         file_format => stage.json_file_format) stg
order by metadata$filename desc;

The query will show the Variant (JSON) format of the data in the file.

  "airdate_after_date": "2023-02-06T09:27:22-0800",
  "airdate_before_date": "2023-02-07T09:27:22-0800",
  "number_songs": 392,
  "playlist_key": "stage/kexp/playlists/20230207092543/playlist20230207092543.json",
  "run_date_key": "20230207",
  "run_datetime_key": "20230207092722",
  "shows_key": "stage/kexp/shows/20230207092543/show20230207092543.json"

Using Directory Tables to Dig Deeper

Directory tables allow you to dig deeper to learn aspects of individual data sets within the storage container.

--- Specifying "directory" on a stage allows you to utilize STAGE_DIRECTORY_FILE_REGISTRATION_HISTORY
--- to do data discovery on a set of data.
create stage if not exists KEXP_SHOWS
  storage_integration = OWLMTN_S3_DATA
  directory = (
    enable = true
    auto_refresh = true

Here we see that the data set is 196 MB and contains 17242 files, a good starting point for filling out a worksheet on the data.

|196.041545     |17242       |

Data Ingestion and Stages - A Deeper Dive

Data Ingestion and stages merit deeper consideration. A set of documents relevant to stages are here.

10 views0 comments
bottom of page