The Author Chopping Wood on a Winters Day
Set up a Secure Storage Integration and Create Stages
The foundation of data discovery is a working pipeline to load the data into Snowflake, and the STAGE object is the foundation of the pipeline.
To query the data, I assume you have set up a storage integration. If you are using AWS, here is an article that describes how to set up a secure storage integration between S3 and Snowflake.
Identifying the Data Sources - The Stage Query using Variant
Multiple sources can be extracted into the storage integration with API endpoint data. A generic query on a stage defined across the API endpoint will allow you to view all the data extracted from the endpoint.
The variant query on the storage-level objects is a simple query to begin data analysis. On an existing system, start by looking at STAGE objects that exist.
--- Check to see the stages you can query
-- If none, then see this article: https://medium.com/snowflake/securing-the-snowflake-storage-integration-on-aws-21046672f1a8
show stages;
--- Show what is in the storage location
list @STAGE.KEXP_PUBLIC;
--- Narrow down to specific files you would like to view, for example Feb 7, 2023, International Clash Day
--- This shows logs, playlists and shows for international Clash Data
list @STAGE.KEXP_PUBLIC
pattern = '.*20230207.*';
--- View the structure of the data object using a variant structure
select
$1,
metadata$filename filename
from @owlmtn.stage.KEXP_PUBLIC (
pattern =>'stage/kexp/logs/20230207/api20230207.*',
file_format => stage.json_file_format) stg
order by metadata$filename desc;
The query will show the Variant (JSON) format of the data in the file.
{
"airdate_after_date": "2023-02-06T09:27:22-0800",
"airdate_before_date": "2023-02-07T09:27:22-0800",
"number_songs": 392,
"playlist_key": "stage/kexp/playlists/20230207092543/playlist20230207092543.json",
"run_date_key": "20230207",
"run_datetime_key": "20230207092722",
"shows_key": "stage/kexp/shows/20230207092543/show20230207092543.json"
}
Using Directory Tables to Dig Deeper
Directory tables allow you to dig deeper to learn aspects of individual data sets within the storage container.
--- Specifying "directory" on a stage allows you to utilize STAGE_DIRECTORY_FILE_REGISTRATION_HISTORY
--- to do data discovery on a set of data.
create stage if not exists KEXP_SHOWS
url='s3://owlmtn-stage-data/stage/kexp/shows/'
storage_integration = OWLMTN_S3_DATA
directory = (
enable = true
auto_refresh = true
);
Here we see that the data set is 196 MB and contains 17242 files, a good starting point for filling out a worksheet on the data.
+---------------+------------+
|DATASET_SIZE_MB|NUMBER_FILES|
+---------------+------------+
|196.041545 |17242 |
+---------------+------------+
Data Ingestion and Stages - A Deeper Dive
Data Ingestion and stages merit deeper consideration. A set of documents relevant to stages are here.
Comments