Integrating Snowflake with Glue

Tim Burns
Apr 7, 2021
5 min read

Snowflake is an excellent modern database that couples the power of traditional SQL with modern data lake architecture. AWS Glue is a native ETL environment built into the AWS serverless ecosystem. Together they make a powerful combination for building a modern data lake.

In this article, I will address the fundamentals of creating a job in Glue to automatically load a Snowflake table using the powerful COPY command.

Integrate your Snowflake Credentials into Secrets Manager

AWS provides a utility called Secrets Manager to store passwords and it has a number of features, including automated password rotation that make it very attractive for secure storage.

The best way to utilize Secrets Manager is to store your Snowflake credentials using Multifactor Authentication.

Open up Secrets Manager and add the credentials of your Snowflake user. Fill in all the required fields in the manager and store the secret.

Fill in the Snowflake Connection Information

Record the secrets ID and add it to your AWS::IAM::Role specification in a Cloud Formation file. Use environment variables to store the secret name. Don't worry, we will use AWS roles to restrict secret access to the Glue job explicitly.

- PolicyName: "AllowSecretsManager"
  PolicyDocument:
    Version: "2012-10-17"
    Statement:
      - Effect: "Allow"
        Action: [
            "secretsmanager:GetSecretValue",
            "secretsmanager:DescribeSecret"
        ]
        Resource: [
            !Sub "arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:${SecretName}*"

The policy will allow your Glue Job to connect to Snowflake to perform operations.

Snowflake Requirements

Create a Stage

The COPY requires STAGE connected to the S3 bucket in AWS. Follow the Snowflake method to create a stage. The best practice is to create a storage integration to your S3 bucket and then create multiple stages on the storage integration for application or customer. You can use SNOWSQL with variable substitution to automate.

!set variable_substitution=true;
create or replace stage &{database}.stage.OLYMPICS
storage_integration = &{storage_integration}
url = '&{S3DataHome}/stage/olympics/';

Create a Table in Snowflake to Contain the Data

We will create a table to contain the public data set containing athlete results for 120 years of Olympics events. The table below has the columns of the data set as well as standard columns that every data warehouse table should contain (highlighted in bold).

CREATE OR REPLACE TABLE STAGE.OLYMPICS_ATHELETE_EVENT
(
    FILE_LOAD_ID     INTEGER identity (1,1),
    FILE_NAME         VARCHAR,
    FILE_ROW_NUMBER         INTEGER,
    ID             VARCHAR(100) NOT NULL,
    NAME           VARCHAR(100) NOT NULL,
    SEX            VARCHAR(100) NOT NULL,
    AGE            VARCHAR(100) NOT NULL,
    HEIGHT         VARCHAR(100) NOT NULL,
    WEIGHT         VARCHAR(100) NOT NULL,
    TEAM           VARCHAR(100) NOT NULL,
    NOC            VARCHAR(100) NOT NULL,
    GAMES          VARCHAR(100) NOT NULL,
    YEAR           VARCHAR(100) NOT NULL,
    SEASON         VARCHAR(100) NOT NULL,
    CITY           VARCHAR(100) NOT NULL,
    SPORT          VARCHAR(100) NOT NULL,
    EVENT          VARCHAR(100) NOT NULL,
    MEDAL          VARCHAR(100) NOT NULL,
    DW_CREATE_DATE TIMESTAMPTZ           DEFAULT CURRENT_TIMESTAMP(),
    DW_CREATE_USER VARCHAR      NOT NULL DEFAULT CURRENT_USER(),
    DW_UPDATE_DATE TIMESTAMPTZ           DEFAULT CURRENT_TIMESTAMP(),
    DW_UPDATE_USER VARCHAR      NOT NULL DEFAULT CURRENT_USER()
);

Create a COPY Statement to Load Data from S3

The COPY statement in Snowflake is a powerful method to import and export data from the data warehouse. It will load any new data files and will ignore files previously loaded. Utilizing COPY is a simple and effective means of managing your warehouse.

copy into STAGE.OLYMPICS_ATHELETE_EVENT (FILE_NAME,
                                         FILE_ROW_NUMBER,
                                         ID,
                                         NAME,
                                         SEX,
                                         AGE,
                                         HEIGHT,
                                         WEIGHT,
                                         TEAM,
                                         NOC,
                                         GAMES,
                                         YEAR,
                                         SEASON,
                                         CITY,
                                         SPORT,
                                         EVENT,
                                         MEDAL)
    from (
        select METADATA$FILENAME        file_name,
               METADATA$FILE_ROW_NUMBER row_number,
               t.$1,
               t.$2,
               t.$3,
               t.$4,
               t.$5,
               t.$6,
               t.$7,
               t.$8,
               t.$9,
               t.$10,
               t.$11,
               t.$12,
               t.$13,
               t.$14,
               t.$15
        from @stage.OLYMPICS t
    )
    pattern = '.*athlete_events.csv.gz'
    on_error = CONTINUE
    force = false
    file_format = (field_optionally_enclosed_by = '"'
        type = 'csv'
        compression = GZIP
        field_delimiter = ','
        skip_header = 1);

AWS Glue Job Requirements

The Cloud Formation Script

The foundation of AWS is Cloud Formation. A basic Cloud Formation file contains a parameters section, a role section, and a component section. The component section above contains the Secrets Manager snippet earlier in the Article.

AWSTemplateFormatVersion: "2010-09-09"
Description: >
  This Template Configures a Job to Load Event Data into a Snowflake Table using Glue

Parameters:
  JobName:
    Type: String
    Description: "The Glue Job name used for the Stack and tags in the Snowflake query"

  JobSql:
    Type: String
    Description: "A SQL COPY function to load the data into Snowflake."

  S3DataHome:
    Type: String
    MinLength: "1"
    Description: "The S3 Bucket Containing the Data Lake Data"

  SecretName:
    Type: String
    Description: "The secret containing the Snowflake login information"

Resources:
  SnowflakeGlueJobRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - "s3:GetObject"
                  - "s3:PutObject"
                  - "s3:ListBucket"
                  - "s3:DeleteObject"
                Resource:
                  - !Sub "arn:aws:s3:::${S3DataHome}"
                  - !Sub "arn:aws:s3:::${S3DataHome}/*"

        - PolicyName: "AllowSecretsManager"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action: [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret"
                ]
                Resource: [
                    !Sub "arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:${SecretName}*"
                ]

      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
      Path: "/"

  LoadOlympicData:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: pythonshell
        PythonVersion: 3
        ScriptLocation: !Sub "s3://${S3DataHome}/src/etl-scripts/copy_to_snowflake.py"
      GlueVersion: 1.0
      DefaultArguments:
        "--job-bookmark-option": "job-bookmark-enable"
        "--job-language": "python"
        "--extra-py-files": !Sub "s3://${S3DataHome}/lib/snowflake_connector_python-2.4.2-cp37-cp37m-manylinux2014_x86_64.whl"
        "--RegionName": !Sub "${AWS::Region}"
        "--SecretName": !Ref SecretName
        "--JobName": !Ref JobName
        "--JobBucket": !Ref S3DataHome
        "--JobSql": !Ref JobSql
      ExecutionProperty:
        MaxConcurrentRuns: 2
      MaxRetries: 0
      Name: snowflake-load-olympic-data
      Role: !Ref SnowflakeGlueJobRole

Create the Snowflake Python Wheel in Docker

Take careful note of the line with the Snowflake connector library. Since Snowflake is not native to AWS, you will need to provide a Wheel file with the compiled binaries of the Snowflake Python library. Use docker or an EC2 instance with the AWS AMI to create the wheel file.

Here is the docker file.

python3.7 -m venv wheel-env
source wheel-env/bin/activate
pip install --upgrade pip
cat "snowflake-connector-python" > requirements.txt
for f in $(cat ../requirements.txt); do pip wheel $f -w ../wheelhouse; done
cd wheelhouse/
INDEXFILE="<html><head><title>Links</title></head><body><h1>Links</h1>"
for f in *.whl; do INDEXFILE+="<a href='$f'>$f</a><br>"; done
INDEXFILE+="</body></html>"
echo "$INDEXFILE" > index.html
cd ..
deactivate
rm -rf cache wheel-env
aws s3 sync wheelhouse s3://${S3DataHome}/lib/

Create the Python Script

The Glue job takes a Python script.

ScriptLocation: !Sub "s3://${S3DataHome}/src/etl-scripts/copy_to_snowflake.py"

The script is fairly generic and takes a SQL file to execute so it can be reused as well.

import sys

import boto3
import json

from botocore.exceptions import ClientError
from awsglue.utils import getResolvedOptions

import snowflake.connector


def get_secret_json(session, secret_name, region_name):

    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    secret = None
    get_secret_value_response = None

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
            raise e
    else:
        if 'SecretString' in get_secret_value_response:
            secret = get_secret_value_response['SecretString']

    return json.loads(secret)


def connect(user, password, account, database, warehouse, session_parameters=None):
    """
    Connect to Snowflake
    """
    return snowflake.connector.connect(
        user=user,
        password=password,
        account=account,
        database=database,
        warehouse=warehouse,
        session_parameters=session_parameters
    )


def read_sql(session, bucket, filename):
    s3_client = session.client("s3")
    s3_object = s3_client.get_object(Bucket=bucket, Key=filename)
    return s3_object['Body'].read().decode("utf-8")


def main():
    # Create a Secrets Manager client
    session = boto3.session.Session()

    args = getResolvedOptions(sys.argv, ['JobName', 'RegionName', 'SecretName', 'JobBucket', 'JobSql'])

    json_secret = get_secret_json(session, args["SecretName"], args["RegionName"])

    sql = read_sql(session=session, bucket=args["JobBucket"], filename=args["JobSql"])

    with connect(
            user=json_secret['USERNAME'],
            password=json_secret['PASSWORD'],
            account=json_secret['ACCOUNT'],
            database=json_secret['DB'],
            warehouse=json_secret['WAREHOUSE'],
            session_parameters={
                'QUERY_TAG': args["JobName"]
            }) as con:
        cs = con.cursor()

        result_cs = cs.execute(sql)

        result_row = result_cs.fetchone()

        print(result_row)


if __name__ == "__main__":
    main()

Create the Glue Job with the Cloud Formation Script

Here is the Makefile form for executing the Cloud Formation script to create the Glue Job.

package-job:
   aws cloudformation package \
      --template-file resources/glue/snowflake-glue-load-history.yaml \
             --s3-bucket ${S3_DEPLOYMENT_BUCKET} \
       --output-template-file build/snowflake-glue-load-history.yaml

deploy-job: package-job
   aws cloudformation deploy \
      --template-file build/snowflake-glue-load-history.yaml \
      --parameter-overrides S3DataHome=${S3_DATA_BUCKET} \
         SecretName="${SECRET_NAME}" JobName=${LOAD_OLYMPICS_DATA_JOB} \
         JobSql=${SQL_COPY_OLYMPIC_DATA} \
       --stack-name ${LOAD_OLYMPICS_DATA_JOB} \
       --capabilities CAPABILITY_IAM

Running the Code

Execute the Glue Job (or Schedule it)

You can run and execute the Glue Job in the AWS console.

The Glue process costs $0.44 per DPU-Hour, billed per second, with a 1-minute minimum, so it is going to be very competitive price-wise with even free ETL tools running on an EC2 instance.

Checking the Results

Now the fun part - running the query to check your data.

select TEAM,
       sum(case when MEDAL='Gold' then 1 else 0 end) GOLD_MEDALS,
sum(case when MEDAL='Silver' then 1 else 0 end) SILVER_MEDALS,
sum(case when MEDAL='Bronze' then 1 else 0 end) BRONZE_MEDALS
from stage.OLYMPICS_ATHELETE_EVENT
where medal in ('Gold', 'Silver', 'Bronze')
 and sport = 'Fencing'
    group by team
order by sum(case when MEDAL='Gold' then 1 else 0 end) desc,
         sum(case when MEDAL='Silver' then 1 else 0 end) desc,
         sum(case when MEDAL='Bronze' then 1 else 0 end) desc;

So who is the best in Fencing? Italy, of course.