Tim Burns

How to get Data into Snowflake Securely

Updated: Jan 14, 2023


Photo by Meredith Sammon-Burns - Cache la Poudre River, Colorado

Storage Integration is fundamental to a scalable and secure Snowflake Data Warehouse, yet it is often misunderstood. A slight misconception about the Storage Integration layer can lead to confusion, wasted effort, and even data security issues. In this blog post, I describe the basics of Storage Integration and provide AWS CloudFormation templates to build the connection between an AWS account and a Snowflake account.

Components of a Storage Integration

Storage Integration requires three components.

  • An AWS S3 bucket to bring in your data

  • A Storage Integration object in Snowflake to build stages

  • An IAM role to assign to your Snowflake Storage Integration

The AWS S3 Bucket for your Data

The AWS S3 bucket should be dedicated to holding your data, and it should be private and encrypted. Create the bucket with private access control and block all public access. AWS applies default server-side encryption (SSE-S3) to new buckets, and as of January 5, 2023, you can track the encryption status across all regions, including US GovCloud (see the AWS User Guide Encryption FAQ).
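
You can check both settings on an existing bucket directly from the AWS CLI (standard s3api calls, shown here with the example bucket name used later in this post):

# Report the default encryption configuration
aws s3api get-bucket-encryption --bucket customer-acme-data-lake

# Confirm that all public access is blocked
aws s3api get-public-access-block --bucket customer-acme-data-lake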

Build a new bucket with the default security specifications using the included CloudFormation template.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: CloudFormation Template for the Staging location
Parameters:
  SnowflakeBucket:
    Type: String

Resources:
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      AccessControl: Private
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      BucketName:
        Ref: SnowflakeBucket

Running the Makefile included in the source code will create the bucket in your AWS account.
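
If you prefer to run the deployment by hand, the Makefile target amounts to a standard CloudFormation deploy along these lines (the template path and stack name here are illustrative; see the repository for the exact target):

# Deploy the bucket template, passing the bucket name as a parameter
aws cloudformation --profile owlmtn deploy \
    --template-file build/snowflake_stage_bucket.yml \
    --stack-name snowflake-stage-bucket \
    --parameter-overrides \
        SnowflakeBucket="customer-acme-data-lake"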

The STORAGE_INTEGRATION object in Snowflake

Snowflake needs access to the bucket you have just created. As an official AWS Partner, Snowflake integrates deeply with the AWS infrastructure and offers a product that operates within the AWS Cloud.


Snowflake in the AWS Cloud

The STORAGE_INTEGRATION object provides the output we will use to create our bucket access role in AWS. This design ensures that a specific storage integration can access only a specific S3 bucket and folder location. Building it takes two passes:

  1. Create the Storage Integration to get the Default Values

  2. Recreate the Storage Integration with the Role Output from AWS

In the example below, I use SnowSQL, the default Snowflake SQL client, which has many powerful features for automating your Snowflake creation procedures.

!set variable_substitution=true;
-- Create the Stage data.
-- For more information see
--
--   https://docs.snowflake.com/en/sql-reference/sql/create-storage-integration.html
--
use role ACCOUNTADMIN;

CREATE or REPLACE STORAGE INTEGRATION &{STORAGE_INTEGRATION_NAME}
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  STORAGE_AWS_ROLE_ARN = '&{SNOWFLAKE_INTEGRATION_ROLE}'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('s3://&{SNOWFLAKE_BUCKET}/stage')
  COMMENT = 'Storage Integration for Snowflake Stage Data';

grant create stage on schema &{DATABASE}.stage to role &{DATABASE_ADMIN_ROLE};

grant usage on integration &{STORAGE_INTEGRATION_NAME} to role &{DATABASE_ADMIN_ROLE};

desc integration &{STORAGE_INTEGRATION_NAME}; -- Use these values in the CloudFormation Role Template

We need the output of DESC INTEGRATION to create the role, but the integration cannot be created without a role ARN. So set STORAGE_AWS_ROLE_ARN to a dummy role and create the storage integration first.

export SNOWFLAKE_INTEGRATION_ROLE=arn:aws:iam::001234567890:role/myrole
snowsql --connection owlmtn \
                        -D STORAGE_INTEGRATION_NAME=storage_integration_acme_customer \
                        -D SNOWFLAKE_BUCKET=customer-acme-data-lake \
                        -D SNOWFLAKE_INTEGRATION_ROLE=arn:aws:iam::001234567890:role/myrole \
                        -D DATABASE=owlmtn \
                        -D DATABASE_ADMIN_ROLE=APPADMIN \
                        -f src/snowflake/create_storage_integration.sql

SnowSQL will output the data with straightforward formatting.

1 Row(s) produced. Time Elapsed: 0.146s
+---------------------------+---------------+-------------------------------------------------+------------------+
| property                  | property_type | property_value                                  | property_default |
|---------------------------+---------------+-------------------------------------------------+------------------|
| ENABLED                   | Boolean       | true                                            | false            |
| STORAGE_PROVIDER          | String        | S3                                              |                  |
| STORAGE_ALLOWED_LOCATIONS | List          | s3://customer-acme-data-lake/stage              | []               |
| STORAGE_BLOCKED_LOCATIONS | List          |                                                 | []               |
| STORAGE_AWS_IAM_USER_ARN  | String        | arn:aws:iam::222333633027:user/abc-134          |                  |
| STORAGE_AWS_ROLE_ARN      | String        | arn:aws:iam::001234567890:role/myrole           |                  |
| STORAGE_AWS_EXTERNAL_ID   | String        | SBA55555_SFCRole=2_LkuvROxjMplC2o0+XgyhFpIpUuU= |                  |
| COMMENT                   | String        | Storage Integration for Snowflake Stage Data    |                  |
+---------------------------+---------------+-------------------------------------------------+------------------+

The code snippet below shows how to extract the essential variables to build the role in AWS.

export STORAGE_AWS_IAM_USER_ARN=arn:aws:iam::222333633027:user/abc-134
export STORAGE_AWS_EXTERNAL_ID=SBA55555_SFCRole=2_LkuvROxjMplC2o0+XgyhFpIpUuU=

After we create the role in AWS, we will recreate the STORAGE_INTEGRATION with the real role ARN, and then take the new STORAGE_AWS_EXTERNAL_ID from the SnowSQL output to update the role's conditional match criteria.
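
If you script this handoff, you can capture both values straight from the DESC INTEGRATION output instead of copying them by hand. A minimal sketch, assuming SnowSQL's plain output format keeps the property name in the first column and its value in the third:

snowsql --connection owlmtn \
        -o friendly=false -o header=false -o timing=false -o output_format=plain \
        -q "desc integration storage_integration_acme_customer" \
    | awk '$1 == "STORAGE_AWS_IAM_USER_ARN" { print "export STORAGE_AWS_IAM_USER_ARN=" $3 }
           $1 == "STORAGE_AWS_EXTERNAL_ID"  { print "export STORAGE_AWS_EXTERNAL_ID="  $3 }'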

The IAM Role for your Snowflake Account

The IAM role gives the Snowflake IAM principal access to the bucket and folder location specified. As the trust linkage between the two accounts, it is the most critical component in readying the transfer of data from your AWS account's S3 bucket into Snowflake within the AWS Cloud.

The CloudFormation template below details how we create the role in AWS.

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Description: Role for Snowflake Storage Integration

Parameters:
  SnowflakeRoleName:
    Type: String

  SnowflakeBucketArn:
    Type: String

  SnowflakePrincipalArn:
    Type: String

  SnowflakeExternalId:
    Type: String

Resources:
  SnowflakeStorageIntegration:
    Type: "AWS::IAM::Role"
    Description: "Stage access for Snowflake"
    Properties:
      RoleName: !Ref SnowflakeRoleName
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              "AWS": !Ref SnowflakePrincipalArn
            Action: "sts:AssumeRole"
            Condition:
              StringEquals:
                sts:ExternalId: !Ref SnowflakeExternalId
      Path: "/"
      Policies:
        - PolicyName: "AllowStageBucket"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - "s3:ListBucket"
                  - "s3:GetBucketLocation"
                  - "s3:PutObject"
                  - "s3:GetObject"

                Resource:
                  - !Ref SnowflakeBucketArn
                  - !Join [ "/", [ !Ref SnowflakeBucketArn, '*' ] ]
Outputs:
  SnowflakeStorageIntegration:
    Description: "Snowflake Storage Integration ARN"
    Value: !GetAtt SnowflakeStorageIntegration.Arn

The included Makefile will build the role. The environment variables captured from the previous step will guide the role's creation.

aws cloudformation --profile owlmtn deploy \
    --template-file build/snowflake_vpc_role.yml \
    --stack-name snowflake-storage-integration-role \
    --parameter-overrides \
        SnowflakeRoleName="role-customer-acme-data-lake" \
        SnowflakeBucketArn="arn:aws:s3:::customer-acme-data-lake" \
        SnowflakePrincipalArn="arn:aws:iam::000000000000:user/abc" \
        SnowflakeExternalId="MyAccount_SFCRole=StorageIntegrationCode=" \
    --capabilities CAPABILITY_NAMED_IAM

On a successful run, check the CloudFormation output to get the ARN of the created role.
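
Alternatively, describe-stacks returns the same value from the command line:

aws cloudformation --profile owlmtn describe-stacks \
    --stack-name snowflake-storage-integration-role \
    --query "Stacks[0].Outputs[0].OutputValue" \
    --output text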


Get the role ARN from the Value column of the CloudFormation output

Set SNOWFLAKE_INTEGRATION_ROLE to that ARN and rerun the SnowSQL script to build the Storage Integration.

snowsql --connection owlmtn \
                        -D STORAGE_INTEGRATION_NAME=storage_integration_acme_customer \
                        -D SNOWFLAKE_BUCKET=customer-acme-data-lake \
                        -D SNOWFLAKE_INTEGRATION_ROLE=arn:aws:iam::222333633027:role/role-customer-acme-data-lake \
                        -D DATABASE=owlmtn \
                        -D DATABASE_ADMIN_ROLE=APPADMIN \
                        -f src/snowflake/create_storage_integration.sql

Get the value of STORAGE_AWS_EXTERNAL_ID from the output, and then rerun the CloudFormation template to update the role with the refreshed storage integration. You must refresh the role every time you recreate the storage integration.

aws cloudformation --profile owlmtn deploy \
    --template-file build/snowflake_vpc_role.yml \
    --stack-name snowflake-storage-integration-role \
    --parameter-overrides \
        SnowflakeRoleName="role-customer-acme-data-lake" \
        SnowflakeBucketArn="arn:aws:s3:::customer-acme-data-lake" \
        SnowflakePrincipalArn="arn:aws:iam::000000000000:user/s4du-s-p2ss9090" \
        SnowflakeExternalId="MyAccount_SFCRole=2_LkuvROxjMplC2o0+XgyhFpIpUuU=" \
    --capabilities CAPABILITY_NAMED_IAM

Creating the Stage from the Storage Integration

The STAGE object will allow you to test your STORAGE_INTEGRATION and make necessary changes. The Makefile provided includes a target that uses SnowSQL to create the STAGE.

snowsql --connection owlmtn \
                        --rolename APPADMIN \
                        -D STORAGE_INTEGRATION_NAME=storage_integration_acme_customer \
                        -D SNOWFLAKE_BUCKET=customer-acme-data-lake \
                        -D STAGE_SCHEMA=STAGE \
                        -D STAGE_ENDPOINT=SIMPLE_STAGE \
                        -f src/snowflake/create_simple_stage.sql
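
The create_simple_stage.sql script lives in the repository; at its core, it is a short CREATE STAGE statement along these lines (a sketch using the same substitution variables; the exact script may differ):

use role APPADMIN;
use schema &{STAGE_SCHEMA};

CREATE or REPLACE STAGE &{STAGE_ENDPOINT}
  URL = 's3://&{SNOWFLAKE_BUCKET}/stage'
  STORAGE_INTEGRATION = &{STORAGE_INTEGRATION_NAME}
  COMMENT = 'Stage built on the storage integration';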

Troubleshooting

Listing the stage contents is the quickest end-to-end test of the integration:

use role APPADMIN;
use schema STAGE;
list @SIMPLE_STAGE;

[42601][3073] SQL execution error: Error assuming AWS_ROLE. Please verify the role and externalId are configured correctly in your AWS policy.

When troubleshooting, the most common issue is a mismatch between the external ID on the STORAGE_INTEGRATION and the value stored in AWS under the role's Trust relationships in the IAM console. Get the STORAGE_AWS_EXTERNAL_ID value from DESC INTEGRATION and rerun the create-role target in the Makefile to update the role in AWS.
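
To put the two values side by side, DESC INTEGRATION shows the external ID Snowflake presents, and the AWS CLI shows the condition the role actually trusts:

# The external ID condition the role trusts (role name from the example above)
aws iam get-role --profile owlmtn \
    --role-name role-customer-acme-data-lake \
    --query "Role.AssumeRolePolicyDocument.Statement[0].Condition"

# Compare with STORAGE_AWS_EXTERNAL_ID from "desc integration" in SnowSQL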

Fixing the issue should give the following result:

list @SIMPLE_STAGE;

+---------------------------------------------------------------+----+--------------------------------+----------------------------+
|name                                                           |size|md5                             |last_modified               |
+---------------------------------------------------------------+----+--------------------------------+----------------------------+
|s3://customer-acme-data-lake/stage/prototype/simple/dummy.jsonl|91  |c5d7a7e5318857ef8d76f5790538887c|Mon, 9 Jan 2023 00:25:29 GMT|
+---------------------------------------------------------------+----+--------------------------------+----------------------------+

First, if there is no data, run the deploy-test-data target in the Makefile.
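
That target boils down to copying a small JSONL file into the stage prefix. A sketch consistent with the file path and rows shown in this post (the exact target is in the repository):

# Write two test records and push them to the stage location
echo '{"ID": 1, "Name": "Hello", "Value": "World"}
{"ID": 2, "Name": "Foo", "Value": "Bar"}' > dummy.jsonl

aws s3 cp --profile owlmtn dummy.jsonl \
    s3://customer-acme-data-lake/stage/prototype/simple/dummy.jsonl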

Then, create the table using the create-simple-table target and load the data with a COPY statement.
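
You can infer the table's shape from the query output below; here is a sketch of the DDL, with audit-column defaults that are my assumptions rather than the repository's exact definition:

create or replace table SIMPLE_TABLE
(
    SIMPLE_KEY         integer autoincrement,  -- surrogate key
    ID                 integer,
    NAME               string,
    VALUE              string,
    DW_CREATE_DATE     timestamp_tz default current_timestamp(),
    DW_CREATE_USER     string       default current_user(),
    DW_UPDATE_DATE     timestamp_tz default current_timestamp(),
    DW_UPDATE_USER     string       default current_user(),
    DW_FILENAME        string,                 -- from metadata$filename
    DW_FILE_ROW_NUMBER integer,                -- from metadata$file_row_number
    DW_ACTIVE          boolean      default true
);

With the table in place, the COPY statement loads the staged file and captures the file metadata: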

copy into SIMPLE_TABLE (ID,
                        NAME,
                        VALUE,
                        DW_FILENAME,
                        DW_FILE_ROW_NUMBER)
    from (select $1:ID::integer    id
               , $1:Name::string name
               , $1:Value::string value
               , metadata$filename
               , metadata$file_row_number
          from @SIMPLE_STAGE)
    file_format = (type = json);

select * from SIMPLE_TABLE;
+----------+--+-----+-----+------------------------------------+--------------+------------------------------------+--------------+----------------------------------+------------------+---------+
|SIMPLE_KEY|ID|NAME |VALUE|DW_CREATE_DATE                      |DW_CREATE_USER|DW_UPDATE_DATE                      |DW_UPDATE_USER|DW_FILENAME                       |DW_FILE_ROW_NUMBER|DW_ACTIVE|
+----------+--+-----+-----+------------------------------------+--------------+------------------------------------+--------------+----------------------------------+------------------+---------+
|1         |1 |Hello|World|2023-01-08 16:42:31.432000000 -08:00|TIMBURNSDEV   |2023-01-08 16:42:31.432000000 -08:00|TIMBURNSDEV   |stage/prototype/simple/dummy.jsonl|1                 |true     |
|2         |2 |Foo  |Bar  |2023-01-08 16:42:31.432000000 -08:00|TIMBURNSDEV   |2023-01-08 16:42:31.432000000 -08:00|TIMBURNSDEV   |stage/prototype/simple/dummy.jsonl|2                 |true     |
+----------+--+-----+-----+------------------------------------+--------------+------------------------------------+--------------+----------------------------------+------------------+---------+

You now have the foundations of a highly productive pipeline and the integration code to automate building out Storage Integrations and Stage endpoints on AWS.

Conclusion

The mechanics of integrating Snowflake and AWS are well documented in the Snowflake Docs (https://docs.snowflake.com/en/sql-reference/sql/create-storage-integration.html). Still, focusing on a single cloud environment like AWS helps to clarify the crucial details. Furthermore, real-world scenarios are often quite complex, so a clear understanding of the fundamentals is essential.

This short article provides a set of templates to build out a simple example using CloudFormation and SnowSQL. The code is open source and part of my public repository on GitHub (https://github.com/timowlmtn/bigdataplatforms/tree/master/storage_integration).

A newer tidbit worth reviewing is the tracking of encryption status for S3 bucket data. Again, a clear understanding of this crucial trust relationship between Snowflake and AWS is essential for compliance, security, and governance.
