Photo by Meredith Sammon-Burns - Cache la Poudre River, Colorado
Storage Integration is fundamental to a scalable and secure Snowflake Data Warehouse, and it deserves a clear understanding. A slight misconception about the Storage Integration layer can lead to confusion, wasted effort, and even data security issues. In this blog post, I give a basic description of Storage Integration and provide AWS CloudFormation templates to build the connection between an AWS account and a Snowflake account.
Components of a Storage Integration
Storage Integration requires three components.
An AWS S3 bucket to bring in your data
A Storage Integration object in Snowflake to build stages
An IAM role to assign to your Snowflake Storage Integration
The AWS S3 Bucket for your Data
The AWS S3 bucket should be dedicated to holding your data, and it should be private and encrypted. Create the bucket with private access control and block all public access. AWS applies default server-side encryption (SSE-S3) to new buckets, and as of January 5, 2023 you can track the encryption status across all regions, including AWS GovCloud (US) (see the AWS User Guide Encryption FAQ).
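If you want to confirm the encryption configuration on an existing bucket, the AWS CLI can report it directly. A quick check (the bucket name here simply matches the example used later in this post) looks like this:
# Show the server-side encryption rules configured on the bucket
aws s3api get-bucket-encryption \
    --bucket customer-acme-data-lake \
    --query 'ServerSideEncryptionConfiguration.Rules'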
Build a new bucket with these default security specifications using the CloudFormation template included below.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: CloudFormation template for the staging location
Parameters:
  SnowflakeBucket:
    Type: String
Resources:
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      AccessControl: Private
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      BucketName:
        Ref: SnowflakeBucket
Running the Makefile included in the source code will create the bucket on your AWS Account.
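Under the hood, that Makefile target boils down to a single CloudFormation deploy. A sketch of the call, assuming a hypothetical template path and stack name (the actual target in the repository may differ), looks like this:
# Sketch only: the template path and stack name are assumptions
aws cloudformation --profile owlmtn deploy \
    --template-file build/snowflake_stage_bucket.yml \
    --stack-name snowflake-stage-bucket \
    --parameter-overrides \
        SnowflakeBucket="customer-acme-data-lake"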
The STORAGE_INTEGRATION object in Snowflake
Snowflake needs access to the bucket you have just created. As an official AWS Partner, Snowflake is deeply integrated with the AWS infrastructure and offers a product that operates within the AWS Cloud.
Snowflake in the AWS Cloud
The STORAGE_INTEGRATION object provides the output we will use to create our bucket access role in AWS. This ensures that a specific storage integration can access only a specific S3 bucket and folder location.
Create the Storage Integration to get the Default Values
Recreate the Storage Integration with the Role Output from AWS
In the provided example, I use SnowSQL, the Snowflake command-line client, which has many powerful features for automating your Snowflake provisioning procedures.
!set variable_substitution=true;
-- Create the Storage Integration for the stage data.
-- For more information see
--
-- https://docs.snowflake.com/en/sql-reference/sql/create-storage-integration.html
--
use role ACCOUNTADMIN;
CREATE or REPLACE STORAGE INTEGRATION &{STORAGE_INTEGRATION_NAME}
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
STORAGE_AWS_ROLE_ARN = '&{SNOWFLAKE_INTEGRATION_ROLE}'
ENABLED = TRUE
STORAGE_ALLOWED_LOCATIONS = ('s3://&{SNOWFLAKE_BUCKET}/stage')
COMMENT = 'Storage Integration for Snowflake Stage Data';
grant create stage on schema &{DATABASE}.stage to role &{DATABASE_ADMIN_ROLE};
grant usage on integration &{STORAGE_INTEGRATION_NAME} to role &{DATABASE_ADMIN_ROLE};
desc integration &{STORAGE_INTEGRATION_NAME}; -- Use these values in the CloudFormation Role Template
We need the output from the STORAGE_INTEGRATION to create the role, so set the role ARN to a dummy value and create the storage integration first.
export SNOWFLAKE_INTEGRATION_ROLE=arn:aws:iam::001234567890:role/myrole
snowsql --connection owlmtn \
-D STORAGE_INTEGRATION_NAME=storage_integration_acme_customer \
-D SNOWFLAKE_BUCKET=customer-acme-data-lake \
-D SNOWFLAKE_INTEGRATION_ROLE=arn:aws:iam::001234567890:role/myrole \
-D DATABASE=owlmtn \
-D DATABASE_ADMIN_ROLE=APPADMIN \
-f src/snowflake/create_storage_integration.sql
SnowSQL will output the data with straightforward formatting.
1 Row(s) produced. Time Elapsed: 0.146s
+---------------------------+---------------+-------------------------------------------------+------------------+
| property | property_type | property_value | property_default |
|---------------------------+---------------+-------------------------------------------------+------------------|
| ENABLED | Boolean | true | false |
| STORAGE_PROVIDER | String | S3 | |
| STORAGE_ALLOWED_LOCATIONS | List | s3://customer-acme-data-lake/stage | [] |
| STORAGE_BLOCKED_LOCATIONS | List | | [] |
| STORAGE_AWS_IAM_USER_ARN | String | arn:aws:iam::222333633027:user/abc-134 | |
| STORAGE_AWS_ROLE_ARN | String | arn:aws:iam::001234567890:role/myrole | |
| STORAGE_AWS_EXTERNAL_ID | String | SBA55555_SFCRole=2_LkuvROxjMplC2o0+XgyhFpIpUuU= | |
| COMMENT | String | Storage Integration for Snowflake Stage Data | |
+---------------------------+---------------+-------------------------------------------------+------------------+
The code snippet below shows how to extract the essential variables to build the role in AWS.
export STORAGE_AWS_IAM_USER_ARN=arn:aws:iam::222333633027:user/abc-134
export STORAGE_AWS_EXTERNAL_ID=SBA55555_SFCRole=2_LkuvROxjMplC2o0+XgyhFpIpUuU=
We will need to recreate the STORAGE_INTEGRATION after we make the role, and then take the new STORAGE_AWS_EXTERNAL_ID from the SnowSQL output to update the role's conditional match criteria.
The IAM Role for your Snowflake Account
The IAM role gives the Snowflake IAM principal access to the specified bucket and folder location. As the trust linkage, it is the most critical component in preparing to transfer data from the S3 bucket in your AWS account into Snowflake within the AWS Cloud.
The CloudFormation template below details how we create the role in AWS.
AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Description: Role for Snowflake Storage Integration
Parameters:
  SnowflakeRoleName:
    Type: String
  SnowflakeBucketArn:
    Type: String
  SnowflakePrincipalArn:
    Type: String
  SnowflakeExternalId:
    Type: String
Resources:
  SnowflakeStorageIntegration:
    Type: "AWS::IAM::Role"
    Properties:
      RoleName: !Ref SnowflakeRoleName
      Description: "Stage access for Snowflake"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              "AWS": !Ref SnowflakePrincipalArn
            Action: "sts:AssumeRole"
            Condition:
              StringEquals:
                "sts:ExternalId": !Ref SnowflakeExternalId
      Path: "/"
      Policies:
        - PolicyName: "AllowStageBucket"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - "s3:ListBucket"
                  - "s3:GetBucketLocation"
                  - "s3:PutObject"
                  - "s3:GetObject"
                Resource:
                  - !Ref SnowflakeBucketArn
                  - !Join [ "/", [ !Ref SnowflakeBucketArn, '*' ] ]
Outputs:
  SnowflakeStorageIntegration:
    Description: "Snowflake Storage Integration ARN"
    Value: !GetAtt SnowflakeStorageIntegration.Arn
The included Makefile will build the role. The environment variables captured from the previous step will guide the role's creation.
aws cloudformation --profile owlmtn deploy \
--template-file build/snowflake_vpc_role.yml \
--stack-name snowflake-storage-integration-role \
--parameter-overrides \
SnowflakeRoleName="role-customer-acme-data-lake" \
SnowflakeBucketArn="arn:aws:s3:::customer-acme-data-lake" \
SnowflakePrincipalArn="arn:aws:iam::222333633027:user/abc-134" \
SnowflakeExternalId="SBA55555_SFCRole=2_LkuvROxjMplC2o0+XgyhFpIpUuU=" \
--capabilities CAPABILITY_NAMED_IAM
After a successful run, check the CloudFormation output to get the ARN of the created role.
Get the Arn value from the CloudFormation output
Set the SNOWFLAKE_INTEGRATION_ROLE to be the ARN of the output and then rerun the SnowSQL script to build the Storage Integration.
snowsql --connection owlmtn \
-D STORAGE_INTEGRATION_NAME=storage_integration_acme_customer \
-D SNOWFLAKE_BUCKET=customer-acme-data-lake \
-D SNOWFLAKE_INTEGRATION_ROLE=arn:aws:iam::222333633027:role/role-customer-acme-data-lake \
-D DATABASE=owlmtn \
-D DATABASE_ADMIN_ROLE=APPADMIN \
-f src/snowflake/create_storage_integration.sql
Get the value of STORAGE_AWS_EXTERNAL_ID from the output, and then rerun the CloudFormation template to update the role with the refreshed storage integration. You must refresh the role every time you recreate the storage integration.
aws cloudformation --profile owlmtn deploy \
--template-file build/snowflake_vpc_role.yml \
--stack-name snowflake-storage-integration-role \
--parameter-overrides \
SnowflakeRoleName="role-customer-acme-data-lake" \
SnowflakeBucketArn="arn:aws:s3:::customer-acme-data-lake" \
SnowflakePrincipalArn="arn:aws:iam::222333633027:user/abc-134" \
SnowflakeExternalId="MyAccount_SFCRole=2_LkuvROxjMplC2o0+XgyhFpIpUuU=" \
--capabilities CAPABILITY_NAMED_IAM
Creating the Stage from the Storage Integration
The STAGE object will allow you to test your STORAGE_INTEGRATION and make necessary changes. The Makefile provided includes a target that uses SnowSQL to create the STAGE.
snowsql --connection owlmtn \
--rolename APPADMIN \
-D STORAGE_INTEGRATION_NAME=storage_integration_acme_customer \
-D SNOWFLAKE_BUCKET=customer-acme-data-lake \
-D STAGE_SCHEMA=STAGE \
-D STAGE_ENDPOINT=SIMPLE_STAGE \
-f src/snowflake/create_simple_stage.sql
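The create_simple_stage.sql script lives in the repository; a minimal sketch of it, assuming the same variable-substitution style as the storage integration script (the actual script may differ), looks like this:
-- Minimal sketch of create_simple_stage.sql; the repository version may differ.
!set variable_substitution=true;

use schema &{STAGE_SCHEMA};

create or replace stage &{STAGE_ENDPOINT}
    storage_integration = &{STORAGE_INTEGRATION_NAME}
    url = 's3://&{SNOWFLAKE_BUCKET}/stage'
    file_format = (type = json);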
Troubleshooting
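The quickest way to exercise the new stage is to list its contents; if the trust relationship is wrong, the listing fails immediately.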
use role APPADMIN;
use schema STAGE;
list @SIMPLE_STAGE;
[42601][3073] SQL execution error: Error assuming AWS_ROLE. Please verify the role and externalId are configured correctly in your AWS policy.
When troubleshooting, the most common issue is a mismatch between the STORAGE_AWS_EXTERNAL_ID on the STORAGE_INTEGRATION and the sts:ExternalId value stored in AWS under the Trust Relationships tab of the IAM role. Get the current value from the STORAGE_INTEGRATION and rerun the create-role target in the Makefile to update the role in AWS.
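To see the value Snowflake currently expects, describe the integration again as ACCOUNTADMIN:
use role ACCOUNTADMIN;
desc integration storage_integration_acme_customer;
-- Copy the STORAGE_AWS_EXTERNAL_ID property_value into the SnowflakeExternalId parameter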
Fixing the issue should give the following result:
list @SIMPLE_STAGE;
+---------------------------------------------------------------+----+--------------------------------+----------------------------+
|name |size|md5 |last_modified |
+---------------------------------------------------------------+----+--------------------------------+----------------------------+
|s3://customer-acme-data-lake/stage/prototype/simple/dummy.jsonl|91 |c5d7a7e5318857ef8d76f5790538887c|Mon, 9 Jan 2023 00:25:29 GMT|
+---------------------------------------------------------------+----+--------------------------------+----------------------------+
First, if there is no data in the stage, run the deploy-test-data target in the Makefile.
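That target pushes a small JSON Lines file to the stage prefix; judging from the query results at the end of this section, its contents look something like this (illustrative, not the repository's exact file):
{"ID": 1, "Name": "Hello", "Value": "World"}
{"ID": 2, "Name": "Foo", "Value": "Bar"}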
Then, create the table using the create-simple-table target and load the data with a COPY statement.
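The columns of that table can be inferred from the query output at the end of this section; a sketch of the DDL, with audit-column defaults that are my assumptions rather than the repository's exact definition, looks like this:
-- Sketch of the SIMPLE_TABLE DDL; column names inferred from the SELECT output below
create table if not exists SIMPLE_TABLE (
    SIMPLE_KEY          integer autoincrement,
    ID                  integer,
    NAME                string,
    VALUE               string,
    DW_CREATE_DATE      timestamp_tz default current_timestamp(),
    DW_CREATE_USER      string,
    DW_UPDATE_DATE      timestamp_tz default current_timestamp(),
    DW_UPDATE_USER      string,
    DW_FILENAME         string,
    DW_FILE_ROW_NUMBER  integer,
    DW_ACTIVE           boolean default true
);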
copy into SIMPLE_TABLE (ID,
NAME,
VALUE,
DW_FILENAME,
DW_FILE_ROW_NUMBER)
from (select $1:ID::integer id
, $1:Name::string name
, $1:Value::string value
, metadata$filename
, metadata$file_row_number
from @SIMPLE_STAGE)
file_format = (type = json);
select * from SIMPLE_TABLE;
+----------+--+-----+-----+------------------------------------+--------------+------------------------------------+--------------+----------------------------------+------------------+---------+
|SIMPLE_KEY|ID|NAME |VALUE|DW_CREATE_DATE |DW_CREATE_USER|DW_UPDATE_DATE |DW_UPDATE_USER|DW_FILENAME |DW_FILE_ROW_NUMBER|DW_ACTIVE|
+----------+--+-----+-----+------------------------------------+--------------+------------------------------------+--------------+----------------------------------+------------------+---------+
|1 |1 |Hello|World|2023-01-08 16:42:31.432000000 -08:00|TIMBURNSDEV |2023-01-08 16:42:31.432000000 -08:00|TIMBURNSDEV |stage/prototype/simple/dummy.jsonl|1 |true |
|2 |2 |Foo |Bar |2023-01-08 16:42:31.432000000 -08:00|TIMBURNSDEV |2023-01-08 16:42:31.432000000 -08:00|TIMBURNSDEV |stage/prototype/simple/dummy.jsonl|2 |true |
+----------+--+-----+-----+------------------------------------+--------------+------------------------------------+--------------+----------------------------------+------------------+---------+
You now have the foundations for a highly productive pipeline and the integration code to automate building out Storage Integrations and Stage endpoints on AWS.
Conclusion
The mechanics of integrating Snowflake and AWS are well documented in the Snowflake Docs (https://docs.snowflake.com/en/sql-reference/sql/create-storage-integration.html). Still, focusing on a single cloud environment like AWS helps to clarify the crucial details. Furthermore, real-world scenarios are often quite complex, so a clear understanding of the fundamentals is essential.
This short article provides a set of templates to build out a simple example using CloudFormation and SnowSQL. The code is open source and part of my public repository on GitHub (https://github.com/timowlmtn/bigdataplatforms/tree/master/storage_integration).
One new tidbit worth reviewing is the tracking of encryption status for your S3 bucket data. Again, a clear understanding of the crucial trust relationship between Snowflake and AWS is essential for compliance, security, and governance.