The Author at Rhode Island Fencing Club
As a fencer, I am getting excited about the Olympics this summer. As a data scientist, I am intrigued by the person-level data available on Olympians.
Import the Data with AWS Glue
AWS Glue is a managed ETL service that integrates well with databases and machine learning applications.
Create Tables with Data Crawlers
Data crawlers take the tedious work out of writing "CREATE TABLE ..." statements by hand. Instead of opening each file to find out what's inside, use a data crawler to infer the table definition for you.
Start by copying the data from the above Kaggle location into your favorite S3 bucket.
aws s3 sync data ${S3_DATA_URL}
It will copy both athlete_events.csv.zip and noc_regions.csv. Never bother with unzipping or decompressing files; modern ETL services will handle that for you more efficiently.
Write a CloudFormation template that takes your favorite bucket as the main parameter and uses resources to set the permissions, create the database, define a classifier to parse the CSV format, and define a crawler to create tables automatically.
AWSTemplateFormatVersion: "2010-09-09"
Description: >
  This Template Configures the Data Lake for the Olympics Athlete Event Data
Parameters:
  S3DataHome:
    Type: String
    MinLength: "1"
    Description: "The S3 Bucket Containing the Data Lake Data"
Resources:
  AWSGlueJobRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "s3:GetObject"
                  - "s3:PutObject"
                  - "s3:ListBucket"
                  - "s3:DeleteObject"
                Resource:
                  - !Sub "arn:aws:s3:::${S3DataHome}"
                  - !Sub "arn:aws:s3:::${S3DataHome}/*"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
      Path: "/"
  OlympicsGlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "olympics-data"
        Description: "Source: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results"
        LocationUri: !Sub "s3://${S3DataHome}/stage/olympics/kaggle/"
  SimpleCsvClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      CsvClassifier:
        Name: "olympics-data-csv-classifier"
        ContainsHeader: PRESENT
        Delimiter: ","
  SchemaCreationCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: "olympics-data-create-schema"
      Role: !Ref AWSGlueJobRole
      DatabaseName: !Ref OlympicsGlueDatabase
      Classifiers:
        - !Ref SimpleCsvClassifier
      Targets:
        S3Targets:
          - Path: !Sub "s3://${S3DataHome}/stage/olympics"
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
Package and deploy the CloudFormation template to create the stack.
aws cloudformation package \
--template-file resources/glue/olympics-glue-datalake.yaml \
--s3-bucket ${S3_DEPLOYMENT_BUCKET} \
--output-template-file build/packaged-olympics-glue-datalake.yaml
aws cloudformation deploy \
--template-file build/packaged-olympics-glue-datalake.yaml \
--parameter-overrides S3DataHome=${S3_DATA_BUCKET} \
--stack-name ${DATA_LAKE_NAME} \
--capabilities CAPABILITY_IAM
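The deploy command waits for stack creation to finish; a quick status check (reusing the same DATA_LAKE_NAME variable) confirms it succeeded:

```shell
# Confirm the stack reached CREATE_COMPLETE
# (or UPDATE_COMPLETE on a re-deploy)
aws cloudformation describe-stacks \
  --stack-name ${DATA_LAKE_NAME} \
  --query 'Stacks[0].StackStatus' \
  --output text
```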
When stack creation completes, run the crawler and you will have your table definitions based on the CSV files.
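The crawler can be started from the CLI as well as the console. This sketch assumes the crawler name defined in the template above and polls until the crawler returns to the READY state:

```shell
CRAWLER_NAME="olympics-data-create-schema"

# Start the crawler defined by the CloudFormation template
aws glue start-crawler --name "$CRAWLER_NAME"

# Poll until the crawl finishes (the state goes back to READY)
until [ "$(aws glue get-crawler --name "$CRAWLER_NAME" \
            --query 'Crawler.State' --output text)" = "READY" ]; do
  sleep 15
done
```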
Each table will have the appropriate columns. Click the Tables link in the Glue console to view them.
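If you prefer the CLI to the console, the same table definitions can be listed with get-tables; this assumes the database name from the template:

```shell
# List the tables the crawler created in the olympics-data database
aws glue get-tables \
  --database-name olympics-data \
  --query 'TableList[].Name' \
  --output text
```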
Integrating Glue with Snowflake
To integrate Glue with Snowflake, follow the AWS guidance for building a wheelhouse of Python libraries containing the Snowflake connector.
Here is the script, run inside a Docker container, that builds the Snowflake library.
Create Wheel File for Snowflake in Glue
python3.7 -m venv wheel-env
source wheel-env/bin/activate
pip install --upgrade pip
cat "snowflake-connector-python" > requirements.txt
for f in $(cat ../requirements.txt); do pip wheel $f -w ../wheelhouse; done
cd wheelhouse/
INDEXFILE="<html><head><title>Links</title></head><body><h1>Links</h1>"
for f in *.whl; do INDEXFILE+="<a href='$f'>$f</a><br>"; done
INDEXFILE+="</body></html>"
echo "$INDEXFILE" > index.html
cd ..
deactivate
rm -rf cache wheel-env
aws s3 sync wheelhouse ${S3_LIB_URL}
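Once the wheels are in S3, a Glue Python shell job can pick up the connector through the --extra-py-files argument. The job name, script location, and wheel filename below are hypothetical placeholders; substitute the actual filename pip produced in your wheelhouse:

```shell
# Sketch: create a Glue Python shell job that loads the Snowflake
# connector wheel uploaded above. The job name, S3_SCRIPTS_URL, the
# script name, and the exact wheel filename are all assumptions.
aws glue create-job \
  --name olympics-snowflake-load \
  --role AWSGlueJobRole \
  --command "Name=pythonshell,ScriptLocation=${S3_SCRIPTS_URL}/load_to_snowflake.py,PythonVersion=3" \
  --default-arguments "{\"--extra-py-files\": \"${S3_LIB_URL}/snowflake_connector_python-<version>.whl\"}"
```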
Configure the IAM user in the ~/.aws/credentials file inside the Docker container to ensure the service user has the proper access to upload the wheel files.