  • Tim Burns

Olympics Person-Level Data

Updated: Apr 3



The Author at Rhode Island Fencing Club


As a fencer, I am getting excited about the Olympics this summer. As a data scientist, I am intrigued by the person-level data available on Olympians, published in this Kaggle dataset:


https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results


Import the Data with AWS Glue

AWS Glue is a managed ETL service that integrates well with databases and machine-learning applications.


Create Tables with Data Crawlers

Data crawlers take the tedious work out of hand-writing "CREATE TABLE ..." statements. Instead of opening a file to find out what's inside, use a data crawler to infer the table definition for you.


Start by copying the data from the above Kaggle location into your favorite S3 bucket.


aws s3 sync data ${S3_DATA_URL}

This copies both athlete_events.csv.zip and noc_regions.csv. Never bother with unzipping or decompressing files; modern ETL services will handle that for you more efficiently.
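A minimal sketch of the copy step, assuming the Kaggle files sit in a local data/ directory and S3_DATA_URL points at your staging prefix (the bucket name shown is hypothetical):

```shell
# Hypothetical bucket and prefix -- substitute your own.
export S3_DATA_URL="s3://my-data/stage/olympics/kaggle"

# Copy the two Kaggle files up, then confirm they landed.
aws s3 sync data "${S3_DATA_URL}"
aws s3 ls "${S3_DATA_URL}/"
```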


Write a CloudFormation template that takes your bucket as the main parameter, and use resources to set the permissions, create the database, define a classifier to parse the CSV format, and define a crawler to create the tables automatically.


AWSTemplateFormatVersion: "2010-09-09"
Description: >
  This Template Configures the Data Lake for the Olympics Athlete Event Data

Parameters:
  S3DataHome:
    Type: String
    MinLength: "1"
    Description: "The S3 Bucket Containing the Data Lake Data"

Resources:
  AWSGlueJobRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - "s3:GetObject"
                  - "s3:PutObject"
                  - "s3:ListBucket"
                  - "s3:DeleteObject"
                Resource:
                  - !Sub "arn:aws:s3:::${S3DataHome}"
                  - !Sub "arn:aws:s3:::${S3DataHome}/*"

      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
      Path: "/"

  OlympicsGlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "olympics-data"
        Description: "Source: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results"
        LocationUri: !Sub "s3://${S3DataHome}/stage/olympics/kaggle/"

  SimpleCsvClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      CsvClassifier:
        Name: "olympics-data-csv-classifier"
        ContainsHeader: PRESENT
        Delimiter: ','


  SchemaCreationCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: "olympics-data-create-schema"
      Role: !Ref AWSGlueJobRole
      DatabaseName: !Ref OlympicsGlueDatabase
      Classifiers:
        - !Ref SimpleCsvClassifier
      Targets:
        S3Targets:
          - Path: !Sub "s3://${S3DataHome}/stage/olympics"
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"

Package and deploy the CloudFormation template to create the stack.


aws cloudformation package \
    --template-file resources/glue/olympics-glue-datalake.yaml \
    --s3-bucket ${S3_DEPLOYMENT_BUCKET} \
    --output-template-file build/packaged-olympics-glue-datalake.yaml

aws cloudformation deploy \
    --template-file build/packaged-olympics-glue-datalake.yaml \
    --parameter-overrides S3DataHome=${S3_DATA_BUCKET} \
    --stack-name ${DATA_LAKE_NAME} \
    --capabilities CAPABILITY_IAM

When stack creation completes, run the crawler and you will have your table definitions based on the CSV files.
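The crawler can be started from the console or, equivalently, from the CLI; the crawler name below comes from the template:

```shell
# Start the crawler created by the stack.
aws glue start-crawler --name olympics-data-create-schema

# Poll its state; it returns to READY when the run is finished.
aws glue get-crawler --name olympics-data-create-schema \
    --query 'Crawler.State' --output text
```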

Each table will have the appropriate columns. Click the Tables link in the Glue console to view them.
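If you prefer the CLI to the console, the same table definitions can be listed directly (the database name comes from the template):

```shell
# List the tables the crawler created in the olympics-data database.
aws glue get-tables --database-name olympics-data \
    --query 'TableList[].Name' --output text
```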


Integrating Glue with Snowflake

To integrate Glue with Snowflake, follow the AWS guide for building a wheelhouse of Python modules for the Snowflake connector.


https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/


Here is the script, run inside Docker, that builds the wheel files for the Snowflake library.


Create Wheel File for Snowflake in Glue


python3.7 -m venv wheel-env
source wheel-env/bin/activate
pip install --upgrade pip
echo "snowflake-connector-python" > requirements.txt
for f in $(cat requirements.txt); do pip wheel $f -w wheelhouse; done
cd wheelhouse/
INDEXFILE="<html><head><title>Links</title></head><body><h1>Links</h1>"
for f in *.whl; do INDEXFILE+="<a href='$f'>$f</a><br>"; done
INDEXFILE+="</body></html>"
echo "$INDEXFILE" > index.html
cd ..
deactivate
rm -rf cache wheel-env
aws s3 sync wheelhouse ${S3_LIB_URL}

Make sure the IAM user configured in the ~/.aws/credentials file inside the Docker instance has the access needed to write the wheel files to the bucket.
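To have a Glue 2.0 job install the wheels at start-up, point it at the modules through its job parameters. The values below are illustrative; the parameter names are the standard Glue 2.0 ones described in the blog post linked above:

```shell
# Illustrative DefaultArguments for a Glue 2.0 job (set in the console or API):
#   --additional-python-modules        snowflake-connector-python
#   --python-modules-installer-option  --index-url <URL of the wheelhouse index>
```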


