Building a Successful Data Initiative on AWS
Updated: Oct 18, 2020
Most big data projects fail; the successful ones use a simple architecture that will scale. Why do data initiatives fail?
* Building it and hoping for the best - Start with clear business goals
* Using platforms you don't understand to solve problems you understand even less - Pick common technologies that many engineers understand
* Architecting with "shiny object" big data technologies - Choose a simple architecture with proven technologies
What can you do to build a large-scale data initiative on AWS that will succeed?
* Separate execution and business flow using data pipelines
* Drive ingestion with proven event systems like SNS and SQS
* Only use "Big Data" systems for problems that need massive data throughput
Start with Clear Business Goals
Data initiatives often start with massive amounts of data that are scattered through silos and systems.
If your business goal is to make an aspect of that data useful for a decision-maker, the data must be comprehensible to that decision-maker - a human. A well-known finding in cognitive psychology is that short-term memory holds roughly seven plus or minus two items.
So the challenge, from the big data perspective, is to distill millions, billions, even trillions of rows of data - numbers bigger than the human mind can comprehend - into seven-plus-or-minus-two actionable data points.
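The distillation step above can be sketched with Pandas. This is a minimal illustration with a made-up orders table; the column names and aggregations are assumptions, not a prescribed schema.

```python
import pandas as pd

# Hypothetical raw event data: in production this would be millions of rows.
orders = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "amount": [120.0, 80.0, 200.0, 50.0, 150.0],
})

# Distill the raw rows into a handful of actionable data points,
# well under the seven-plus-or-minus-two limit.
summary = orders.groupby("region")["amount"].agg(
    total_revenue="sum", avg_order="mean", order_count="count"
)
print(summary)
```

The same shape of aggregation applies whether the source is a five-row DataFrame or a trillion-row warehouse table.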
You may have many such data points but knowing what they are - that is your domain analysis.
Once the analysis is done, you are ready to focus on the technical objectives.
Pick common technologies that many engineers understand.
The Services list on AWS is alphabet soup, and the wish list of many a Big Data VP can be an equivalent alphabet soup. Successful projects begin with solid, simple technologies and a clear separation of concerns, so that if a simple technology ever needs a complicated replacement, it can be replaced.
What do you need to Ingest Data into a Data Warehouse?
A data staging repository
An event infrastructure
A data pipeline technology
One or more databases
What do you need to turn data into actionable data points?
A ubiquitous programming language - Python
Reporting systems that aggregate and deliver data
Self-service BI systems that allow business users to slice and dice aggregations
Data science software to analyze data
A Proven, Cost-Effective Data Stack
The data stack is essential, and it will vary according to existing technologies and platforms. The budget on the stack should reflect the nature of the components.
Spend money on
A SQL database and data warehouse - Snowflake
A BI environment - Tableau
Why should you spend money on these two technologies? A good database will provide a SQL interface to your data, and your team will be able to leverage their existing SQL knowledge to get value from the data. A good BI environment will open up the data to business users to self-serve when accessing the data.
Save money on
Event technologies - Fan out SNS to SQS
Data pipeline technology - AWS Step Functions and Lambda
Data Science and Analytics - Pandas
Why AWS Step Functions and Lambda over proven data integration tools? Because they push you toward a scalable architecture that won't lock your team into the bad practices associated with tools like Pentaho, Informatica, Talend, IBM, etc. Those tools are built on expensive servers, vendor lock-in, and proprietary integration interfaces.
Modern integration generally uses REST, and where it doesn't, the integration points are well-documented and well-published.
Google Cloud - https://cloud.google.com/python/docs/reference
Salesforce - https://pypi.org/project/simple-salesforce/
Essentially, the data connectors for any of the data integration platforms are public, free, and often of better quality than those provided by the data integration provider.
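As a sketch of how a public connector replaces a proprietary integration interface, here is a minimal pull from Salesforce using the simple-salesforce package linked above. The SOQL query, field names, and credentials are illustrative assumptions.

```python
def fetch_accounts(sf, limit=10):
    """Pull account records through any client exposing query_all(),
    e.g. a simple_salesforce.Salesforce instance."""
    result = sf.query_all(f"SELECT Id, Name FROM Account LIMIT {limit}")
    return result["records"]

if __name__ == "__main__":
    # Requires `pip install simple-salesforce` and real credentials.
    from simple_salesforce import Salesforce
    sf = Salesforce(username="user@example.com", password="...",
                    security_token="...")
    for record in fetch_accounts(sf):
        print(record["Name"])
```

Because the connector is just a thin client over a documented REST API, the extraction logic stays plain Python that any engineer can read and test.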
Choose a simple architecture with proven technologies.
AWS has several core technologies that should form the foundation of your data initiative architecture.
Data Staging Repository - https://aws.amazon.com/s3/
The Data Staging Repository is more than a simple folder structure and requires some careful consideration. The store is private and backs the SQL Database and Data Warehouse. It can also be thought of as the "Data Lake."
Events post from many sources. External applications can write events in, and event targets can call out to external applications. Internally, AWS CloudWatch Events can run on a schedule or even trigger from SQS posts.
Using the Simple Notification Service allows a single event point to serve one or many event queues.
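The SNS-to-SQS fanout can be wired up with boto3. The sketch below, with illustrative topic and queue names, builds the queue policy that lets one topic deliver to each subscribed queue; the AWS calls run only under the main guard because they need credentials.

```python
import json

def build_fanout_policy(topic_arn, queue_arn):
    """IAM policy letting one SNS topic deliver to an SQS queue -
    the core permission of an SNS-to-SQS fanout."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    })

if __name__ == "__main__":
    # Requires boto3 and AWS credentials; resource names are illustrative.
    import boto3
    sns, sqs = boto3.client("sns"), boto3.client("sqs")
    topic_arn = sns.create_topic(Name="ingest-events")["TopicArn"]
    for name in ("load-orders", "load-customers"):
        queue_url = sqs.create_queue(QueueName=name)["QueueUrl"]
        queue_arn = sqs.get_queue_attributes(
            QueueUrl=queue_url, AttributeNames=["QueueArn"]
        )["Attributes"]["QueueArn"]
        sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={
            "Policy": build_fanout_policy(topic_arn, queue_arn)})
        sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
```

One event point (the topic) now serves as many queues as you subscribe, and each queue drives its own pipeline independently.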
When building the application, CloudFormation templates allow you to schedule event rules for scheduled events.
```yaml
ScheduledEventRule:
  Type: "AWS::Events::Rule"
  Properties:
    Description: "Scheduled event to trigger Step Functions state machine"
    ScheduleExpression: "cron(30 10 ? * MON-SUN *)"
    State: "ENABLED"
    Targets:
      - Arn: !Ref BusinessAnalyticsPipeline
        Id: !GetAtt BusinessAnalyticsPipeline.Name
        RoleArn: !GetAtt ScheduledEventIAMRole.Arn
```
The data pipeline technology maps the loading process into a list of parallel pipelines. Replacing traditional data integration tools requires a state machine that describes the data flow and lets you separate flow from processing.
Data scalability should build parallelism into the pipeline, and the mapped loads should utilize the natural serverless scalability by running multiple pipelines in parallel when loading the data.
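The mapped parallel load can be sketched in plain Python. This stand-in uses a thread pool where a production pipeline would use a Step Functions Map state fanning out Lambda invocations; the partition scheme and loader are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def load_partition(partition):
    """Stand-in for one pipeline run - in production, one Step Functions
    Map iteration invoking a Lambda loader for this partition."""
    return {"partition": partition, "rows_loaded": len(partition) * 100}

# The mapped load: each partition is loaded by its own parallel pipeline,
# the way a Map state fans out serverless executions.
partitions = ["2020-10-01", "2020-10-02", "2020-10-03"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_partition, partitions))
print(results)
```

The key design point is that the flow (the map over partitions) is separate from the processing (the loader), so either can be swapped without touching the other.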
Snowflake can be expensive, but it has the advantage of separating compute from storage, which lets it scale to multiple TB/hr of data loading throughput.
On AWS, Snowflake stages data from S3, and it has equivalent stage stores on Azure and Google Cloud.
It runs massively distributed, column-oriented queries and backs Tableau or other analytics tools with standard SQL.
Make your Data Valuable
Valuable data needs
Performance - you need it fast
Quality - you need to cleanse data before publishing
Documentation - you need a data dictionary
At every step of making data valuable, the user need comes first. The domain model should capture the user needs.
Performance: Understand how the Domain Model maps to the Physical Data Storage
The event data will be enormous, and it will not be clean. Step one in cleansing is to understand the distribution keys of the data.
The distribution keys will be the natural keys for large parallel actions on the data. If your domain model centers on a person, the person identifier is an excellent choice for your distribution key. How do you recognize this key? It will be in nearly every join you do in SQL.
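One way to surface that key is to count which columns appear in the join conditions of your query workload. This is a rough sketch with a hypothetical query log and a deliberately naive regex, not a full SQL parser.

```python
import re
from collections import Counter

# Hypothetical query log: the key appearing in nearly every join
# is a strong candidate for the distribution (clustering) key.
queries = [
    "SELECT * FROM orders o JOIN persons p ON o.person_id = p.person_id",
    "SELECT * FROM visits v JOIN persons p ON v.person_id = p.person_id",
    "SELECT * FROM orders o JOIN stores s ON o.store_id = s.store_id",
]

join_keys = Counter()
for sql in queries:
    for left, right in re.findall(r"ON \w+\.(\w+) = \w+\.(\w+)", sql):
        # Count each distinct column once per join condition.
        join_keys.update({left, right} if left != right else {left})

print(join_keys.most_common(1))  # → [('person_id', 2)]
```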
In Snowflake, this will be your clustering key: https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html.
Snowflake will cluster your tables automatically: https://docs.snowflake.com/en/user-guide/tables-auto-reclustering.html.
Quality: Cleansing the data before Publishing
Develop pipelines in Step Functions that are dedicated to cleansing the data.
Write SQL Tests for
Filling in missing data
Removing stale and bad data
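The SQL tests above can run as plain queries from a pipeline step. The sketch below uses an in-memory SQLite table as a stand-in for the warehouse; the table, columns, and staleness cutoff are assumptions.

```python
import sqlite3

# In-memory stand-in for the warehouse; in production these same SQL
# tests would run against Snowflake from a cleansing pipeline step.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE persons (person_id INTEGER, email TEXT, updated_at TEXT)")
db.executemany("INSERT INTO persons VALUES (?, ?, ?)", [
    (1, "a@example.com", "2020-10-01"),
    (2, None, "2020-10-02"),             # missing data
    (3, "c@example.com", "2015-01-01"),  # stale data
])

# SQL test: rows with missing email that need filling in.
missing = db.execute(
    "SELECT COUNT(*) FROM persons WHERE email IS NULL").fetchone()[0]

# SQL test: rows not updated recently that should be removed or refreshed.
stale = db.execute(
    "SELECT COUNT(*) FROM persons WHERE updated_at < '2020-01-01'").fetchone()[0]

print(f"missing={missing}, stale={stale}")
```

A pipeline can fail (or branch to a remediation step) whenever these counts exceed a threshold.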
Cleanse Data Content
Validate address data using services like the Google Geocoding API
Cleanse data with Python tools you understand
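A content cleansing step in Pandas might look like the sketch below; the DataFrame and the two rules (de-duplicate on the natural key, fill missing cities) are illustrative assumptions.

```python
import pandas as pd

raw = pd.DataFrame({
    "person_id": [1, 2, 2, 3],
    "city": ["Boston", None, None, "Austin"],
})

cleaned = (
    raw.drop_duplicates(subset="person_id")  # remove duplicate rows
       .fillna({"city": "unknown"})          # fill in missing data
)
print(cleaned)
```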
Build Self-Service Data Access for Customers
When organizations are data-hungry, the consumers of the data need self-service access. The domain model in the early stages of your data project is crucial when building a self-service data access endpoint.
Turn your Domain Model into a Data Dictionary
The domain model is the guide to building a proper Data Dictionary. I recommend utilizing JSON object definitions in a simple React component to produce data dictionaries for the data you provide.
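To show the idea without the React layer, here is a sketch that renders the same kind of JSON field definitions into a Markdown data dictionary; the field names and schema are hypothetical.

```python
import json

# Hypothetical JSON field definitions - the same objects a React
# component could render as a data-dictionary page.
definitions = json.loads("""
[
  {"field": "person_id", "type": "integer", "description": "Natural key for a person"},
  {"field": "total_revenue", "type": "decimal", "description": "Lifetime revenue"}
]
""")

def to_markdown(defs):
    """Render field definitions as a Markdown data-dictionary table."""
    lines = ["| Field | Type | Description |", "| --- | --- | --- |"]
    lines += [f"| {d['field']} | {d['type']} | {d['description']} |" for d in defs]
    return "\n".join(lines)

print(to_markdown(definitions))
```

Keeping the definitions in JSON means one source of truth can feed both the rendered dictionary and any validation tooling.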
Publish the Models in a Self-Service BI Tool
Many BI tools exist to allow users to self-serve on your data. I generally use Tableau because it is a de-facto standard for BI tools. It is immensely flexible and powerful, even if it is overpriced.
If the price is too much, the React Google Charts API offers many good visualization components that embed directly in a React application.