Tim Burns

Value, Delivery, Security, Repeatability, and Simplicity

Updated: Apr 24

Improving Data Architecture with AWS, Snowflake, and Open Source Tools


I'm trying that on as the title for a book I'm writing. One of the things I like about Medium is that it shows me who my readers are, and overwhelmingly they are fellow data engineers and architects. Originally I had more of a theory book in mind, but I'm realizing there is an incredible need for recipes for building big data applications in Snowflake.


After more than twenty years of building data architecture, I can see the larger picture, and I have gathered quite a few opinions along the way.


For example:

  • Build Value First - It's easy to be myopic about technology and blind to whether what you're building has value.

  • Delivery - Pick a data platform appropriate to your needs:

    ◦ Snowflake is for very large (TB+) data sets and the data lake.

    ◦ Postgres or another relational system will be appropriate for many applications.

  • Security - Security should come first, because it is very difficult to retrofit security into an application once it is in production.

  • Repeatability - It might be called "DevOps" or "CI/CD," but the point is that your build must be repeatable: you need to be able to stand up infrastructure in a repeatable way.

  • Simplicity - Less code is always better. By picking quality open-source tools to realize your data architecture and a powerful platform like Snowflake to build on, you can keep your application as simple as possible. Just because you can write code doesn't mean you should.

We'll see where it goes.


From my Medium articles, I've learned that my audience is technical and interested in actual recipes for building data architectures. This is great, because my forte is putting together crafted, simple recipes that build out important components of a data platform. That leaves me to pick the tools. They are:

  • Snowflake - The Data Platform

  • Terraform - Infrastructure Components

  • Airbyte - Data Integration Tool

  • dbt - Data Pipeline Transformations

  • Kubernetes - Runtime and scaling

  • Airflow - Orchestration

These are all tools with solid open-source and community support. They've been proven repeatedly and are considered state-of-the-art today.
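
To give a flavor of how these pieces snap together, here is a minimal Airflow sketch that runs a dbt build on a schedule. The DAG id and the project directory are hypothetical, and dbt is assumed to read its Snowflake credentials from its own profiles.yml:

    # Minimal Airflow DAG sketch: orchestrate a scheduled dbt build.
    # The dag_id and /opt/dbt/analytics path are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_dbt_build",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --project-dir /opt/dbt/analytics",
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="dbt test --project-dir /opt/dbt/analytics",
        )
        # Only test the models after they have been built.
        dbt_run >> dbt_test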


For the platform, I am choosing Snowflake over Databricks: though Databricks is extremely innovative, I feel that Snowflake offers a governance story and a maturity that Databricks doesn't yet have. The cost comparison between the two (as in my previous article) is a red herring. Cost is driven by poor governance, and no technology will save you from bad strategy, governance, and execution. Focus on value, and the cost will follow (the point of this article).


Notes

Steps

  1. Provision the Snowflake pipeline infrastructure using Terraform

  2. Create the Snowpipe using dbt (the first sketch after this list shows the underlying DDL)

  3. Parameterize new data additions

  4. Track lineage consistently

  5. Load historical data

  6. Set up SQS data triggers on AWS (second sketch below)

  7. Monitor data loading for observability (third sketch below)
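
In the book, steps 1 and 2 will be done with Terraform and dbt; as a stand-in, here is a sketch of the DDL behind them, issued with snowflake-connector-python. The account, warehouse, bucket, and object names are all hypothetical:

    # Sketch: an auto-ingest Snowpipe over an S3 stage. Every name here
    # (account, warehouse, database, stage, pipe, bucket) is a placeholder.
    import os

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="loader",
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="LOAD_WH",
        database="RAW",
        schema="LANDING",
    )
    cur = conn.cursor()

    # External stage over the landing bucket. The storage integration
    # itself is the kind of object Terraform would provision (step 1).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS raw.landing.events_stage
          URL = 's3://my-landing-bucket/events/'
          STORAGE_INTEGRATION = s3_int
    """)

    # AUTO_INGEST = TRUE makes Snowflake listen on an internal SQS queue
    # that S3 event notifications will feed (wired up in step 6).
    cur.execute("""
        CREATE PIPE IF NOT EXISTS raw.landing.events_pipe
          AUTO_INGEST = TRUE
          AS COPY INTO raw.landing.events
             FROM @raw.landing.events_stage
             FILE_FORMAT = (TYPE = 'JSON')
    """)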
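
For step 6, Snowflake exposes the SQS queue the pipe listens on in the notification_channel column of DESC PIPE, so S3 just needs an event notification pointed at that ARN. A boto3 sketch, continuing with the cursor from above (the bucket and prefix are again placeholders):

    # Sketch: route S3 "object created" events to the pipe's SQS queue.
    import boto3

    # The notification_channel column of DESC PIPE holds the ARN of the
    # SQS queue Snowflake listens on for this pipe.
    cur.execute("DESC PIPE raw.landing.events_pipe")
    cols = [c[0] for c in cur.description]
    pipe = dict(zip(cols, cur.fetchone()))
    queue_arn = pipe["notification_channel"]

    # Note: this call replaces the bucket's existing notification config.
    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket="my-landing-bucket",
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "QueueArn": queue_arn,
                    "Events": ["s3:ObjectCreated:*"],
                    "Filter": {
                        "Key": {
                            "FilterRules": [
                                {"Name": "prefix", "Value": "events/"},
                            ]
                        }
                    },
                }
            ]
        },
    )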
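
And for step 7, two built-in Snowflake functions cover basic observability: SYSTEM$PIPE_STATUS for the pipe's current state and COPY_HISTORY for per-file load results. A sketch, again reusing the cursor:

    # Sketch: basic observability for the load (step 7).

    # Current pipe state: returns JSON with executionState,
    # pendingFileCount, and the last error, if any.
    cur.execute("SELECT SYSTEM$PIPE_STATUS('raw.landing.events_pipe')")
    print(cur.fetchone()[0])

    # Per-file load results for the last 24 hours on the target table.
    cur.execute("""
        SELECT file_name, status, row_count, first_error_message
        FROM TABLE(information_schema.copy_history(
            table_name => 'EVENTS',
            start_time => DATEADD(hours, -24, CURRENT_TIMESTAMP())))
        ORDER BY last_load_time DESC
    """)
    for file_name, status, rows, error in cur.fetchall():
        print(file_name, status, rows, error)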

Decentralized Governance

  1. Create dashboards to monitor data quality (a sample dashboard query follows this list)

  2. The people who work with the data every day should be the ones creating the rules

  3. Many domains contribute to the final result
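
As a concrete starting point for the first item, here is the kind of query a data-quality dashboard could poll, again via the connector. The table and its columns are hypothetical:

    # Sketch: a data-quality metric feed for a dashboard. The events
    # table and its columns are placeholders.
    cur.execute("""
        SELECT
            COUNT(*)                              AS total_rows,
            COUNT_IF(customer_id IS NULL)
                / NULLIF(COUNT(*), 0)             AS null_id_rate,
            MAX(loaded_at)                        AS last_load
        FROM raw.landing.events
    """)
    total_rows, null_id_rate, last_load = cur.fetchone()
    print(total_rows, null_id_rate, last_load)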


As the data warehouse in the middle gets smaller, the incoming data on one side and the outgoing analytics capabilities on the other need to get bigger. The shape is like a bow tie.


