You have a mountain of data and the unlimited potential of a cloud data architecture to load it, but where should you start? Maybe you are starting with a new feed and want to ensure your data pipeline scales, or you have inherited a beastly legacy system and need to extend it into the cloud. The best place to start with Data Warehousing is the Kimball Model. How can you recognize basic patterns in organizing and refining data? How can you avoid common and expensive pitfalls? This article revisits the principles of the Kimball Data warehouse as a fundamental starting point for modern data systems.
Everyone Looks good at the Starting Line.
That's a line from one of my favorite songs by Paul Thorne. Our data engineer also looks good at the starting line. He works for a company that pulls data from various REST Apis and writes them to a cloud data object store. Then, he flips the pipeline switch and pushes that data into a data warehouse, a data lakehouse, or some other big data processing platform that the business users can leverage to query the resulting data.
Over time, the rest data sets he loads multiply and grow. Now, the customer wants ten data feeds. They want to keep all their old reports, but now they want ad-hoc and new ones. They want to leverage consistent metrics and properties across inconsistent data dictionaries. They want it fast, they want it accurate, and they don't want to have to think about it too much. So our data engineer will look hard for ways to address the customers' needs.
Performance is a challenge, so he tries to leverage the better technologies offered, extra-large data warehouses, Apache Spark, and streaming pipelines. Usability is terrible because the data lacks a consistent data dictionary. The simplistic approach of building a data pipeline in an ad-hoc way fails. Because without changing the fundamentals behind the database, the customers are still unsatisfied with the performance (not to mention the cost) and usability of the analytics capabilities on their data. The Kimball Data warehousing method addresses those needs in a proven methodology that spans many subject domains.
What are the Pitfalls our intrepid data engineer will encounter?
The data sets from the REST Apis are very large and slow to query
Identity keys differ between the data sets, and comparing apples to apples is hard
Querying changes over time on the loaded data is impossible
Constant realignment of the queries turns the Data Engineering department into a SQL Sweatshop
The Kimball Model is a Design Pattern
Design patterns help us frame our current problems in a way that engineers have repeatedly solved. The Kimball Group Reader provides a guide to design patterns in Data Engineering. Here are some fundamental commonalities between the two:
Identify unique and shared properties of the incoming data sets.
Start with a simple model that is optimized toward user understanding.
How do I know if I should use the Kimball Model?
The Kimball Model is all about gathering operational data, refining it, and producing an analytical database to answer business questions about that data. Many organizations and development teams may not have that purpose, so it is essential to understand if you should use the Kimball Model before engaging in a project to build a data warehouse. Many data teams are only producing an intermediate product.
What if I am only gathering data and passing it back to another source for analysis?
You probably are not implementing a Data Warehouse if you have a pass-through system. The purpose of the data warehouse is to gather all the operational data together and transform it to support business decision-making.
Many companies exist to do such, and they can be helpful to sources within your data warehouse. Their role is to cleanse, refine, and enhance the data. Examples of companies that have a successful intermediary product but do not serve as examples as data warehouses are:
Both use cases are legitimate, but if you pass data through to be analyzed elsewhere, you aren't doing data warehousing, and the Kimball Model is not for you.
I Do Need the Kimball Data Model - Where do I Start?
Even a medium-sized enterprise generates and stores mountains of data in its operational systems. A data initiative will have many active data stores and may even have existing data warehousing implementations. It may feel like you are being asked to boil the ocean.
A data warehouse initiative is a bit like boiling the ocean, EXCEPT, you can put it into an unstructured object store and boil it in segments by priority.
The object store services on the cloud essentially provide a virtual ocean to hold unstructured data. Gathering data sources into a secure object store (or data lake) is the first step to building a data warehouse.
What if I already have a Data Warehouse, but it isn't working?
Data Warehousing projects are complex and complicated. You may be coming into a project with an existing data warehouse on an older technology. On the other hand, you may be coming into a data warehousing project that failed.
In a Data Warehouse, all data is treated the same, but not all data is equal.
The end goal of the data warehouse is an analytics dataset that can be trusted for business decision-making. In short: Garbage in, garbage out. That leads to my next blog post: Data Quality.