Data Warehousing, the Three-Tier Data Pattern, and Normal Forms (2NF, 3NF)
The Data Warehouse (Lakehouse)
So we want to store all our data in a data warehouse to make business decisions based on data. Every data warehouse has some form of the Three-Tier Pattern. The names may or may not be consistent, but usually, there are three major levels: The Landing Zone, The Processing Layer, and the Data Warehouse.
The Landing Zone
Data in the Landing Zone Layer is ingested from external sources. They might be other databases, CSV files, JSON files, etc. It contains a complete representation of all the data for every source of our data warehouse and is stored in its native format. In this layer, we check for missing or inconsistent data. Databricks calls this layer the "Bronze" layer in their medallion architecture.
The Processing Layer
The Processing Layer is where the data is processed and organized. Generally, the result of the processing layer is transforming the landing zone data into the Second or Third Normal Form (2NF, 3NF). The normal forms are layers where operational data is normalized into structures that facilitate query performance.
The 2NF and 3NF in the Processing Layer
The 2NF is a normalized form where all the data is split into parts representing a specific business concept, for example, retailer, vendor, customer, or product. The 3NF is a further refinement that normalizes data so that an insert or an update writes data into only one place.
Design Principle: The 3NF Normalizes data so that an insert or update writes only into one place.
Examples of Processing Layer Transformations:
Normalize data across sources
Pivot data to a single primary key attribute
The Data Warehouse Layer
The Data Warehouse Layer is for Analytics and Reporting. The Data Warehouse Layer aims to reduce the processing layer's complexity to a Star Schema to optimize the data for querying.
A common mistake is confusing the 3NF and the Data Warehouse. Normalization will introduce a tremendous amount of unnecessary complexity, and the purpose of the Data Warehouse is for the end user that builds reports for analysis. The Star Schema reduces the complexity of the 3NF and presents a simplified and unified schema for analysis.
A litmus test on whether you are trying to use the 3NF as a Data Warehouse is that your tables have too many unnecessary or empty details.