Thoughts on Surrogate Keys in Spark, Snowflake, and Databricks

As a database developer, I take surrogate keys for granted. The surrogate key is the backbone of a data architecture.

In the Kimball style of a data warehouse, every table has as its first column the primary key. These columns provide unique references for each row so that we can create relational analytical structures as we process from the Normal forms (1NF, 2NF, 3NF) to analytical structures. These keys are used for deduplication, tracking lineage, data cleaning, and much more. Data developers are used to having these keys and use them to maintain the database.

<table_name>_key INT PRIMARY KEY AUTOINCREMENT START 1 INCREMENT 1 comment 'A surrogate key for the table row.',

Unfortunately, this is not so straightforward in Spark because of distributed processing.

The long and the short of it is that row_number() is inappropriate for use as a surrogate key and will generate an unnecessary global operation, and monotonically_increasing_id() is suitable for many cases but will not guarantee an increase by only one on every key value.

Another possibility to handle this issue is using Snowflake as an intermediate step for crucial data processing steps. Of course, Snowflake has the drawback that it is insanely expensive, but if you avoid performing the analytics and transformations in Snowflake, you can avoid the high costs. If you can afford it, it's in a class by itself as a data warehousing platform.

Additionally, Databricks provides a solution in Unity Catalog (not part of the open-source Apache Spark).

Data Modeling Best Practices & Implementation on a Modern Lakehouse

Essentially, a solid surrogate critical infrastructure is one of the features you pay for in a commercial product.