What is this TODO Sticky that Says #deltasharing?
I enjoyed writing the Snowflake articles. However, the computing time to write the queries cost me $25. In addition, the queries on my billing statement only gathered user, role, and connection metadata, so it's easy to see how Snowflake's "Pay as you Go" model can get very expensive, very fast. That's why I put a big yellow #deltasharing sticky on my monitor for when I woke up and decided to learn something new.
Delta Sharing is driven by Databricks, the commercial end of the Apache Spark project. It's open-source, so I can leverage the processing power of my own MacBook to do any experimentation.
Some articles from my fellow Medium authors:
The Delta Lake concept is a marriage of the Data Lake (file-based database on the cloud) and the structured approach of the traditional Data Warehouse.
I have a healthy level of skepticism about any technology billed as a cure-all for the difficulties of managing data. Data, more than other aspects of Software Engineering like UI, Middleware, etc., needs clarity of vision. I've worked on enough great data projects that had a single, controlling leader who enforced a simple, clean data model to know that, yes, the "Guru-does-all" approach to a data warehouse achieves great results. But I've also seen what happens when an "Agile Team" turns a mess of junior data engineers loose with ELT and random stories.
Powerful ELT and modern data warehousing technologies give data engineers more rope to turn into tangled data messes. The limiting factor on a successful data initiative is, and will always be, well-governed and consistent implementations led with a single competent data plan. Data has a long lifespan, and organizing, transforming, and building analytical data summaries needs a clear vision.
I digress - Delta Lake is cool because it has the database qualities of a Data Warehouse and the extensibility and cost-effectiveness of a data lake. However, I suspect it will also have the pitfalls of any large data project requiring clear data vision.
Oh yeah, and on Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. It's an interesting paper combining Marketing Optimism (the Databricks authors) and Academic Theory. I liked what they had to say about establishing a metadata layer.
In addition, metadata layers are a natural place to implement data quality enforcement features. For example, Delta Lake implements schema enforcement to ensure that the data uploaded to a table matches its schema, and constraints API that allows table owners to set constraints on the ingested data (e.g., country can only be one of a list of values). Delta’s client libraries will automatically reject records that violate these expectations or quarantine them in a special location. Customers have found these simple features very useful to improve the quality of data lake based pipelines.
At any reporting layer, preventing bad data from entering the Database layer (the SQL analytics portion) is much more effective than trying to detect and fix bad data once it is in the Database layer. May the Good Lord help the organizations that attempt to address bad data by adding layer upon layer of transformation and complex regressions on every layer to "watch" bad data. Once you are in the vicious cycle of detecting bad data at the end of a data pipeline, you have so many layers of data that you might as well throw out the entire data stack, evaluate your analytical goals, and rebuild it from scratch.
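To make the "reject or quarantine at the door" idea concrete, here is a minimal sketch in plain Python. It is not Delta Lake's API. The schema, the list of allowed countries, and the names `SCHEMA`, `CONSTRAINTS`, `violation`, and `ingest` are all assumptions I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical table schema: column name -> required Python type.
SCHEMA: Dict[str, type] = {"order_id": int, "country": str, "amount": float}

# Hypothetical constraints, in the spirit of "country can only be one of a list".
ALLOWED_COUNTRIES = {"US", "CA", "MX"}
CONSTRAINTS: Dict[str, Callable[[Dict[str, Any]], bool]] = {
    "country_in_list": lambda r: r["country"] in ALLOWED_COUNTRIES,
    "amount_non_negative": lambda r: r["amount"] >= 0,
}


@dataclass
class IngestResult:
    accepted: List[Dict[str, Any]] = field(default_factory=list)
    # Each quarantined entry keeps the record and the reason it was held back.
    quarantined: List[Tuple[Dict[str, Any], str]] = field(default_factory=list)


def violation(record: Dict[str, Any]) -> Optional[str]:
    """Return the first reason the record is invalid, or None if it is clean."""
    # Schema enforcement: exactly the expected columns, with the expected types.
    if set(record) != set(SCHEMA):
        return "schema: column mismatch"
    for column, expected_type in SCHEMA.items():
        if not isinstance(record[column], expected_type):
            return f"schema: {column} is not {expected_type.__name__}"
    # Constraints are checked only after the schema check has passed.
    for name, check in CONSTRAINTS.items():
        if not check(record):
            return f"constraint: {name}"
    return None


def ingest(records: List[Dict[str, Any]]) -> IngestResult:
    """Split a batch into accepted rows and quarantined rows before loading."""
    result = IngestResult()
    for record in records:
        reason = violation(record)
        if reason is None:
            result.accepted.append(record)
        else:
            result.quarantined.append((record, reason))
    return result
```

Calling `ingest` on a batch returns the clean rows in `accepted` and everything else in `quarantined` with a reason attached, so bad rows never reach the SQL analytics layer and nobody has to hunt for them downstream.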
Finally, metadata layers are a natural place to implement governance features such as access control and audit logging. For example, a metadata layer can check whether a client is allowed to access a table before granting it credentials to read the raw data in the table from a cloud object store, and can reliably log all accesses.
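The check-then-grant flow in that passage could be sketched roughly like this. It is a toy in plain Python, not a real catalog. The `MetadataLayer` class, the `grants` mapping, and the credential it hands back are hypothetical placeholders, and a real system would issue short-lived credentials from the cloud provider instead.

```python
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Set


@dataclass
class MetadataLayer:
    # Hypothetical grants: table name -> principals allowed to read it.
    grants: Dict[str, Set[str]]
    audit_log: List[Dict[str, Any]] = field(default_factory=list)

    def request_read(self, principal: str, table: str) -> Dict[str, str]:
        """Check access, log the attempt, then hand back a placeholder credential."""
        allowed = principal in self.grants.get(table, set())
        # Every attempt is logged, whether or not it is allowed.
        self.audit_log.append(
            {
                "at": datetime.now(timezone.utc).isoformat(),
                "principal": principal,
                "table": table,
                "allowed": allowed,
            }
        )
        if not allowed:
            raise PermissionError(f"{principal} may not read {table}")
        # Placeholder token standing in for a scoped object-store credential.
        return {"table": table, "token": secrets.token_hex(8)}
```

Note that this only governs requests that go through the layer. Anyone holding direct permissions on the underlying bucket never calls `request_read` at all.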
A metadata layer (e.g., a data catalog) is a good place to add governance. That's what databases are, essentially: points of contact where we can assign roles, groups, and view masks, and allocate users to access data securely. But it isn't the only place where we need to add governance. To give users comprehensive access, we need to give them controlled access to the cloud provider's object store using the provider's infrastructure, not the Data Lakehouse infrastructure. This is not so much a Data Lakehouse activity as a roles and permissions activity on the object store in question. The pure vision of a Data Lakehouse as a controlled metadata layer on top of a cloud object store is a fantasy.
This diagram from CIDR '21 is a Fantasy
A more realistic vision of a modern data analytics environment might have a Lakehouse but will also have raw data à la a Data Lake.
Messy Diagram showing Heterogeneous Environment
The diagram is messy because the governance and metadata are complex. We are going to have a lot of data that we don't want or need for analytics until a Data Scientist or Data Engineer brings it into a database (a connection that can run SQL). The Governance Admin is then responsible for saying where they can write, how they can write, and what structure the data takes. I don't see how it can be boiled down into a single technology like Spark or Lakehouse any more than I see how one kind of database, like relational, is adequate for all modern data use cases.