Our world is filled with layers of information, waiting to be comprehended (Author)
I have been researching how to use Machine Learning to build domain models, quality controls, and natural language analysis around data pipelines. The deeper I delve into ML, the more convinced I am that it will change how we build data pipelines even more radically than the recent transition from on-prem to cloud-based solutions did.
Business data grows faster than many data engineering teams can keep up with. Data engineers have many tools at hand: Snowflake, AWS, Terraform, and dbt. However, orchestrating meaning and action in the data pipeline remains a persistent problem. Natural language processing engines like OpenAI offer an automated mechanism to connect components logically without constant human intervention. As a result, analysts and engineers can supervise the process of turning data into value, make connections extending beyond siloed domains, and ultimately build better data products.
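To make the idea concrete, here is a minimal sketch of how a language model might be asked to propose quality checks for a pipeline column, with an analyst supervising rather than hand-writing every rule. The prompt wording, the stubbed response, and the helper names (`build_prompt`, `parse_checks`) are illustrative assumptions, not a specific OpenAI API; in production the response text would come from an actual model call.

```python
# Hypothetical sketch: prompting a language model to suggest data-quality
# checks for a column, then parsing its answer into machine-usable rules.
# The model call itself is stubbed so the example runs offline.

def build_prompt(table: str, column: str, dtype: str) -> str:
    """Compose a natural-language request for suggested quality checks."""
    return (
        f"Suggest data-quality checks for column '{column}' "
        f"(type {dtype}) in table '{table}'. "
        "Answer with one check per line."
    )

def parse_checks(response_text: str) -> list[str]:
    """Turn a line-per-check answer into a list of rules for review."""
    return [
        line.strip("- ").strip()
        for line in response_text.splitlines()
        if line.strip()
    ]

# Stand-in for the model's reply; a real pipeline would send build_prompt(...)
# to an NLP engine and read its response here.
stub_response = "- not_null\n- unique\n- values match ISO-3166 country codes"

prompt = build_prompt("orders", "country_code", "varchar")
checks = parse_checks(stub_response)
print(checks)  # ['not_null', 'unique', 'values match ISO-3166 country codes']
```

The point is not the specific checks but the division of labor: the model drafts candidate rules from schema context, and the engineer approves or rejects them before they reach the pipeline.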
As C. Samiulla puts it in his piece on testing and monitoring:
In this way testing & monitoring are like battle armor. Too little and you are vulnerable. Too much, and you can barely move.
D. Sculley et al. (2015) Hidden Technical Debt in Machine Learning Systems
E. Samuylova (2020) Machine Learning in Production: Why You Should Care About Data and Concept Drift
C. Samiulla (2020) Monitoring Machine Learning Models in Production
D. Sato et al. (2019) Continuous Delivery for Machine Learning
E. Breck et al. (2017) The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
S. Amershi et al. (2019) Software Engineering for Machine Learning: A Case Study
D. Le et al. (2020) Baselines
B. Nushi (2021) Responsible Machine Learning with Error Analysis
B. Gao et al. (2017) Deep Label Distribution Learning with Label Ambiguity
A. Ng (2023) Building a data pipeline
B. Mathes et al. (2021) ML Metadata: Version Control for ML
C. Wiley (2020) Key requirements for an MLOps foundation