Picking a scheduler is one of the most fundamental architecture choices you can make. Why? The scheduler drives the entire system: it defines how and when data arrives, it shapes your monitoring choices, and it sets the maintenance lifecycle of a data operation.
In order of preference, here are my thoughts on the different scheduling styles:
1. The Database
2. The Cloud Provider
3. A Third-Party Data Pipeline Tool
Using the database itself as the scheduler is common among those who have run data warehouses for many years. The reason I like it most is that it keeps the logic for gathering the data and the destination of the data in one place. It is also extremely flexible: each job can be a completely independent, heterogeneous implementation.
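As a concrete illustration, here is a minimal sketch of database-driven scheduling, assuming Postgres with the pg_cron extension installed; the job name and the `load_sales_from_staging()` procedure are hypothetical:

```python
# Register a nightly job that runs entirely inside the database:
# both the schedule and the load logic live next to the data.
import psycopg2

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT cron.schedule(
            'nightly-sales-load',            -- job name
            '0 3 * * *',                     -- 3 AM every day, cron syntax
            $$CALL load_sales_from_staging()$$
        );
    """)
```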
It has the disadvantage that you need to build the monitoring for the database scheduler yourself. However, developing that monitoring, say with a tool like New Relic, is excellent grounding for an SRE team.
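A minimal sketch of that home-built monitoring, assuming pg_cron 1.4 or later (which records run history in the `cron.job_run_details` table); the alert hook is a placeholder for whatever vendor you wire in:

```python
# Poll pg_cron's run-history table for recent failures.
import psycopg2

def check_failed_jobs(conn):
    with conn.cursor() as cur:
        cur.execute("""
            SELECT jobid, command, status, return_message, end_time
            FROM cron.job_run_details
            WHERE status = 'failed'
              AND end_time > now() - interval '1 day'
        """)
        return cur.fetchall()

conn = psycopg2.connect("dbname=warehouse")
for failure in check_failed_jobs(conn):
    # Placeholder alert: in practice, ship this to New Relic or similar.
    print(f"ALERT: scheduled job failed: {failure}")
```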
My second choice is the cloud provider, and I am thinking of AWS here. The advantage of using a cloud provider is that the scheduling tools are more sophisticated. For example, AWS provides Step Functions, which let you coordinate a sequence of scheduled actions, something a database scheduler cannot do. Additionally, AWS has CloudWatch, which is an excellent monitoring solution.
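Here is a minimal sketch of that style, assuming an existing Step Functions state machine and IAM role; the rule name and ARNs are hypothetical:

```python
# Schedule a Step Functions state machine via an EventBridge cron rule.
import boto3

events = boto3.client("events")

# Note AWS's six-field cron syntax (minutes hours day month weekday year).
events.put_rule(
    Name="nightly-etl-trigger",
    ScheduleExpression="cron(0 3 * * ? *)",
)

# Point the rule at the state machine, which coordinates the ETL steps.
events.put_targets(
    Rule="nightly-etl-trigger",
    Targets=[{
        "Id": "nightly-etl",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:nightly-etl",
        "RoleArn": "arn:aws:iam::123456789012:role/etl-events-role",
    }],
)
```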
My third choice is a SaaS ETL provider. Third-party ETL tools shine when a whole team needs to manage a complex set of ETL pipelines and services. Scheduling and monitoring come out of the box, which makes this the simplest option for ETL monitoring: it is built directly into the tool, so you don't need to assemble a monitoring solution from another vendor.
Popular SaaS ETL providers that I have used and like:
And of course, there is a fourth choice: Apache Airflow. I've used it, and in many ways it has become the de facto standard for data engineering teams.
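For comparison, the same nightly job as a minimal Airflow DAG, assuming Airflow 2.4+; the dag_id, schedule, and task callable are illustrative placeholders:

```python
# Declarative DAG: Airflow owns the schedule, retries, and run history.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_sales():
    # Placeholder for the actual load logic.
    print("loading sales from staging")

with DAG(
    dag_id="nightly_sales_load",
    start_date=datetime(2024, 1, 1),
    schedule="0 3 * * *",  # same 3 AM cron expression as above
    catchup=False,
) as dag:
    PythonOperator(task_id="load_sales", python_callable=load_sales)
```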
I am thinking long and hard about each of these potential solutions.