Once we've gathered the data we need into a data lake, we need to assess its quality and build pipelines that bring all of it into a consistent, high-quality form.
How do I measure the quality of existing data?
Once the data is all in one place - the Data Lake (a secured object store) - it is time to start assessing data for quality. An excellent place to start is a table containing each data source with the quality measures for that data source.
Here is an example from the KEXP data set. Keep a table containing the data quality assessment for each data source in your data lake, and maintain it as carefully as you would the data itself. If you're following along, create a similar table for the primary data source of your data warehouse and fill in the gaps. Below I show the KEXP example along with queries that measure the quality of an incoming data set.
Cross-checking against the external source is an excellent way to validate incoming data: the data in your system should match the data in the external system. Validation can also be done against the operational data sets; summary data from the operational data store is another good external reference for checking the accuracy of your data.
Always gather summary data from the operational data store.
It's always helpful to keep a separate summary of counts, maximums, and minimums for the data points in an external data set to use when building accuracy tests.
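An accuracy test built on such summaries can be a simple comparison of per-program statistics gathered from each side. The sketch below uses hypothetical hard-coded summary rows (in practice these would come from queries against the operational store and the staged lake data); the function name and tuple layout are illustrative assumptions.

```python
from datetime import date

# Hypothetical summaries keyed by program: (row count, min date, max date).
# In practice, gather one from the operational store and one from the lake's
# STAGE schema, then diff them.
operational = {
    "Audioasis": (6113, date(2020, 3, 21), date(2023, 2, 25)),
    "90.TEEN": (4285, date(2020, 3, 21), date(2023, 2, 25)),
}
staged = {
    "Audioasis": (6113, date(2020, 3, 21), date(2023, 2, 25)),
    "90.TEEN": (4285, date(2020, 3, 21), date(2023, 2, 25)),
}

def accuracy_mismatches(source, target):
    """Return the programs whose count/min/max summaries disagree."""
    return [name for name, stats in source.items()
            if target.get(name) != stats]

print(accuracy_mismatches(operational, staged))  # [] when the copies agree
```

Running the same check after every load makes accuracy regressions visible immediately rather than at analysis time.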
For my example, the KEXP data set is the product of a Django Web API and is easy to validate independently by going to the radio station web page and following the playlist along with the JSON API output.
Counts, samples, and validation rules are all excellent means of measuring data completeness. Once the data is copied into a STAGE schema, you can validate that your operational data is complete by querying the staged data with standard SQL.
Review the data for statistics like COUNT, MIN, and MAX.
select PROGRAM_NAME, count(*) counts, min(START_TIME)::date min, max(START_TIME)::date max
from STAGE.PLAYLIST -- table name assumed; substitute your staged KEXP plays table
group by PROGRAM_NAME
order by PROGRAM_NAME;
Use these statistics to measure completeness at every stage of the warehousing transformations.
|PROGRAM_NAME|COUNTS|MIN       |MAX       |
|------------|------|----------|----------|
|90.TEEN     |4285  |2020-03-21|2023-02-25|
|Audioasis   |6113  |2020-03-21|2023-02-25|
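To run that same profile at every stage of the transformation, you can script it and diff the results between schemas. This is a minimal self-contained sketch using an in-memory SQLite database; the table and column names follow the KEXP example and should be adapted to your own staged schema.

```python
import sqlite3

# Build a tiny stand-in for the staged playlist table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE playlist (program_name TEXT, start_time TEXT)")
conn.executemany("INSERT INTO playlist VALUES (?, ?)", [
    ("Audioasis", "2020-03-21"),
    ("Audioasis", "2023-02-25"),
    ("90.TEEN", "2021-01-01"),
])

# The same COUNT/MIN/MAX profile as the SQL above; run it against each
# schema (STAGE, warehouse, marts) and compare the outputs.
profile = conn.execute("""
    SELECT program_name, COUNT(*), MIN(start_time), MAX(start_time)
    FROM playlist
    GROUP BY program_name
    ORDER BY program_name
""").fetchall()

for row in profile:
    print(row)
```

Comparing this profile before and after each transformation flags dropped or duplicated rows without inspecting the data itself.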
Consistency validation is about checking that the data is consistent across datasets.
Remember, in a Data Warehouse, we are bringing multiple operational datasets together for analysis, and consistency is key to comparing apples to apples during analysis.
Again, we don't need to boil the ocean for the initial data load, but rather note which aspects of a dataset must be made consistent with the other data sets.
The most obvious element we must make consistent is the time zone. KEXP is in the Pacific time zone, so recording that at the dimensional level is critical to maintaining consistency in the data warehouse.
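One common way to handle this is to normalize every timestamp to UTC as it enters the warehouse, keeping the source zone as dimensional metadata. A minimal sketch using Python's standard `zoneinfo` module (the function name is illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def pacific_to_utc(naive_local: datetime) -> datetime:
    """Attach KEXP's Pacific zone to a naive timestamp, then convert to UTC."""
    return naive_local.replace(
        tzinfo=ZoneInfo("America/Los_Angeles")
    ).astimezone(ZoneInfo("UTC"))

play_time = datetime(2023, 2, 25, 21, 30)      # 9:30 PM Pacific, during PST
print(pacific_to_utc(play_time).isoformat())   # 2023-02-26T05:30:00+00:00
```

Using a named zone rather than a fixed offset lets the conversion handle daylight-saving transitions automatically.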
Timeliness tracks whether our data is up to date and lets us detect when it has gone stale. We may want to set up alerts in our data warehouse when data is stale. Timeliness is the operational aspect of the data warehouse, and we should understand several things about our data to assess it:
Is the data batch or streaming?
When does our data become stale?
What is the most current data our analytics processes need?
If our analytics processes only run monthly or weekly, we may be wasting computing time (money) to keep the data up to date daily. Operations generally create data in real time, but it is a mistake to build a data warehouse using operational data assumptions.
It is important to understand timeliness from the perspective of when the BI systems will actually use the data to answer business questions.
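A staleness check that reflects this perspective compares the newest loaded row against a freshness window derived from the analytics cadence, not from the operational system's real-time assumptions. This is a sketch under that assumption; the function name and thresholds are illustrative and should be tuned to your reporting schedule.

```python
from datetime import datetime, timedelta, timezone

def is_stale(latest_load: datetime, max_age: timedelta) -> bool:
    """True when the newest loaded row is older than the freshness window."""
    return datetime.now(timezone.utc) - latest_load > max_age

# A load that finished two days ago is fine for weekly reporting
# but stale for a daily dashboard.
latest = datetime.now(timezone.utc) - timedelta(days=2)
print(is_stale(latest, timedelta(days=7)))  # False
print(is_stale(latest, timedelta(days=1)))  # True
```

Wiring this check into a scheduled job gives you the stale-data alerts mentioned above without paying for daily refreshes that no analytics process consumes.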