The Author rooting for his non-Seattle Team
At some point in my adult life, embarrassingly early, I might add, I realized I am not cool anymore: I don't listen to cool music, I don't stay out past 10 PM, and the frumpy t-shirts I like to wear accentuate my belly more than my muscles. For the belly, I like crunching data in front of my computer way more than crunching in the gym, so that's not going to change. For the 10 PM bedtime, ah, I like my sleep. On the cool music front, however, I figured I could fix that by finding a cool radio station and building an analytics pipeline to do some digging into what's going on in Seattle - the coolest city in the 21st-century world.
That's when I discovered KEXP, and it has become my soundtrack for all things data. I wanted to write this article during the recent pledge drive, but setting up a data pipeline takes a bit of time. In this article, I want to detail how to build a data analytics pipeline from the top down, so that you can see how to realize a business goal (cool music) by connecting three powerful software platforms. If you're not technical and just want to see the data, skip to the end of this article, because it will get a bit techy.
Tableau for Data Analytics
Snowflake for Data Warehouse
AWS for Data Lake and Pipeline
AWS
Understanding a data feed starts with documentation, perhaps a data dictionary or an API endpoint. An internet radio station is an amazing data source in and of itself, and KEXP has a brilliant API. That API can be used to download both historical and current data into our basic infrastructure component: AWS.
The S3 Data Lake Bucket
The fundamental data storage for AWS is the S3 Bucket, which is a bit like a Google Drive for AWS. This article is designed to be light, but it also has the details one could use to build a data pipeline. Note that everything described here is public because I'm writing an article, not doing sensitive proprietary work, so be sure to use private resources and internal networks when building similar data pipelines.
Create the Bucket with CloudFormation
Write a Lambda Function and Layer Backend to Sync Playlist Data
Schedule the pipeline through AWS Step Functions
The Step Function / Lambda Data Pipeline in AWS
This isn't a tutorial so much as an overview, but the end result of building the pipeline is a scheduled job that synchronizes the current KEXP Playlist and KEXP Shows endpoints to an S3 bucket every half hour for analytics.
The Cool AWS Step Functions interface shows that a run of the job requests data after "2021-10-11T06:40:46-0700" and before "2021-10-11T06:43:03-0700". This is a surprisingly subtle and important aspect of building a dynamic data pipeline: localizing the date values.
In this case, I am in New England, timezone -0400, and the playlist data is from Seattle, timezone -0700. The data will need to change hands from AWS to Snowflake, and finally to Tableau.
datetime_format_api = "%Y-%m-%dT%H:%M:%S%z"  # API timestamps, e.g. 2021-10-11T06:40:46-0700
datetime_format_lake = "%Y%m%d%H%M%S"  # compact integer-style timestamp key used to track files in the Data Lake
The Data Lake (S3) tracks the data using the integer timestamp key, which is useful for finding the most recent or the oldest playlist set, while the API tracks the data in the Pacific timezone. Tableau will display the data to me based on my browser setting, which is Eastern.
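All three representations refer to the same instant, and once the data lands in the warehouse (introduced in the next section) the offsets are easy to sanity check. A minimal sketch, not from the project, just a demonstration using the literal airdate from the run above:

select
    to_timestamp_tz('2021-10-11T06:40:46-0700', 'YYYY-MM-DD"T"HH24:MI:SSTZHTZM') as airdate_seattle,
    convert_timezone('UTC', airdate_seattle) as airdate_utc,
    convert_timezone('America/New_York', airdate_seattle) as airdate_eastern;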
The Snowflake Data Warehouse
Snowflake is the current powerhouse Data Warehouse solution because they built a fantastic platform that uses the cloud beautifully, scales, and integrates easily into enterprise settings. Snowflake is enterprise level and it's priced for the enterprise, so even though you pay for only the time you process data, you pay an enterprise rate, not a developer (cheapo!) rate. So if you are using Snowflake as a developer, make sure you set up a Resource Monitor to avoid an unpleasant surprise on your bill.
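A Resource Monitor takes only a couple of statements to set up. A minimal sketch, assuming a warehouse named KEXP_WH and a five-credit monthly budget (both are placeholders of mine, pick your own):

-- requires the ACCOUNTADMIN role
create resource monitor kexp_dev_monitor
  with credit_quota = 5
  frequency = monthly
  start_timestamp = immediately
  triggers
    on 75 percent do notify
    on 100 percent do suspend;

-- attach the monitor so the warehouse suspends once the budget is spent
alter warehouse kexp_wh set resource_monitor = kexp_dev_monitor;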
The power of Snowflake is the Stage, which points directly at cloud storage. Creating a stage allows you to turn a Data Lake into a Data Warehouse. Creating the storage integration manually is documented by Snowflake, and it can be scripted by following these steps (the key statements are sketched after the list).
Create a Makefile to use local environment variables with SnowSQL
Create a role in AWS linking your Snowflake instance to the Data Lake
Create the Storage Integration linking the AWS role with your Snowflake account
Update the role in AWS with the specific credentials of the Storage Integration
Create as many stages as you like with the storage integration
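The heart of the storage integration and stage steps is just two create statements plus a describe. A sketch, where the integration name, role ARN, and bucket are placeholders, and the stage name and JSON file format match the ones used in the queries below:

create storage integration kexp_s3_integration
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::123456789012:role/snowflake-kexp-role'
  storage_allowed_locations = ('s3://my-kexp-lake/stage/kexp/');

-- describe the integration to get the IAM user ARN and external ID
-- that go back into the AWS role's trust policy
desc integration kexp_s3_integration;

create stage owlmtn.stage.KEXP_PUBLIC
  storage_integration = kexp_s3_integration
  url = 's3://my-kexp-lake/stage/kexp/'
  file_format = (format_name = 'stage.json_file_format');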
The stage is also a powerful tool for querying the pipeline's log files directly on the Data Lake with a select statement:
select
    $1:airdate_after_date airdate_after_date,
    $1:airdate_before_date airdate_before_date,
    $1:number_songs::INT NUMBER_SONGS,
    $1:playlist_key::STRING PLAYLIST_KEY,
    $1:shows_key::STRING SHOWS_KEY,
    metadata$filename filename
from @owlmtn.stage.KEXP_PUBLIC (
    pattern => 'stage/kexp/logs/.*',
    file_format => stage.json_file_format) stg
order by airdate_before_date desc;
The COPY statement is a powerful way to bring data into the Data Warehouse; a sketch of the playlist copy follows the list.
Copy the Playlist API extract from S3
Copy the Shows API extract from S3
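Here the target table and the JSON attribute names are illustrative (check them against the API payload), while the stage and file format are the ones defined above; the shows copy is the same pattern pointed at the shows files:

copy into kexp.playlist_plays (airdate, artist, song, album, show_id, filename)
from (
    select
        $1:airdate::timestamp_tz,
        $1:artist::string,
        $1:song::string,
        $1:album::string,
        $1:show::int,
        metadata$filename
    from @owlmtn.stage.KEXP_PUBLIC
)
pattern = 'stage/kexp/playlists/.*'
file_format = (format_name = 'stage.json_file_format');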
Generally, in an enterprise environment, the analytics tool (Tableau, QuickSight, Metabase, ...) will access the data warehouse directly and you can explore your data with a live connection. However, since this is an article for public usage, we export the data.
Additionally, the COPY statement is powerful for generating a data extract to be used downstream (sketched after the list):
Create a VIEW to join the playlist and shows table together
Copy the VIEW out to the Data Lake as an export
Pull the export down for use (Developer mode only)
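Roughly, the first two steps translate to a view and a COPY into the stage; the view name, join columns, and export path here are illustrative:

create or replace view kexp.playlist_shows_vw as
select
    p.airdate,
    p.artist,
    p.song,
    p.album,
    s.program_name,
    s.host_names
from kexp.playlist_plays p
join kexp.shows s on p.show_id = s.show_id;

-- Tableau Public wants a single, uncompressed CSV with a header row
copy into @owlmtn.stage.KEXP_PUBLIC/exports/kexp_playlist_shows.csv
from (select * from kexp.playlist_shows_vw)
file_format = (type = csv field_optionally_enclosed_by = '"' compression = none)
header = true
single = true
overwrite = true;

The last step is an aws s3 cp from the bucket down to my laptop, since Snowflake's GET command only works against internal stages.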
The COPY out to CSV here is less than ideal, because our consumer is Tableau Public, which only supports plain CSV files. It is worth it, though, because I want to provide access to the data through this AWESOME free service!
Visualize the Data in Tableau
Tableau is one of many data analytics tools. It stands out because you can create beautiful analytics visualizations, and it has a free public offering (Tableau Public) that works well for communicating the power of data analytics.
The public link is KEXP DJ Favorites.
Even with a simple data set of two tables - Playlists and Shows - it is just plain fun to explore and ask questions like:
What does my favorite DJ play most often?
What songs were most popular when Trump was being impeached (Arrest the President)?
What are some of the great songs by my favorite artist?
See for yourself. The link is public.
Conclusion
Data is powerful, and it's easier to build data pipelines when you understand the data. I listened to KEXP while building this data pipeline, and it made debugging easy, because I could check the important things a pipeline must do (each check is a single query, sketched after the list).
Does it have the most recent data?
Is my data duplicated?
Do I have gaps in the data?
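Each of those checks is one query against the warehouse. A sketch, assuming the illustrative playlist table from the COPY section:

-- 1. freshness: when was the last play captured?
select max(airdate) as most_recent_play
from kexp.playlist_plays;

-- 2. duplicates: the same play landing twice from overlapping sync windows
select airdate, artist, song, count(*) as copies
from kexp.playlist_plays
group by airdate, artist, song
having count(*) > 1;

-- 3. gaps: unusually long stretches between consecutive plays can flag a missed sync
select
    airdate,
    lag(airdate) over (order by airdate) as previous_airdate,
    datediff('minute', lag(airdate) over (order by airdate), airdate) as minutes_since_previous
from kexp.playlist_plays
qualify datediff('minute', lag(airdate) over (order by airdate), airdate) > 30
order by airdate;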
The entire Bitbucket source is a good starter for someone interested in seeing how to build out a data pipeline using build scripts, and it showcases two powerful tools: CloudFormation and SnowSQL.
If you enjoyed this article, please check out the radio station KEXP. I think you will find some amazing music.