Tim Burns

New Music Through Better Data Analytics


The Author rooting for his non-Seattle Team


At some point in my adult life, embarrassingly early, I might add, I realized I am not cool anymore: I don't listen to cool music, I don't stay out past 10 PM, and the frumpy t-shirts I like to wear accent my belly more than my muscles. For the belly, I like crunching data in front of my computer way more than crunching in the gym, so that's not going to change. As for the 10 PM bedtime, ah, I like my sleep. On the cool music front, however, I figured I could fix things by finding a cool radio station and building an analytics pipeline to dig into what's going on in Seattle - the coolest city in the 21st-century world.


That's when I discovered KEXP, and it is now my soundtrack for all things data. I wanted to publish this article during the recent pledge drive, but setting up a data pipeline takes a bit of time. In this article, I want to detail how to build a data analytics pipeline from the top down, so that you can see how to realize a business goal (cool music) by connecting three powerful software platforms. If you're not technical and just want to see the data, skip to the end of this article, because it will get a bit techy.

  1. Tableau for Data Analytics

  2. Snowflake for Data Warehouse

  3. AWS for Data Lake and Pipeline

AWS

Understanding a data feed starts with documentation, perhaps a data dictionary or an API endpoint. An internet radio station is an amazing data endpoint in and of itself, and KEXP has a brilliant API interface. That interface can be used to download both historical and current data to our basic infrastructural component: AWS.


The S3 Data Lake Bucket

The fundamental data storage for AWS is the S3 bucket, which is a bit like a Google Drive for AWS. This article is designed to be light, but it also has the details one could use to build a data pipeline. Note that everything described here is public because I'm writing an article, not doing sensitive proprietary work, so be sure to use private resources and internal networks when building similar data pipelines.

  1. Create the Bucket with Cloud Formation

  2. Write a Lambda Function and Layer Backend to Sync Playlist Data

  3. Schedule the pipeline through AWS Step Functions
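Step 1 might look something like the following minimal CloudFormation template. This is an illustrative sketch, not the article's actual template: the bucket name is a placeholder, and I've added a public-access block as a sensible default.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Data lake bucket for the KEXP playlist pipeline (illustrative)
Resources:
  KexpDataLakeBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-kexp-data-lake   # placeholder; bucket names are globally unique
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
```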


The Step Function / Lambda Data Pipeline in AWS

This isn't a tutorial as much as an overview, but the end result of building the pipeline is a scheduled job that will synchronize the current KEXP Playlist and KEXP Shows endpoints to an S3 bucket every half hour for analytics.


The cool AWS Step Functions interface shows that the running job requests data after "2021-10-11T06:40:46-0700" and before "2021-10-11T06:43:03-0700." This is a surprisingly subtle and important aspect of building a dynamic data pipeline: localizing the date values.


In this case, I am in New England, timezone -0400, and the playlist data is from Seattle, timezone -0700. The data will need to change hands from AWS to Snowflake, and finally to Tableau.

datetime_format_api = "%Y-%m-%dT%H:%M:%S%z"
datetime_format_lake = "%Y%m%d%H%M%S"

The Data Lake (S3) will track the data using an integer timestamp key, which is useful for getting the most recent or the oldest playlist set, while the API tracks the data in the Pacific timezone. Tableau will display the data to me based on my browser setting, which is Eastern time.
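As a minimal sketch of how the two formats above fit together (the helper name is mine, not from the pipeline), an API airdate can be parsed with its offset and rewritten as a sortable UTC lake key:

```python
from datetime import datetime, timezone

datetime_format_api = "%Y-%m-%dT%H:%M:%S%z"   # KEXP API airdate, with offset
datetime_format_lake = "%Y%m%d%H%M%S"         # sortable integer-style S3 key

def lake_key(airdate: str) -> str:
    """Parse an offset-aware API airdate and emit an unambiguous UTC key."""
    dt = datetime.strptime(airdate, datetime_format_api)
    return dt.astimezone(timezone.utc).strftime(datetime_format_lake)

# A Seattle (-0700) airdate becomes a UTC key, so keys from any timezone sort together
print(lake_key("2021-10-11T06:40:46-0700"))  # 20211011134046
```

Normalizing to UTC at the lake layer keeps the keys sortable no matter where the pipeline runs; the Pacific-to-Eastern shift only happens at display time.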


The Snowflake Data Warehouse

Snowflake is the current powerhouse Data Warehouse solution because they built a fantastic platform that uses the cloud beautifully, scales, and integrates easily into enterprise settings. Snowflake is enterprise-level and priced for the enterprise: even though you pay only for the time you spend processing data, you pay an enterprise rate, not a developer (cheapo!) rate. So if you are using Snowflake as a developer, make sure you set up a Resource Monitor to avoid an unpleasant bill surprise.
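A resource monitor for a developer account might look like this sketch (the monitor name, warehouse name, and quota are placeholders, not from the article's setup):

```sql
-- Illustrative: cap a dev warehouse at 5 credits per month
CREATE RESOURCE MONITOR dev_monitor
  WITH CREDIT_QUOTA = 5
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND_IMMEDIATE;

ALTER WAREHOUSE dev_wh SET RESOURCE_MONITOR = dev_monitor;
```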


The power of Snowflake is the Stage for cloud platforms. Creating a stage allows you to turn a Data Lake into a Data Warehouse. Snowflake documents how to create the storage integration manually, and it can be scripted by following these steps.

  1. Create a Makefile to use local environment variables with SnowSQL

  2. Create a role in AWS linking your Snowflake instance to the Data Lake

  3. Create the Storage Integration linking the AWS role with your Snowflake object

  4. Update the role in AWS with the specific credentials of the Storage integration

  5. Create as many stages as you like with the storage integration
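Steps 2 through 5 can be sketched in SnowSQL like this. The role ARN, bucket name, and integration name are placeholders; the stage name matches the one queried below.

```sql
-- Illustrative names; the role ARN and bucket come from your AWS account
CREATE STORAGE INTEGRATION kexp_s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-kexp'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-kexp-data-lake/stage/kexp/');

-- Copy STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID from this
-- output back into the AWS role's trust policy (step 4 above)
DESC INTEGRATION kexp_s3_int;

CREATE STAGE owlmtn.stage.KEXP_PUBLIC
  URL = 's3://my-kexp-data-lake/stage/kexp/'
  STORAGE_INTEGRATION = kexp_s3_int;
```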

The stage is a powerful tool for tracking logs directly on the Data Lake with a simple select:

select
       $1:airdate_after_date airdate_after_date,
       $1:airdate_before_date airdate_before_date,
       $1:number_songs::INT NUMBER_SONGS,
       $1:playlist_key::STRING PLAYLIST_KEY,
       $1:shows_key::STRING SHOWS_KEY,
       metadata$filename                filename
from @owlmtn.stage.KEXP_PUBLIC (
         pattern =>'stage/kexp/logs/.*',
         file_format => stage.json_file_format) stg
order by airdate_before_date desc;

The COPY statement is a powerful way to bring data into the Data Warehouse.

  1. Copy the Playlist API extract from S3

  2. Copy the Shows API extract from S3


Generally, in an enterprise environment, the analytics tool (Tableau, QuickSight, Metabase, ...) will access the data warehouse directly, and you can explore your data with a live connection. However, since this is an article for public usage, we export the data.

Additionally, the COPY statement is powerful for generating a data extract to be used downstream:

  1. Create a VIEW to join the playlist and shows table together

  2. Copy the VIEW out to the Data Lake as an export

  3. Pull the export down for use (Developer mode only)
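Step 2 might look like this unload sketch; the view name and export path are placeholders for whatever the join in step 1 produces:

```sql
-- Illustrative: unload the joined view as a single CSV with a header for Tableau Public
COPY INTO @owlmtn.stage.KEXP_PUBLIC/export/kexp_dj_favorites.csv
  FROM (SELECT * FROM stage.vw_playlist_shows)
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' COMPRESSION = NONE)
  HEADER = TRUE
  SINGLE = TRUE
  OVERWRITE = TRUE;
```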

The COPY out here is less than ideal, because our consumer is Tableau Public, which only supports plain CSV files. It is still worth it, because I want to provide access to the data through this AWESOME free service!


Visualize the Data in Tableau

Tableau is one of many data analytics tools. It stands out because you can create beautiful analytics visualizations, and it has a free public offering that works well for communicating the power of data analytics.


The public link is KEXP DJ Favorites.


Even with a simple data set of two tables - Playlists and Shows - it is just plain fun to explore and ask questions like:


  1. What does my favorite DJ play most often?

  2. What songs were most popular when Trump was being impeached (Arrest the President)?

  3. What are some of the great songs of my favorite artist?

See for yourself. The link is public.


Conclusion

Data is powerful, and it's easier to build data pipelines when you understand the data. I listened to KEXP while building this pipeline, and it made debugging easy because I could check the important things a pipeline must do.

  1. Does it have the most recent data?

  2. Is my data duplicated?

  3. Do I have gaps in the data?
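The three checks above can be sketched against the lake's timestamp keys. The helper below is mine, not from the article's source; it assumes the 30-minute sync cadence and the YYYYMMDDHHMMSS key format described earlier.

```python
from datetime import datetime, timedelta

def check_playlist_keys(keys, expected_gap=timedelta(minutes=30)):
    """Given lake keys (YYYYMMDDHHMMSS strings), flag duplicates and gaps."""
    parsed = sorted(datetime.strptime(k, "%Y%m%d%H%M%S") for k in keys)
    duplicates = len(parsed) != len(set(parsed))
    # Allow 50% scheduling jitter before calling a missing run a gap
    gaps = [(a, b) for a, b in zip(parsed, parsed[1:])
            if b - a > expected_gap * 1.5]
    return {"duplicates": duplicates, "gaps": gaps}

report = check_playlist_keys(
    ["20211011060000", "20211011063000", "20211011080000"]
)
# The 06:30 -> 08:00 jump exceeds the 45-minute tolerance, so one gap is flagged
```

Recency is then just a comparison of the newest key against the current time, which is why keeping the keys in a single timezone (as discussed above) matters.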

The entire Bitbucket source is a good starter for someone interested in seeing how to build out a data pipeline using build scripts, and it showcases two powerful tools: CloudFormation and SnowSQL.


If you enjoyed this article, please check out the radio station KEXP. I think you will find some amazing music.

