• Tim Burns

A Checklist for Building an ML Model in Spark

Often the most effective method of understanding data is creating a checklist. As a data practitioner, I use checklists extensively to achieve consistent results when building data projects.

As this is an advanced article, I am assuming that you have started a Spark session, created a SQL context, and have a data frame for the data. The appendix of this article contains the full script.

In my checklist, I like to include a mix of traditional SQL analysis, the powerful Matplotlib tool, and Jupyter Notebook to provide a lightweight and portable analysis environment.

To illustrate how these tools can be used together, I am going to review LogisticRegression from the Spark ML package as described in Spark: The Definitive Guide in chapter 25 "MLlib in Action" starting on page 686.

As a practitioner, I like to apply a set of steps to data in order to understand it.

Check 1: Create A Data Dictionary

Creating a data dictionary is the first step any practitioner should take when understanding new data.

The data dictionary should be simple, concise, and should define the column names, data types, and descriptions of each column.

The Data Dictionary is the simplest and most effective starting point for understanding any data.
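As a sketch, a data dictionary for the simple-ml data set used later in this article might look like the following. The column names come from the example data; the types and descriptions are my own illustrations.

```python
# Illustrative data dictionary for the simple-ml data set.
# Column names match the example data; descriptions are assumptions.
data_dictionary = {
    "color":  {"type": "string", "description": "Categorical color (green, blue, or red)"},
    "lab":    {"type": "string", "description": "The label we want to predict (good or bad)"},
    "value1": {"type": "int",    "description": "A numeric feature"},
    "value2": {"type": "double", "description": "A numeric feature"},
}

# Print the dictionary in a readable form.
for name, meta in data_dictionary.items():
    print(f"{name}: {meta['type']} - {meta['description']}")
```

Even a plain dictionary like this, checked into the project alongside the code, gives everyone a shared reference for what each column means.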

Check 2: Create a Context to Access your Data Interactively

This article is on Spark, but Spark is only one of many contexts you can use to access data interactively. The context is essentially your database connection, and this article shows how Jupyter Notebooks enable analytical functions that database clients do not have.

from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(appName="Spark Guide")
sql_context = SQLContext(sc)

Sort the data by the metrics to get an idea of how it looks.

# spark_guide is the path to the book's example data
simple_df = f"{spark_guide}/simple-ml")

Sorting and viewing the edges of the data is especially important with very large data sets, where you cannot practically look at every row.

Check 3: Use SQL to Summarize the Data

SQL is the lingua franca of data analysis. Any data project that involves multiple technologies will have the SQL in common. The basic SQL statements on the data are as important as the data dictionary.

The size of the data and the counts of distinct values offer many clues into the nature of the data. For each column, get the count, min, and max.

# Register the data frame as a view so we can query it with SQL
simple_df.createOrReplaceTempView("simple_ml")

result = sql_context.sql("""
select count(*) total_count,
       count(distinct color) color_count,
       count(distinct lab) lab_count,
       count(distinct value1) value1_count,
       count(distinct value2) value2_count
  from simple_ml
""")

In this case, we have a small data set with 3 distinct colors, 2 distinct labels, 8 distinct value1 values, and 2 distinct value2 values.

Check 4: Build out the RFormula for Transform and Fit

Transforming the data will create the columns we need to train our model. To understand the transformation, each of the analysis techniques above is useful.

Transform, Fit, and create a SQL View

Spark ML models only operate on numeric feature data of type Double, so the RFormula transformation encodes our columns into a numeric features vector and a label column. Starting with a simple model:

from import RFormula

supervised = RFormula(formula="lab ~ . + color:value1")
fitted_rf =
prepared_df = fitted_rf.transform(simple_df)

The RFormula notation says that we want to predict the lab column based on all of the other columns.

The other columns are mapped into the feature vector, while the lab column is excluded from the features and assigned to the label.

lab modeled as all columns: lab ~ .

[1,0] = green

[0,1] = blue

[0,0] = red
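The mapping above is standard one-hot (dummy) encoding with the last category dropped, which is what RFormula does for categorical columns. A pure-Python sketch, assuming the category index order green, blue, red:

```python
# One-hot encoding with the last category implicit (all zeros).
# The category order here is assumed for illustration.
CATEGORIES = ["green", "blue", "red"]

def one_hot(color):
    vec = [0] * (len(CATEGORIES) - 1)  # last category drops out
    idx = CATEGORIES.index(color)
    if idx < len(vec):
        vec[idx] = 1
    return vec

print(one_hot("green"))  # [1, 0]
print(one_hot("blue"))   # [0, 1]
print(one_hot("red"))    # [0, 0]
```

Dropping the last category avoids redundant columns: if the vector is all zeros, the color must be red.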

We then concatenate in the interaction between color and value1:

lab modeled as lab ~ . + color:value1

lab modeled as lab ~ . + color:value2

Putting them together, we have lab modeled as lab ~ . + color:value1 + color:value2
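To make the interaction terms concrete, here is a pure-Python sketch: an interaction like color:value1 multiplies each element of the one-hot color vector by the numeric column value. The names and values below are illustrative, not taken from Spark internals.

```python
# Sketch of an interaction term: element-wise product of the one-hot
# color vector and a numeric column value. Names are illustrative.
def interaction(color_vec, value):
    return [component * value for component in color_vec]

green = [1, 0]  # one-hot vector for green
print(interaction(green, 3.0))  # [3.0, 0.0]
```

The resulting columns let the model learn a different slope for value1 within each color category.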

More on the vectorizing used to generate these feature matrices can be found in the Spark MLlib feature documentation.
