Tim Burns

A Checklist for Building an ML Model in Spark




Often the most effective method of understanding data is creating a checklist. As a data practitioner, I use checklists extensively to achieve consistent results when building data projects.


As this is an advanced article, I am assuming that you have started a Spark engine, created a SQL context, and have a DataFrame for the data. The appendix of this article contains the full script.


In my checklist, I like to include a mix of traditional SQL analysis, the powerful Matplotlib tool, and Jupyter Notebook to provide a lightweight and portable analysis environment.


To illustrate how these tools can be used together, I am going to review LogisticRegression from the Spark ML package as described in Spark: The Definitive Guide in chapter 25 "MLlib in Action" starting on page 686.


The checklist below is the set of steps I apply to any new data set in order to understand it.


Check 1: Create A Data Dictionary

Creating a data dictionary is the first step any practitioner should take when understanding new data.


The data dictionary should be simple and concise: the name, data type, and a short description of each column. It is the simplest and most effective starting point for understanding any data.
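As an example, here is a minimal data dictionary for the simple-ml data set used in the rest of this article (the descriptions are my own reading of the data):

column  | type    | description
color   | string  | the color category of the row (green, blue, or red)
lab     | string  | the label we want to predict ("good" or "bad")
value1  | integer | a numeric feature
value2  | double  | a numeric feature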


Check 2: Create a Context to Access your Data Interactively

This article is on Spark, but Spark is only one of many contexts you can use to access data interactively. The context is essentially your database connection, and running it inside a Jupyter Notebook gives you analytical functions that a plain database client does not have.

#%%
from pyspark import SparkContext
from pyspark.sql import SQLContext
#%%
# Start a local Spark context and wrap it in a SQL context for DataFrame and SQL work
sc = SparkContext(appName="Spark Guide")
sql_context = SQLContext(sc)
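If you are on Spark 2.0 or later, the same setup is more commonly done through a SparkSession, which bundles the SparkContext and SQLContext into a single entry point. A minimal equivalent looks like this:

#%%
from pyspark.sql import SparkSession

# SparkSession is the unified entry point in Spark 2+; spark.read and spark.sql
# replace the separate SQLContext calls used below.
spark = SparkSession.builder.appName("Spark Guide").getOrCreate()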

Load the data and sort it by one of the metrics to get an idea of how it looks.

#%%
# Path to the book's example data (adjust to the location of your local copy)
spark_guide = "./data/spark-guide"
simple_df = sql_context.read.json(f"{spark_guide}/simple-ml")
simple_df.orderBy("value1").show()

Sorting and viewing the edges of the data is especially important for very large data sets, because the extreme values at either end give you a quick feel for the range of the data and for any outliers.
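To look at the other edge of the data as well, sort in descending order, for example:

#%%
from pyspark.sql.functions import desc

# Show the rows with the largest value1 to inspect the top edge
simple_df.orderBy(desc("value1")).show(5)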


Check 3: Use SQL to Summarize the Data

SQL is the lingua franca of data analysis. Any data project that involves multiple technologies will have SQL in common. Basic SQL summaries of the data are as important as the data dictionary.


The size of the data and the counts of distinct values offer many clues about the nature of the data. For each column, get the distinct count, and for the numeric columns, the min and max.


#%%
# Register the DataFrame as a temp view so it can be queried with SQL
simple_df.createOrReplaceTempView("simple_ml")

result = sql_context.sql("""
select count(*)               as total_count,
       count(distinct color)  as distinct_colors,
       count(distinct lab)    as distinct_labels,
       count(distinct value1) as distinct_value1,
       min(value1)            as min_value1,
       max(value1)            as max_value1,
       count(distinct value2) as distinct_value2,
       min(value2)            as min_value2,
       max(value2)            as max_value2
  from simple_ml
""")
result.show()

In this case, we have a small data set with 3 distinct colors, 2 labels, 8 distinct values for value1, and 2 distinct values for value2.
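The DataFrame API also gives you a quick shortcut for the same kind of summary:

#%%
# count, mean, stddev, min, and max for every column
simple_df.describe().show()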


Check 4: Build out the RFormula for Transform and Fit

Transforming the data creates the features and label columns we can use to train a model. Each of the analysis techniques above is useful for understanding the transformation.


Transform, Fit, and Create a SQL View

Spark MLlib expects features and labels of type Double, so we will use an RFormula to add a numeric features vector and a label column to our data set, starting with a simple model:

#%%
from pyspark.ml.feature import RFormula

# Fit the formula to learn the column transformations, then transform the data
supervised = RFormula(formula="lab ~ . + color:value1")
fitted_rf = supervised.fit(simple_df)
prepared_df = fitted_rf.transform(simple_df)

# Register a view so the prepared data can also be explored with SQL
prepared_df.createOrReplaceTempView("prepared_df")

The RFormula notation says that we want to predict lab based on all of the other columns, plus an interaction term between color and value1.



The remaining columns are mapped into the features vector, while the lab column is excluded from the features and assigned to the label.


lab ~ . means lab modeled as all of the other columns, with color one-hot encoded into two positions:


[1,0] = green

[0,1] = blue

[0,0] = red
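You can see these encoded positions directly in the features vector of the transformed DataFrame, for example:

#%%
# Inspect the raw color next to the assembled features vector and numeric label
prepared_df.select("color", "features", "label").show(5, truncate=False)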

The : operator concatenates in the interaction between color and value1:


lab ~ . + color:value1 models lab with all columns plus the interaction between color and value1

lab ~ . + color:value2 models lab with all columns plus the interaction between color and value2



Putting them together, we have lab modeled as lab ~ . + color:value1 + color:value2.
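To close the loop on the checklist, here is a sketch of preparing the data with the combined formula and fitting a LogisticRegression model, along the lines of the book's chapter 25 example (the 70/30 split is my own choice):

#%%
from pyspark.ml.classification import LogisticRegression

# Prepare the data with the combined formula
supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
prepared_df = supervised.fit(simple_df).transform(simple_df)

# Hold out a test set, then fit a logistic regression on the training data
train, test = prepared_df.randomSplit([0.7, 0.3])
lr = LogisticRegression(labelCol="label", featuresCol="features")
fitted_lr = lr.fit(train)

# Check the predictions against the held-out labels
fitted_lr.transform(test).select("label", "prediction").show(5)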



More on the vectorization used to generate these feature vectors here:

