Tim Burns

Building an ML Model - The PySpark Pipeline


I shouldn't be surprised that building an ML Model is a topic requiring several blog posts and significant study.


For example, let's look deeper into fitting a model.


Here I have a small dataset of Cape Cod birds I notice when walking.


#%%

# sql_context is an existing SQLContext; a SparkSession works the same way
data = ["heron", "osprey", "plover", "heron", "plover", "osprey"]
json_data = [{"id": idx, "category": bird} for idx, bird in enumerate(data)]

cape_cod_birds_df = sql_context.createDataFrame(json_data)

cape_cod_birds_df.show()

+--------+---+
|category| id|
+--------+---+
|   heron|  0|
|  osprey|  1|
|  plover|  2|
|   heron|  3|
|  plover|  4|
|  osprey|  5|
+--------+---+

Fitting a StringIndexer on the string column maps each label to a double index for numerical processing; by default the most frequent label gets 0.0, with ties broken alphabetically.


#%%

from pyspark.ml.feature import StringIndexer
category_indexer = \
    StringIndexer(inputCol="category", outputCol="category_index")

fitted_model = category_indexer.fit(cape_cod_birds_df)
bird_df_model = fitted_model.transform(cape_cod_birds_df)

bird_df_model.show()

+--------+---+--------------+
|category| id|category_index|
+--------+---+--------------+
|   heron|  0|           0.0|
|  osprey|  1|           1.0|
|  plover|  2|           2.0|
|   heron|  3|           0.0|
|  plover|  4|           2.0|
|  osprey|  5|           1.0|
+--------+---+--------------+

So heron=0.0, osprey=1.0, and plover=2.0.
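That ordering isn't arbitrary: StringIndexer's default `stringOrderType` is `frequencyDesc`, with ties broken alphabetically (all three birds appear twice here, so alphabetical order decides). The rule can be sketched in plain Python — an illustration of the mapping, not PySpark's implementation:

```python
from collections import Counter

def string_indexer_mapping(values):
    """Mimic StringIndexer's default ordering: most frequent label
    gets 0.0, ties broken alphabetically (frequencyDesc)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

data = ["heron", "osprey", "plover", "heron", "plover", "osprey"]
print(string_indexer_mapping(data))
# {'heron': 0.0, 'osprey': 1.0, 'plover': 2.0}
```

On the fitted model itself, the `labels` attribute exposes the same ordering, with `labels[i]` being the string assigned index `i`.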


Adding a habitat column gives the following data frame.


#%%
habitat = ["marshes", "lakes and ponds", "shorelines", "marshes", "shorelines", "lakes and ponds"]

json_data = [
    {"id": idx, "category": bird, "habitat": hab}
    for idx, (bird, hab) in enumerate(zip(data, habitat))
]

birds_df = sql_context.createDataFrame(json_data)
birds_df.show()

+--------+---------------+---+
|category|        habitat| id|
+--------+---------------+---+
|   heron|        marshes|  0|
|  osprey|lakes and ponds|  1|
|  plover|     shorelines|  2|
|   heron|        marshes|  3|
|  plover|     shorelines|  4|
|  osprey|lakes and ponds|  5|
+--------+---------------+---+

Remove the id column to get a data frame that contains only the interesting features (`birds_df.drop('id')` is equivalent).


#%%
features = birds_df.columns
features.remove('id')
birds_feature_df = birds_df.select(features)

birds_feature_df.show()

+--------+---------------+
|category|        habitat|
+--------+---------------+
|   heron|        marshes|
|  osprey|lakes and ponds|
|  plover|     shorelines|
|   heron|        marshes|
|  plover|     shorelines|
|  osprey|lakes and ponds|
+--------+---------------+

The pipeline comes in handy when we want to create an index for both of these features. Pipeline takes a list of transformers via its "stages" parameter and applies each one in order. Here we add the category indexer first, then the habitat indexer, and get both indexed columns.


#%%
from pyspark.ml import Pipeline

feature_indexer = [
    StringIndexer(inputCol=column, outputCol=f"{column}_index")
    for column in birds_feature_df.columns
]

pipeline = Pipeline(stages=feature_indexer)

bird_df = pipeline.fit(birds_feature_df).transform(birds_feature_df)

bird_df.show()

+--------+---------------+--------------+-------------+
|category|        habitat|category_index|habitat_index|
+--------+---------------+--------------+-------------+
|   heron|        marshes|           0.0|          1.0|
|  osprey|lakes and ponds|           1.0|          0.0|
|  plover|     shorelines|           2.0|          2.0|
|   heron|        marshes|           0.0|          1.0|
|  plover|     shorelines|           2.0|          2.0|
|  osprey|lakes and ponds|           1.0|          0.0|
+--------+---------------+--------------+-------------+
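Note that `habitat_index` follows the same rule as before: every habitat appears twice, so alphabetical tie-breaking puts "lakes and ponds" at 0.0, "marshes" at 1.0, and "shorelines" at 2.0. The fit-each-stage-then-transform flow the pipeline performs can be sketched in plain Python — an illustration of the mechanics, not PySpark's implementation, and the helper names are mine:

```python
from collections import Counter

def fit_column(values):
    """One 'StringIndexer': frequency-descending order, alphabetical tie-break."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

def fit_stages(rows, columns):
    """Fit one mapping per column, like Pipeline(stages=[...]).fit(df)."""
    return {col: fit_column([row[col] for row in rows]) for col in columns}

def transform(rows, mappings):
    """Add a <col>_index key to every row, like PipelineModel.transform."""
    return [
        {**row, **{f"{col}_index": m[row[col]] for col, m in mappings.items()}}
        for row in rows
    ]

data = ["heron", "osprey", "plover", "heron", "plover", "osprey"]
habitat = ["marshes", "lakes and ponds", "shorelines", "marshes", "shorelines", "lakes and ponds"]
rows = [{"category": b, "habitat": h} for b, h in zip(data, habitat)]

model = fit_stages(rows, ["category", "habitat"])
print(transform(rows, model)[0])
# {'category': 'heron', 'habitat': 'marshes', 'category_index': 0.0, 'habitat_index': 1.0}
```

In the real pipeline, each fitted StringIndexerModel is kept inside the PipelineModel (accessible via its `stages` attribute), so the same mappings can be reapplied to new data.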

My next post will continue building this model and adding complexity. I generally find that as I explore data, something fascinating emerges, and I have no doubt that when the model is complete, the birding data will yield some unexpected result.


