Tim Burns

# A Middle Manager Learns Spark

Photo by Author

As I dig into machine learning, I find my mathematical and numerical background makes me a perfect *candidate* as an expert. However, my time in middle management has left me sadly bereft of real Data Science skills.

Time to do what I've always done when in need of some knowledge: go to O'Reilly!

__https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/__

__https://github.com/databricks/Spark-The-Definitive-Guide.git__

It turns out the most important thing to do is establish a distance metric between data rows. I can trust my fellow Medium authors to write clear articles on specific tasks I want to accomplish.

__Feature Engineering in Pyspark__ by Dhiraj Rai

Curiously, following the guide above led me to an old favorite math book by one of the best teachers I ever had, __Roger Horn__. He's famous enough to have a Wikipedia page, and I took his Matrix Analysis class back when I was a math bum.

The Wear on the Text Block shows this book was Well-Loved

To quote:

What may be said about the "size" of matrices, which may be thought of as vectors in a higher-dimensional space? What about vectors in infinite-dimensional spaces? What about complex vectors? Are there useful ways to measure the "size" of real vectors other than by Euclidean length?

And so begins a crucial topic in machine learning: the distance between records in a database (vectors).

Spark itself exposes three common norms through the p parameter of the Normalizer class, which rescales each vector to unit length under the chosen norm.

Normalizer().setP(1)

Normalizer().setP(2)

Normalizer().setP(float("inf"))
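To make the three settings concrete, here is a minimal plain-Python sketch of the rescaling that each value of p implies: divide every entry of a row vector by the vector's p-norm. This is an illustration only, not Spark's implementation. The helper names `p_norm` and `normalize` are hypothetical, and leaving a zero vector unchanged is my own assumption for the sketch.

```python
import math


def p_norm(vector, p):
    """Return the p-norm of a sequence of numbers; p may be float('inf')."""
    if math.isinf(p):
        # Infinity norm: the largest absolute entry.
        return max(abs(x) for x in vector)
    # General p-norm: (sum of |x|^p) raised to the power 1/p.
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)


def normalize(vector, p=2.0):
    """Rescale a vector so that its p-norm is 1.

    Assumption for this sketch: a zero vector is returned unchanged.
    """
    norm = p_norm(vector, p)
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


row = [3.0, -4.0]
unit_l1 = normalize(row, 1)              # entries divided by 7
unit_l2 = normalize(row, 2)              # entries divided by 5
unit_inf = normalize(row, float("inf"))  # entries divided by 4
```

For the row `[3.0, -4.0]`, the three norms are 7, 5, and 4, so the same row comes out scaled differently depending on the value of p.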

So what are the norms?

Thinking in terms of column comparisons, each norm measures how close together two vectors are, based on the column-by-column differences between them.

Norm 1 is the Manhattan norm: the sum of the absolute differences across columns.

Norm 2 is the Euclidean norm: the square root of the sum of the squared differences across columns.

Norm-inf is the infinity norm: the largest absolute difference across columns.
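A small worked example helps me keep the three straight. The sketch below uses plain Python rather than Spark, and the function names and the two sample rows are made up for illustration.

```python
import math


def manhattan_distance(a, b):
    """Norm 1: sum of the absolute column differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Norm 2: square root of the sum of the squared column differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def infinity_distance(a, b):
    """Norm-inf: largest absolute column difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# Two hypothetical data rows with three numeric columns each.
row_a = [1.0, 2.0, 3.0]
row_b = [4.0, 6.0, 3.0]

d1 = manhattan_distance(row_a, row_b)    # 3 + 4 + 0 = 7.0
d2 = euclidean_distance(row_a, row_b)    # sqrt(9 + 16 + 0) = 5.0
d_inf = infinity_distance(row_a, row_b)  # max(3, 4, 0) = 4.0
```

The column differences here are 3, 4, and 0, so the same pair of rows is 7 apart under Norm 1, 5 apart under Norm 2, and 4 apart under Norm-inf.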