Distributed machine learning with Spark

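1. Application: Spam Filtering

|    | viagra | learning | the | dating | nigeria | spam?   |
|----|--------|----------|-----|--------|---------|---------|
| X1 | 1      | 0        | 1   | 0      | 0       | y1 = 1  |
| X2 | 0      | 1        | 1   | 0      | 0       | y2 = -1 |
| X3 | 0      | 0        | 0   | 0      | 1       | y3 = 1  |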

  • Instances X1, X2, X3 belong to the instance space X (the set of data points)

    • Binary or real-valued feature vector X of word occurrences

    • d features (words and other things, d is approximately 100,000)

  • Class Y

    • Spam = 1

    • Ham = -1
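The mapping from an email to a binary feature vector can be sketched in a few lines of Python. This is a minimal illustration, assuming a five-word vocabulary taken from the example above and a plain whitespace tokenizer. The function name `to_binary_features` and the sample message are made up for this sketch, and a real filter would use on the order of 100,000 features.

```python
# Hypothetical five-word vocabulary, taken from the words in the example above.
VOCABULARY = ["viagra", "learning", "the", "dating", "nigeria"]


def to_binary_features(text, vocabulary=VOCABULARY):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Made-up message, chosen so that it reproduces the X1 example.
x1 = to_binary_features("Buy viagra now, the best price")
print(x1)
```

A real-valued variant would store word counts or weights instead of 0/1 flags.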

2. Linear models for classification

  • Vector Xj contains real values

  • The goal is to find a vector $w = (w_1, w_2, \ldots, w_d)$, where each $w_j$ is a real number, such that:

    • The labeled points are clearly separated by a line.

  • Dot is spam, minus is ham!

3. Linear classifiers

  • Each feature i has a weight $w_i$

  • Prediction is based on the weighted sum: $f(x) = \sum_{i=1}^{d} w_i x_i = w \cdot x$

  • If f(x) is:

    • Positive: predict +1

    • Negative: predict -1
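The prediction rule above can be sketched as follows. The weights here are hypothetical values picked by hand so that the three example emails from the spam table are classified correctly; they are not learned from data. The text does not say what to do when the weighted sum is exactly zero, so this sketch arbitrarily predicts -1 in that case.

```python
def weighted_sum(w, x):
    """Compute f(x) = w . x for two equal-length lists of numbers."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w, x):
    """Predict +1 (spam) if f(x) is positive, otherwise -1 (ham)."""
    return 1 if weighted_sum(w, x) > 0 else -1


# Hypothetical weights for: viagra, learning, the, dating, nigeria.
w = [2.0, -1.5, -0.1, 0.5, 1.0]

examples = {
    "X1": [1, 0, 1, 0, 0],
    "X2": [0, 1, 1, 0, 0],
    "X3": [0, 0, 0, 0, 1],
}
predictions = {name: predict(w, x) for name, x in examples.items()}
print(predictions)
```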

4. Support Vector Machine

  • Originally developed by Vapnik and collaborators as a linear classifier.

  • Could be modified to support non-linear classification by mapping into high-dimensional spaces.

  • Problem statement:

    • We want to separate + from - using a line.

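    • Training examples: $(x_1, y_1), \ldots, (x_n, y_n)$.

    • Each example i: $x_i = (x_{i1}, \ldots, x_{id})$ with real-valued entries, and $y_i \in \{-1, +1\}$.

    • Inner product: $w \cdot x_i = \sum_{j=1}^{d} w_j x_{ij}$.

  • Which is the best linear separator defined by w?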

5. Support Vector Machine: largest margin

  • Distance from the separating line corresponds to the confidence of the prediction.

  • For example, we are more sure about the class of A and B than of C.

  • Margin definition: the margin $\gamma$ is the distance from the separating line to the closest training example.

  • Maximizing the margin while identifying w is good according to intuition, theory, and practice.

  • A math question: how do you narrate this equation?
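One standard way to write the maximum-margin objective is given below. It is a sketch rather than the lesson's own equation: the bias term $b$ and the unit-norm constraint on $w$ are assumptions not stated in the text above. It can be narrated as "choose the separator so that the worst-case margin over all training examples is as large as possible."

```latex
\max_{w,\, b,\, \gamma} \; \gamma
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge \gamma \;\; \text{for all } i,
\qquad \lVert w \rVert = 1
```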

6. Support Vector Machine: what is the margin?

  • Slide from the book.

  • Notation:

    • Gamma is the distance from point A to the linear separator L: d(A,L) = |AH|

    • If we select an arbitrary point M on line L, then d(A,L) is the length of the projection of the vector from M to A onto the vector w.

    • Project

    • If we assume the Euclidean norm of w, ‖w‖, is equal to one, that brings us to the result in the slide.

  • In other words, maximizing the margin is directly related to how w is chosen.

  • For the ith data point:
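The projection argument in the notation list can be written out as follows. This is a sketch assuming the separator L is the set of points $x$ with $w \cdot x + b = 0$; the bias $b$ is an assumed symbol that the text does not name. The last line gives the signed margin of the i-th data point, which is positive exactly when the point lies on the correct side of L.

```latex
\begin{aligned}
d(A, L) &= \frac{\lvert w \cdot (A - M) \rvert}{\lVert w \rVert}
         = \frac{\lvert w \cdot A - w \cdot M \rvert}{\lVert w \rVert}
         = \frac{\lvert w \cdot A + b \rvert}{\lVert w \rVert}
         && \text{since } w \cdot M + b = 0 \\
        &= \lvert w \cdot A + b \rvert
         && \text{when } \lVert w \rVert = 1 \\
\gamma_i &= y_i \,(w \cdot x_i + b)
         && \text{signed margin of the } i\text{-th data point}
\end{aligned}
```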

7. Some more math …

  • After some more mathematical manipulations, everything comes back to an optimization problem on w.
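For reference, the usual textbook form of that optimization problem is sketched below, assuming a bias term $b$ that the text does not name. If $w$ is rescaled so that the closest training examples satisfy $y_i (w \cdot x_i + b) = 1$, the margin equals $1 / \lVert w \rVert$, so maximizing the margin is the same as minimizing $\lVert w \rVert^2$.

```latex
\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i
```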

8. SVM: Non-linearly separable data

  • For each data point:

    • If the margin is at least 1, don’t care.

    • If the margin is less than 1, pay a linear penalty.

  • Introducing slack variables:
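A common way to express this is the hinge penalty: the slack of point i is $\max(0,\, 1 - y_i (w \cdot x_i + b))$, and the soft-margin objective adds the total slack, scaled by a constant $C$, to $\frac{1}{2}\lVert w \rVert^2$. The sketch below computes both for a handful of made-up 2-D points. The bias `b`, the constant `C`, the weights, and the data are all assumptions for illustration, not values from the lesson.

```python
def slack(w, b, x, y):
    """Hinge penalty: zero when y * (w . x + b) >= 1, linear otherwise."""
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def soft_margin_objective(w, b, points, labels, C=1.0):
    """Return 0.5 * ||w||^2 + C * (sum of slacks) for the given separator."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    penalty = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return regularizer + C * penalty


# Hypothetical separator and data, chosen to show each case.
w, b = [1.0, 0.0], 0.0
points = [[2.0, 1.0], [0.5, 1.0], [-1.0, 0.0], [-0.5, 2.0]]
labels = [1, 1, -1, 1]

slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
objective = soft_margin_objective(w, b, points, labels, C=1.0)
print(slacks, objective)
```

The first and third points have margin at least 1 and contribute nothing, the second sits inside the margin, and the fourth is misclassified and pays the largest penalty.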

9. Hands-on: SVM

$ conda activate pyspark
$ conda install -y pandas
$ jupyter notebook
  • Create a new notebook using the pyspark kernel, then change the notebook’s name to spark-7.

  • Copy the code from spark-6 to set up and launch a Spark application.

  • Documentation:

  • Question: Can you predict whether a client will subscribe to a term deposit (feature deposit)?

  • Problems:

    • What data format should the bank data be converted to?

    • How to handle categorical data?
