Contents

Distributed machine learning with Spark

Contents

Distributed machine learning with Spark#

1. Application: Spam Filtering#

	viagra	learning	the	dating	nigeria	spam?
X1	1	0	1	0	0	y₁ = 1
X2	0	1	1	0	0	Y₂ = -1
X3	0	0	0	0	1	y₃ = 1

Instance spaces X1, X2, X3 belong to set X (data points)
- Binary or real-valued feature vector X of word occurrences
- d features (words and other things, d is approximately 100,000)
Class Y
- Spam = 1
- Ham = -1

2. Linear models for classification#

Vector X_j contains real values
- The Euclidean norm is 1.
- Each vector has a label y_j
The goal is to find a vector W = (w₁, w₂, …, w_d) with w_j is a real number such that:
- The labeled points are clearly separated by a line: .
Dot is spam, minus is ham!

.

3. Linear classifiers#

Each feature i as a weight w_i
Prediction is based on the weighted sum:
If f(x) is:
- Positive: predict +1
- Negative: predict -1

.

4. Support Vector Machine#

Originally developed by Vapnik and collaborators as a linear classifier.
Could be modified to support non-linear classification by mapping into high-dimensional spaces.
Problem statement:
- We want to separate + from - using a line.
- Training examples: .
- Each example i: . .
- Inner product: .
Which is the best linear separate defined by w?

.

5. Support Vector Machine: largest margin#

Distance from the separating line corresponds to the confidence of the prediction.
For example, we are more sure about the class of A and B than of C. .
Margin definition: . .
Maximizing the margin while identifying w is good according to intuition, theory, and practice. .
A math question: how do you narrate this equation?

.

6. Support Vector Machine: what is the margin?#

Slide from the book .
Notation:
- Gamma is the distance from point A to the linear separator L: d(A,L) = |AH|
- If we select a random point M on line L, then d(A,L) is the projection of AM onto vector w.
- Project
- If we assume the normalized Euclidean value of w, |w|, is equal to one, that bring us to the result in the slide.
In other words, maximizing the margin is directly related to how w is chosen.
For the i^th data point:

.

7. Some more math …#

.

After some more mathematical manipulations: .
Everything comes back to an optimization problem on w: . .

8. SVM: Non-linearly separable data#

.

For each data point:
- If margin greater than 1, don’t care.
- If margin is less than 1, pay linear penalty.
Introducing slack variables:

.

.

9. Hands-on: SVM#

Download the set of inaugural speeches from https://www.cs.wcupa.edu/lngo/data/bank.csv
Activate the pyspark conda environment, install pandas, then launch Jupyter notebook

$ conda activate pyspark
$ conda install -y pandas
$ jupyter notebook

Create a new notebook using the pyspark kernel, then change the notebook’s name to spark-7.
Copy the code from spark-6 to setup and launch a Spark application.
Documentation:
Question: Can you predict whether a client will subscribe to a term deposit (feature deposit)?
Problems:
- What data should the bank data be converted to?
- How to handle categorical data?

{% include links.md %}