Distributed machine learning with Spark
Contents
Distributed machine learning with Spark#
1. Application: Spam Filtering#
viagra |
learning |
the |
dating |
nigeria |
spam? |
|
---|---|---|---|---|---|---|
X1 |
1 |
0 |
1 |
0 |
0 |
y1 = 1 |
X2 |
0 |
1 |
1 |
0 |
0 |
Y2 = -1 |
X3 |
0 |
0 |
0 |
0 |
1 |
y3 = 1 |
Instance spaces X1, X2, X3 belong to set X (data points)
Binary or real-valued feature vector X of word occurrences
d
features (words and other things, d is approximately 100,000)
Class Y
Spam = 1
Ham = -1
2. Linear models for classification#

Vector Xj contains real values
The Euclidean norm is
1
.Each vector has a label yj
The goal is to find a vector W = (w1, w2, …, wd) with wj is a real number such that:
The labeled points are clearly separated by a line:
.
Dot is spam, minus is ham!
.
3. Linear classifiers#
Each feature
i
as a weight wiPrediction is based on the weighted sum:
If f(x) is:
Positive: predict +1
Negative: predict -1
.
4. Support Vector Machine#
Originally developed by Vapnik and collaborators as a linear classifier.
Could be modified to support non-linear classification by mapping into high-dimensional spaces.
Problem statement:
We want to separate
+
from-
using a line.Training examples:
.
Each example
i
:.
.
Inner product:
.
Which is the best linear separate defined by w?
.
5. Support Vector Machine: largest margin#
Distance from the separating line corresponds to the confidence of the prediction.
For example, we are more sure about the class of
A
andB
than ofC
..
Margin definition:
.
.
Maximizing the margin while identifying
w
is good according to intuition, theory, and practice..
A math question: how do you narrate this equation?
.
6. Support Vector Machine: what is the margin?#
Slide from the book
.
Notation:
Gamma
is the distance from point A to the linear separator L:d(A,L) = |AH|
If we select a random point M on line L, then d(A,L) is the projection of AM onto vector
w
.If we assume the normalized Euclidean value of
w
,|w|
, is equal to one, that bring us to the result in the slide.
In other words, maximizing the margin is directly related to how
w
is chosen.For the ith data point:
.
7. Some more math …#
.
After some more mathematical manipulations:
.
Everything comes back to an optimization problem on
w
:.
.
8. SVM: Non-linearly separable data#
.
For each data point:
If margin greater than 1, don’t care.
If margin is less than 1, pay linear penalty.
Introducing slack variables:
.
.
9. Hands-on: SVM#
Download the set of inaugural speeches from https://www.cs.wcupa.edu/lngo/data/bank.csv
Activate the
pyspark
conda environment, installpandas
, then launch Jupyter notebook
$ conda activate pyspark
$ conda install -y pandas
$ jupyter notebook
Create a new notebook using the
pyspark
kernel, then change the notebook’s name tospark-7
.Copy the code from
spark-6
to setup and launch a Spark application.Documentation:
Question: Can you predict whether a client will subscribe to a term deposit (feature
deposit
)?Problems:
What data should the bank data be converted to?
How to handle categorical data?
{% include links.md %}