Distributed Machine Learning with Spark
Last updated on 2024-07-26
Estimated time 60 minutes
Overview
Questions
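- How can a linear classifier such as a support vector machine separate spam from ham?
Objectives
- Explain how linear classifiers and support vector machines work, and apply them to a classification problem with Spark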
Distributed machine learning with Spark
1. Application: Spam Filtering
instance | viagra | learning | the | dating | nigeria | spam? |
---|---|---|---|---|---|---|
X1 | 1 | 0 | 1 | 0 | 0 | y1 = 1 |
X2 | 0 | 1 | 1 | 0 | 0 | y2 = -1 |
X3 | 0 | 0 | 0 | 0 | 1 | y3 = 1 |
- Instances X1, X2, X3 belong to the instance space X (data points)
- Binary or real-valued feature vector X of word occurrences
- `d` features (words and other things; d is approximately 100,000)
- Class Y
  - Spam = 1
  - Ham = -1
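As a minimal sketch of how such feature vectors could be built, the snippet below turns short messages into binary word-occurrence vectors over the five-word vocabulary from the table. The function name `to_feature_vector` and the sample messages are made up for illustration; a real filter would use a vocabulary closer to the 100,000 features mentioned above.

```python
# Vocabulary taken from the table above; a real spam filter would use far more features.
VOCABULARY = ["viagra", "learning", "the", "dating", "nigeria"]


def to_feature_vector(message, vocabulary=VOCABULARY):
    """Return a binary vector: 1 if the vocabulary word occurs in the message, else 0."""
    words = set(message.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical messages, chosen so that they reproduce rows X1..X3 of the table.
messages = [
    ("buy viagra at the pharmacy", 1),          # spam
    ("the learning algorithm converged", -1),   # ham
    ("greetings from nigeria", 1),              # spam
]

X = [to_feature_vector(text) for text, _ in messages]
y = [label for _, label in messages]

for features, label in zip(X, y):
    print(features, label)
```

Splitting on whitespace is the simplest possible tokenizer; punctuation handling and stemming are deliberately left out of this sketch.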
2. Linear models for classification

- Vector Xj contains real values
  - The Euclidean norm is `1`.
- Each vector has a label yj
- The goal is to find a vector W = (w1, w2, …, wd), where each wj is a real number, such that:
  - The labeled points are clearly separated by a line.
  - Dot is spam, minus is ham!
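The unit-norm assumption on each vector can be met by rescaling. The sketch below shows one way to do this in plain Python; the helper names `euclidean_norm` and `normalize` are illustrative and not part of the lesson's code.

```python
import math


def euclidean_norm(x):
    """Return the Euclidean (L2) norm of a vector given as a list of numbers."""
    return math.sqrt(sum(value * value for value in x))


def normalize(x):
    """Scale x to unit Euclidean norm; an all-zero vector is returned unchanged."""
    norm = euclidean_norm(x)
    if norm == 0:
        return list(x)
    return [value / norm for value in x]


x1 = [1, 0, 1, 0, 0]            # row X1 from the spam table
x1_unit = normalize(x1)
print(x1_unit)
print(euclidean_norm(x1_unit))  # approximately 1.0, up to floating-point rounding
```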
3. Linear classifiers
- Each feature `i` has a weight wi
- Prediction is based on the weighted sum:

  $$f(x) = \sum_{i=1}^{d} w_i x_i = w \cdot x$$

- If f(x) is:
  - Positive: predict +1
  - Negative: predict -1
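The prediction rule can be sketched in a few lines of Python. The weights below are hypothetical values picked by hand, not learned from data, and treating a score of exactly zero as -1 is an arbitrary choice that the rule above leaves open.

```python
def f(w, x):
    """Weighted sum of the features: sum over i of w_i * x_i."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def predict(w, x):
    """Predict +1 (spam) for a positive score, -1 (ham) otherwise."""
    return 1 if f(w, x) > 0 else -1


# Hypothetical weights for (viagra, learning, the, dating, nigeria); not learned from data.
w = [2.0, -1.5, -0.1, 1.0, 2.5]

# Rows X1, X2, X3 of the spam table.
X = [[1, 0, 1, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 1]]
for x in X:
    print(f(w, x), predict(w, x))
```

With these weights the three rows are classified as spam, ham, and spam, which matches the labels in the table.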
4. Support Vector Machine
- Originally developed by Vapnik and collaborators as a linear classifier.
- Could be modified to support non-linear classification by mapping into high-dimensional spaces.
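- Problem statement:
  - We want to separate `+` from `-` using a line.
  - Training examples: $(x_1, y_1), \dots, (x_n, y_n)$
  - Each example `i`: $x_i = (x_i^{(1)}, \dots, x_i^{(d)})$ with real-valued $x_i^{(j)}$, and $y_i \in \{-1, +1\}$
  - Inner product: $w \cdot x = \sum_{j=1}^{d} w^{(j)} x^{(j)}$
- Which is the best linear separator defined by w?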
5. Support Vector Machine: largest margin
- Distance from the separating line corresponds to the confidence of the prediction.
- For example, we are more sure about the class of `A` and `B` than of `C`.
- Margin definition: the distance from the separating line to the closest training example.
- Maximizing the margin while identifying `w` is good according to intuition, theory, and practice.
- A math question: how do you narrate this equation?
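To make the idea of a margin concrete, the sketch below compares two candidate separating lines on a made-up 2-D data set. It assumes the line is written as `w.x + b = 0`, where the offset `b` is notation introduced here, and it measures the margin as the smallest signed distance from the line to any correctly labeled point.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the line w.x + b = 0."""
    norm = math.sqrt(sum(w_j * w_j for w_j in w))
    return (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b) / norm


def margin(w, b, points, labels):
    """Smallest y_i * distance(x_i); positive only if every point is on its correct side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))


# Made-up 2-D points: two labeled +1 and two labeled -1.
points = [(3.0, 3.0), (4.0, 1.0), (0.0, 0.0), (1.0, -1.0)]
labels = [1, 1, -1, -1]

# Two candidate separators (w, b); both separate the data, but with different margins.
candidate_a = ((1.0, 1.0), -3.0)
candidate_b = ((1.0, 0.0), -2.0)

for w, b in (candidate_a, candidate_b):
    print(w, b, margin(w, b, points, labels))
```

Both lines classify every point correctly, but the first leaves more room around the closest point, which is the property a support vector machine optimizes for.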
6. Support Vector Machine: what is the margin?
Slide from the book

- Notation:
  - `Gamma` is the distance from point A to the linear separator L: d(A,L) = |AH|
  - If we select a random point M on line L, then d(A,L) is the projection of AM onto vector `w`.
  - If we assume the Euclidean norm of `w`, `|w|`, is equal to one, that brings us to the result in the slide.
- In other words, maximizing the margin is directly related to how `w` is chosen.
- For the ith data point:
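The projection argument can be written out as follows. This is a sketch under assumptions that the notes do not spell out: $M$ is any point on $L$, the line is written as $w \cdot x + b = 0$ (so that $b = -w \cdot M$), and $\|w\| = 1$.

$$
d(A, L) = |AH| = \left| (A - M) \cdot w \right| = \left| w \cdot A - w \cdot M \right| = \left| w \cdot A + b \right|
$$

Under the same assumptions, a correctly classified point $x_i$ with label $y_i \in \{-1, +1\}$ satisfies $y_i (w \cdot x_i + b) = |w \cdot x_i + b|$, so a common way to write the margin of the ith data point is:

$$
\gamma_i = y_i \, (w \cdot x_i + b)
$$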
7. Some more math …
- After some more mathematical manipulations, everything comes back to an optimization problem on `w`.
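One standard way to state such an optimization problem is the textbook hard-margin formulation below. The offset $b$ and the number of training examples $n$ are notation assumed here, and this form may differ in detail from the one on the original slides.

$$
\min_{w,\, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i \, (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n
$$

The reasoning behind it: if $w$ and $b$ are rescaled so that the closest points satisfy $y_i (w \cdot x_i + b) = 1$, their distance to the line is $1 / \|w\|$, so maximizing the margin is the same as minimizing $\|w\|$.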
8. SVM: Non-linearly separable data
- For each data point:
  - If the margin is greater than 1, don’t care.
  - If the margin is less than 1, pay a linear penalty.
- This is done by introducing slack variables.
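The linear penalty can be sketched as the hinge loss, `max(0, 1 - y * (w.x + b))`, which is zero once a point's margin reaches 1 and grows linearly below that. The code assumes the common soft-margin objective `0.5 * |w|^2 + C * (sum of penalties)`; the offset `b`, the trade-off parameter `C`, and the data are illustrative choices rather than values from the lesson.

```python
def hinge_penalty(w, b, x, y):
    """Slack for one point: 0 if y * (w.x + b) >= 1, otherwise the shortfall 1 - y * (w.x + b)."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, points, labels, C=1.0):
    """0.5 * |w|^2 + C * (sum of slacks); C trades margin width against violations."""
    regularizer = 0.5 * sum(w_j * w_j for w_j in w)
    slack = sum(hinge_penalty(w, b, x, y) for x, y in zip(points, labels))
    return regularizer + C * slack


# Made-up 2-D data; the last point is labeled -1 but lies on the positive side of the line.
points = [(3.0, 3.0), (4.0, 1.0), (0.0, 0.0), (3.0, 1.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 1.0), -3.0

for x, y in zip(points, labels):
    print(x, y, hinge_penalty(w, b, x, y))
print(soft_margin_objective(w, b, points, labels, C=1.0))
```

Only the mislabeled-looking point contributes a penalty; the other three have a margin of at least 1 and cost nothing.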
9. Hands-on: SVM
- Download the bank data set from https://www.cs.wcupa.edu/lngo/data/bank.csv
- Activate the `pyspark` conda environment, install `pandas`, then launch Jupyter notebook
$ conda activate pyspark
$ conda install -y pandas
$ jupyter notebook
- Create a new notebook using the `pyspark` kernel, then change the notebook’s name to `spark-7`.
- Copy the code from `spark-6` to set up and launch a Spark application.
- Documentation:
- Question: Can you predict whether a client will subscribe to a term deposit (feature `deposit`)?
- Problems:
  - What data format should the bank data be converted to?
  - How to handle categorical data?
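One possible way to approach the two problems is sketched below; it is a starting point under stated assumptions, not the lesson's reference solution. It assumes Spark 3.x, a `SparkSession` named `spark` already created as in `spark-6`, a header row in `bank.csv`, and a string-valued `deposit` column. The helper names `split_columns` and `build_pipeline` are made up for this example. Categorical columns are indexed and one-hot encoded, everything is assembled into one feature vector, and `LinearSVC` (Spark's linear SVM) is trained on it.

```python
def split_columns(dtypes, label_col):
    """Split (name, type) pairs, as returned by DataFrame.dtypes, into
    categorical (string) and numeric feature names, excluding the label."""
    categorical = [name for name, kind in dtypes if kind == "string" and name != label_col]
    numeric = [name for name, kind in dtypes if kind != "string" and name != label_col]
    return categorical, numeric


def build_pipeline(dtypes, label_col="deposit"):
    """Build an ML pipeline ending in a linear SVM.

    PySpark is imported here so that split_columns can be used without Spark;
    call this function only after a Spark session has been started.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LinearSVC
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    categorical, numeric = split_columns(dtypes, label_col)

    # Categorical strings -> numeric indices -> one-hot vectors.
    indexers = [
        StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
        for c in categorical
    ]
    encoder = OneHotEncoder(
        inputCols=[c + "_idx" for c in categorical],
        outputCols=[c + "_vec" for c in categorical],
    )
    # LinearSVC expects a numeric 0/1 label column.
    label_indexer = StringIndexer(inputCol=label_col, outputCol="label")
    assembler = VectorAssembler(
        inputCols=[c + "_vec" for c in categorical] + numeric,
        outputCol="features",
    )
    svm = LinearSVC(featuresCol="features", labelCol="label", maxIter=50, regParam=0.1)
    return Pipeline(stages=indexers + [encoder, label_indexer, assembler, svm])


# Intended use inside the notebook (assumes `spark` exists and bank.csv is local):
# df = spark.read.csv("bank.csv", header=True, inferSchema=True)
# train, test = df.randomSplit([0.8, 0.2], seed=42)
# model = build_pipeline(df.dtypes).fit(train)
# predictions = model.transform(test)
```

The values of `maxIter` and `regParam` are placeholders; tuning them, and evaluating `predictions` on the held-out split, is left as part of the exercise.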