Distributed Machine Learning with Spark

Last updated on 2024-07-26

Estimated time: 60 minutes

Overview

Questions

  • How do we train a machine learning model, such as a support vector machine, at scale with Spark?

Objectives

  • Explain how linear classifiers and support vector machines work, and train an SVM on data using Spark.


1. Application: Spam Filtering

|       | viagra | learning | the | dating | nigeria | spam?      |
|-------|--------|----------|-----|--------|---------|------------|
| $x_1$ | 1      | 0        | 1   | 0      | 0       | $y_1 = 1$  |
| $x_2$ | 0      | 1        | 1   | 0      | 0       | $y_2 = -1$ |
| $x_3$ | 0      | 0        | 0   | 0      | 1       | $y_3 = 1$  |
  • Instances $x_1$, $x_2$, $x_3$ are data points belonging to the instance space $X$
    • Each instance is a binary or real-valued feature vector $x$ of word occurrences (see the sketch below)
    • There are $d$ features (words and other attributes), with $d \approx 100{,}000$
  • Class $y$
    • Spam: $y = 1$
    • Ham: $y = -1$
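
To make this representation concrete, here is a minimal Python sketch that turns a message into a binary feature vector over a toy vocabulary (the vocabulary and message are invented for illustration; a real vocabulary would have $d \approx 100{,}000$ entries):

```python
# Toy vocabulary standing in for the d ~ 100,000 real features
vocabulary = ["viagra", "learning", "the", "dating", "nigeria"]

def to_feature_vector(message, vocabulary):
    """Binary feature vector: 1 if the word occurs in the message, else 0."""
    words = set(message.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

print(to_feature_vector("Learning the basics", vocabulary))
# [0, 1, 1, 0, 0] -> matches row x2 in the table above
```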

2. Linear models for classification

  • Each data point $x_j$ is a vector of real values
  • The goal is to find a vector $w = (w_1, w_2, \dots, w_d)$, with each $w_j$ a real number, such that:
    • The labeled points are clearly separated by a line: $w \cdot x + b = 0$
  • In the plot, a dot is spam, a minus is ham!


3. Linear classifiers

  • Each feature $i$ has a weight $w_i$
  • Prediction is based on the weighted sum (see the sketch below): $f(x) = \sum_i w_i x_i = w \cdot x$
  • If $f(x)$ is:
    • Positive: predict +1
    • Negative: predict -1

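As an illustration, here is a minimal NumPy sketch of such a linear classifier (the weight values are made up for the example):

```python
import numpy as np

# Made-up weights, one per feature
# (features: viagra, learning, the, dating, nigeria)
w = np.array([2.0, -1.0, 0.1, 1.0, 1.5])

def predict(x, w):
    """Predict +1 (spam) or -1 (ham) from the weighted sum f(x) = w . x."""
    f = np.dot(w, x)
    return 1 if f > 0 else -1

print(predict(np.array([1, 0, 1, 0, 0]), w))  # 1  -> spam (row x1)
print(predict(np.array([0, 1, 1, 0, 0]), w))  # -1 -> ham  (row x2)
```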

4. Support Vector Machine

  • Originally developed by Vapnik and collaborators as a linear classifier.
  • Can be modified to support non-linear classification by mapping data into high-dimensional feature spaces.
  • Problem statement:
    • We want to separate + from - using a line.
    • Training examples: $\{(x_i, y_i)\}_{i=1}^{n}$
    • Each example $i$: $x_i = (x_i^{(1)}, \dots, x_i^{(d)})$, with $y_i \in \{-1, +1\}$
    • Inner product: $w \cdot x = \sum_{j=1}^{d} w^{(j)} x^{(j)}$
  • Which is the best linear separator, defined by $w$?


5. Support Vector Machine: largest margin

  • The distance of a point from the separating line corresponds to the confidence of the prediction.
  • For example, we are more confident about the classes of A and B than about the class of C.
  • Margin definition: the margin $\gamma$ is the distance from the separating line to the closest training point:

    $\gamma = \min_{i} d(x_i, L)$

  • Maximizing the margin while identifying $w$ is good according to intuition, theory, and practice.
  • A math question: how would you narrate this equation in words?


6. Support Vector Machine: what is the margin?

  • Figure: margin illustration (slide from the book).

  • Notation:

    • $\gamma$ is the distance from point A to the linear separator L: $d(A, L) = |AH|$, where H is the foot of the perpendicular from A to L
    • If we select a random point M on the line L, then $d(A, L)$ is the projection of the vector $\overrightarrow{AM}$ onto $w$
    • Projection: $d(A, L) = \frac{|(A - M) \cdot w|}{\|w\|}$
    • If we assume the Euclidean norm of $w$, $\|w\|$, is equal to one, this brings us to the result in the slide: $d(A, L) = |(A - M) \cdot w|$
  • In other words, maximizing the margin is directly related to how w is chosen.

  • For the ith data point, the margin is (a numerical check follows below):

    $\gamma_i = y_i (w \cdot x_i + b)$
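As a quick numerical check, here is a sketch that computes $\gamma_i = y_i (w \cdot x_i + b)$ for a couple of points (the values of $w$, $b$, and the points are invented toy values):

```python
import numpy as np

# Assumed toy separator w . x + b = 0 in 2-D, with ||w|| = 1
w = np.array([0.6, 0.8])  # Euclidean norm is 1
b = -1.0

def margin(x, y, w, b):
    """Signed margin gamma_i = y_i * (w . x_i + b).

    Positive when the point is on the correct side of the separator;
    its magnitude is the distance to the line when ||w|| = 1.
    """
    return y * (np.dot(w, x) + b)

print(margin(np.array([3.0, 2.0]), +1, w, b))  # 2.4  -> confidently spam
print(margin(np.array([0.5, 0.2]), -1, w, b))  # 0.54 -> correctly on the - side
```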

7. Some more math …

  • After some more mathematical manipulations, maximizing the margin $\gamma$ turns out to be equivalent to minimizing $\|w\|$.
  • Everything comes back to an optimization problem on $w$:

    $\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 \ \text{for all } i$

8. SVM: Non-linearly separable data

  • For each data point:
    • If the margin is greater than 1, we don't care (no penalty).
    • If the margin is less than 1, we pay a linear penalty.
  • Introducing slack variables $\xi_i \ge 0$, the optimization becomes (illustrated in the sketch below):

    $\min_{w, b, \, \xi_i \ge 0} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i \ \text{for all } i$
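This linear penalty is the hinge loss, $\max(0,\, 1 - y_i (w \cdot x_i + b))$. Here is a small sketch of how it behaves ($w$, $b$, and the points are the same invented toy values as before):

```python
import numpy as np

w = np.array([0.6, 0.8])
b = -1.0

def hinge_penalty(x, y, w, b):
    """Slack xi_i = max(0, 1 - y_i * (w . x_i + b)).

    Zero when the margin is at least 1; grows linearly as the point
    falls inside the margin or onto the wrong side of the line.
    """
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

print(hinge_penalty(np.array([3.0, 2.0]), +1, w, b))  # 0.0: margin 2.4 >= 1
print(hinge_penalty(np.array([1.0, 1.0]), +1, w, b))  # 0.6: inside the margin
print(hinge_penalty(np.array([1.0, 1.0]), -1, w, b))  # 1.4: wrong side entirely
```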

9. Hands-on: SVM

```bash
$ conda activate pyspark
$ conda install -y pandas
$ jupyter notebook
```
  • Create a new notebook using the pyspark kernel, then change the notebook’s name to spark-7.
  • Copy the code from spark-6 to set up and launch a Spark application.
  • Documentation: see the Spark MLlib guide (https://spark.apache.org/docs/latest/ml-guide.html).
  • Question: can you predict whether a client will subscribe to a term deposit (the deposit column)?
  • Problems (a pipeline sketch follows this list):
    • What format does the bank data need to be converted into?
    • How do we handle categorical data?
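
One possible approach, as a minimal sketch: index and one-hot encode the categorical columns, assemble everything into a single feature vector, and fit a linear SVM with Spark MLlib. The file name bank.csv and the column names (job, marital, education, age, balance, duration, deposit) are assumptions about the dataset; adapt them to your copy of the data:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.appName("spark-7").getOrCreate()

# Assumed file name and columns; adjust to match the actual bank dataset
df = spark.read.csv("bank.csv", header=True, inferSchema=True)
categorical = ["job", "marital", "education"]  # assumed categorical columns
numeric = ["age", "balance", "duration"]       # assumed numeric columns

# Categorical data: map strings to indices, then one-hot encode them
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in categorical],
                        outputCols=[c + "_vec" for c in categorical])

# The label (deposit: yes/no) also has to become a numeric column
label_indexer = StringIndexer(inputCol="deposit", outputCol="label")

# Combine all features into the single vector column MLlib expects
assembler = VectorAssembler(inputCols=[c + "_vec" for c in categorical] + numeric,
                            outputCol="features")

svm = LinearSVC(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=indexers + [encoder, label_indexer, assembler, svm])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)
predictions.select("label", "prediction").show(5)
```

LinearSVC trains a soft-margin linear SVM by optimizing the hinge loss from section 8, with the work distributed across the cluster's executors.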

{% include links.md %}