Syllabus


Introduction


Figure 1

Scientific process

Figure 2

Examples of data analytic problems

Figure 3

## Gartner Hype Cycle 2014


Figure 4

## Gartner Hype Cycle 2019


Figure 5

## Gartner Hype Cycle 2020


Figure 6

Data parallel programming

MapReduce Programming Paradigm


Figure 1

MapReduce wordcount workflow

Figure 2

  • There is one Reduce function call per unique key k'.

    Mapping and reducing

  • Figure 3

    at scale 1

    Figure 4

    at scale 2
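The wordcount workflow and the rule of one Reduce call per unique key can be sketched in plain Python. This is a minimal single-machine illustration, not a distributed implementation; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit one (word, 1) pair per word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: called exactly once per unique key with all of its values."""
    return key, sum(values)

def word_count(lines):
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())

counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
```

Here `counts` maps each distinct word to its total; the number of `reduce_phase` calls equals the number of distinct keys produced by the map phase.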

    Spark Computing Environment


    Figure 1

    Spark

    Figure 2

    Spark application architecture

    Figure 3


    Figure 4


    Figure 5


    Figure 6


    Figure 7


    Figure 8


    Figure 9


    Figure 10


    Figure 11


    Figure 12


    Figure 13


    Figure 14


    Data Parallel Computing with Spark


    Page Rank


    Figure 1


    Figure 2


    Figure 3


    Figure 4


    Figure 5


    Figure 6


    Figure 7

  • The order of pages in the rank vector should be the same as the order of pages in rows and columns of M.

  • Figure 8


    Figure 9


    Figure 10


    Figure 11


    Figure 12

  • Example:

  • Figure 13


    Figure 14


    Figure 15



    Figure 16


    Figure 17



    Figure 18


    Figure 19
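The requirement that the rank vector use the same page order as the rows and columns of M can be illustrated with a small power-iteration sketch. It assumes M is column-stochastic (M[i][j] is the probability of moving from page j to page i) and adds a teleport parameter `beta`, which is an assumption of this example and not taken from the text above; the three-page graph is also made up for illustration.

```python
def pagerank(M, beta=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a column-stochastic matrix M (list of rows).

    Entry r[i] of the result is the rank of the page that indexes
    row i and column i of M, so the ordering must be consistent.
    """
    n = len(M)
    r = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_r = [
            beta * sum(M[i][j] * r[j] for j in range(n)) + (1 - beta) / n
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r

# Hypothetical graph: y -> {y, a}, a -> {y, m}, m -> {a}
pages = ["y", "a", "m"]
M = [
    [0.5, 0.5, 0.0],  # row for y
    [0.5, 0.0, 1.0],  # row for a
    [0.0, 0.5, 0.0],  # row for m
]
ranks = dict(zip(pages, pagerank(M)))
no_teleport = pagerank(M, beta=1.0)
```

Because `pages` lists the pages in the same order as the rows and columns of `M`, zipping it with the result attaches each rank to the right page.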


    Locality Sensitive Hashing


    Figure 1


    Figure 2


    Figure 3


    Figure 4


    Figure 5


    Figure 6


    Figure 7


    Frequent Itemsets


    Figure 1

    Repeat the process with itemsets of increasing size, adding items only to the sets already found to be frequent.
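The level-wise process described above resembles the A-Priori approach; the following is a minimal in-memory sketch under that assumption. The function name `apriori`, the `support` threshold parameter, and the sample baskets are hypothetical and serve only as illustration.

```python
from itertools import combinations

def apriori(baskets, support):
    """Return {itemset: count} for every itemset appearing in >= support baskets."""
    baskets = [frozenset(b) for b in baskets]

    # Pass 1: count single items and keep the frequent ones.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Build size-k candidates only from sets already found frequent.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= support}
        result.update(frequent)
        k += 1
    return result

baskets = [
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer"},
]
frequent = apriori(baskets, support=3)
```

In this toy data the triple is never counted, because one of its pairs is not frequent and the candidate is pruned before the counting pass.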


    Clustering


    Recommendation Systems


    Distributed Machine Learning with Spark


    Figure 1


    Figure 2

  • The labeled points are clearly separated by a line:

  • Figure 3


    Figure 4

  • Prediction is based on the weighted sum:

  • Figure 5


    Figure 6

  • Training examples:

  • Figure 7

  • Each example i:

  • Figure 8

  • Inner product:

  • Figure 9
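The formulas for the notation in the preceding bullets appear only in the figures. As a hedged reconstruction, the standard textbook notation for a linear classifier is sketched below; the bias term b, the dimension d, and the ±1 label convention are assumptions and may differ from the original slides.

```latex
\begin{align*}
\text{Training examples: } & (x_1, y_1), \dots, (x_n, y_n) \\
\text{Each example } i\text{: } & x_i = \bigl(x_i^{(1)}, \dots, x_i^{(d)}\bigr) \in \mathbb{R}^d,
    \quad y_i \in \{-1, +1\} \\
\text{Inner product: } & w \cdot x = \sum_{j=1}^{d} w^{(j)} x^{(j)} \\
\text{Prediction: } & \hat{y} = \operatorname{sign}(w \cdot x + b)
\end{align*}
```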


    Figure 10

  • For example, we are more sure about the class of A and B than of C.

  • Figure 11

  • Margin definition:

  • Figure 12

  • Maximizing the margin while identifying w is good according to intuition, theory, and practice.

  • Figure 13


    Figure 14

    Slide from the book.


    Figure 15



    Figure 16

  • After some more mathematical manipulations:

  • Everything comes back to an optimization problem on w:
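The optimization problem itself is not preserved in the text. A commonly used form of the hard-margin problem is sketched here as an assumption: with the scaling chosen so that the closest points satisfy y_i (w · x_i + b) = 1, the margin equals 1/‖w‖, so maximizing the margin amounts to minimizing ‖w‖². The bias term b and the index range are part of this assumed notation.

```latex
\begin{align*}
\min_{w,\,b} \quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to} \quad & y_i \,(w \cdot x_i + b) \ge 1
    \quad \text{for all } i = 1, \dots, n
\end{align*}
```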


    Figure 17

  • For each data point:

    • If the margin is greater than 1, there is no penalty.

    • If the margin is less than 1, pay a linear penalty.

  • Introducing slack variables:


    Figure 18



    Figure 19

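The per-point penalty rule described above (no penalty when the margin exceeds 1, a linear penalty otherwise) corresponds to what is usually called the hinge loss. The sketch below assumes the common soft-margin objective of one half of ‖w‖² plus C times the summed penalties; the bias `b`, the trade-off constant `C`, and the sample data are assumptions introduced for this example.

```python
def margin(w, b, x, y):
    """Functional margin y * (w . x + b) of one labeled point."""
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def hinge_loss(w, b, x, y):
    """Zero when the margin is at least 1, otherwise the linear shortfall 1 - margin."""
    return max(0.0, 1.0 - margin(w, b, x, y))

def objective(w, b, data, C=1.0):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    reg = 0.5 * sum(wj * wj for wj in w)
    return reg + C * sum(hinge_loss(w, b, x, y) for x, y in data)

# Hypothetical labeled points (x, y) with y in {-1, +1}
data = [((2.0, 2.0), +1), ((0.5, 0.5), +1), ((-1.0, -1.0), -1)]
w, b = (0.5, 0.5), 0.0

losses = [hinge_loss(w, b, x, y) for x, y in data]
total = objective(w, b, data)
```

Only the second point, whose margin is 0.5, contributes a penalty; each hinge loss value plays the role of that point's slack variable.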