Big Data Engineering: All Images

Big Data Engineering

Syllabus

Introduction

Figure 1

Scientific process

Figure 2

Examples of data analytic problems

Figure 3

Gartner Hype Cycle 2014

Figure 4

Gartner Hype Cycle 2019

Figure 5

Gartner Hype Cycle 2020

Figure 6

Data parallel programming

MapReduce Programming Paradigm

Figure 1

MapReduce wordcount workflow

Figure 2

There is one Reduce function call per unique key k'.

mapping and reducing
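The word-count workflow and the "one Reduce call per unique key" rule can be sketched in plain Python. This is a minimal single-machine illustration, not Hadoop or Spark code; the function names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group step: collect all values that share the same key k'."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: invoked once for each unique key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


counts = word_count(["to be or not", "to be"])
```

In a real cluster the map calls run in parallel on separate input splits and the shuffle moves data across the network; the sketch only shows the data flow.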

Figure 3

at scale 1

Figure 4

at scale 2

Spark Computing Environment

Figure 1

Spark

Figure 2

Spark application architecture

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Figure 10

Figure 11

Figure 12

Figure 13

Figure 14

Data Parallel Computing with Spark

Page Rank

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

The order of pages in the rank vector should be the same as the order of pages in rows and columns of M.
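The ordering requirement matters because the rank vector is updated by multiplying it with M. A minimal sketch of that power iteration follows, assuming M is column-stochastic (M[i][j] is the probability of moving from page j to page i) and ignoring dead ends and teleportation. The three-page graph and the names `power_iteration`, `pages`, and `ranks` are illustrative assumptions, not taken from the figures.

```python
def power_iteration(M, num_iters=100):
    """Repeatedly apply r <- M r, starting from the uniform rank vector.

    Entry r[i] is the rank of the page assigned to row i and column i of M,
    so the page order must be identical in r and in M.
    """
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(num_iters):
        r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


# Hypothetical graph: y links to y and a, a links to y and m, m links to a.
pages = ["y", "a", "m"]
M = [
    [0.5, 0.5, 0.0],  # row for y
    [0.5, 0.0, 1.0],  # row for a
    [0.0, 0.5, 0.0],  # row for m
]
ranks = dict(zip(pages, power_iteration(M)))
```

For this small graph the ranks settle near 0.4 for y, 0.4 for a, and 0.2 for m. Reordering the rows of M without reordering its columns and `pages` in the same way would assign ranks to the wrong pages.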

Figure 8

Figure 9

Figure 10

Figure 11

Figure 12

Example:

Figure 13

Figure 14

Figure 15

Figure 16

Figure 17

Figure 18

Figure 19

Locality Sensitive Hashing

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Frequent Itemsets

Figure 1

Repeat the process, adding one more item at a time, but only to the sets already found to be frequent.
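The level-by-level idea above resembles the A-Priori approach: size-k candidates are built only from frequent sets of size k-1. The following is a minimal in-memory sketch under that assumption; the function name `apriori`, the `min_support` parameter, and the sample baskets are hypothetical.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: support count} for every itemset meeting min_support."""
    baskets = [frozenset(basket) for basket in baskets]

    def frequent_among(candidates):
        # Count the baskets containing each candidate, keep the frequent ones.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        return {c: s for c, s in counts.items() if s >= min_support}

    # Pass 1: frequent single items.
    items = {item for basket in baskets for item in basket}
    current = frequent_among({frozenset([item]) for item in items})
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Grow candidates only from sets already found to be frequent.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = frequent_among(candidates)
        frequent.update(current)
        k += 1
    return frequent


baskets = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"bread", "beer"},
    {"milk", "beer"},
]
result = apriori(baskets, min_support=2)
```

With these baskets every single item and every pair is frequent, while the triple appears in only one basket and is dropped.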

Clustering

Recommendation Systems

Distributed Machine Learning with Spark

Figure 1

Figure 2

The labeled points are clearly separated by a line:

Figure 3

Figure 4

Prediction is based on the weighted sum:

Figure 5
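A prediction based on a weighted sum can be sketched as follows. This assumes the usual linear classifier form, where the label is the sign of w·x + b; the names `dot` and `predict` and the sample weights are illustrative assumptions.

```python
def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w, b, x):
    """Return +1 if the weighted sum w.x + b is non-negative, else -1."""
    return 1 if dot(w, x) + b >= 0 else -1


w, b = [1.0, -1.0], 0.0          # hypothetical weights and bias
label_pos = predict(w, b, [2.0, 1.0])
label_neg = predict(w, b, [1.0, 3.0])
```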

Figure 6

Training examples:

Figure 7

Each example i:

Figure 8

Inner product:

Figure 9

Figure 10

For example, we are more sure about the class of A and B than of C.

Figure 11

Margin definition:

Figure 12

Maximizing the margin while identifying w is good according to intuition, theory, and practice.
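To make the margin concrete, the sketch below computes the signed distance of each labeled point to the separating hyperplane and takes the smallest one as the margin of the data set. It assumes the common geometric definition y(w·x + b)/‖w‖, which may differ by a scaling factor from the definition shown in the figures; the function names are hypothetical.

```python
import math


def geometric_margin(w, b, x, y):
    """Signed distance of the labeled point (x, y) to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def dataset_margin(w, b, examples):
    """Margin of a data set: the smallest margin over all (x, y) examples."""
    return min(geometric_margin(w, b, x, y) for x, y in examples)


examples = [([3.0, 4.0], 1), ([-1.0, 0.0], -1)]   # hypothetical labeled points
margin = dataset_margin([3.0, 4.0], 0.0, examples)
```

A larger value means every point sits farther from the boundary, which is why the choice of w is driven by maximizing this quantity.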

Figure 13

Figure 14

Slide from the book .

Figure 15

Figure 16

- After some more mathematical manipulations:
- Everything comes back to an optimization problem on w:
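The formulas for this step did not survive in the text here. As an assumption, the optimization problem on w is written below in the standard hard-margin form found in most textbooks, with x_i the training points, y_i in {+1, -1} their labels, and b the bias term; the original slide may use a different but equivalent notation.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \;\ge\; 1 \quad \text{for all } i
```

Minimizing the norm of w under these constraints is equivalent to maximizing the geometric margin, since that margin equals 1/‖w‖ once the closest points are scaled to satisfy the constraint with equality.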

Figure 17

- For each data point:
  - If the margin is greater than 1, don't care.
  - If the margin is less than 1, pay a linear penalty.
- Introducing slack variables:

Figure 18
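The penalty rule above (no cost when the margin exceeds 1, a linear cost otherwise) can be sketched with slack values computed as max(0, 1 - y(w·x + b)). This assumes the usual soft-margin objective, ½‖w‖² plus C times the total slack; the parameter `C` and the function names are assumptions made for illustration.

```python
def slack(w, b, x, y):
    """Slack for one example: zero if its margin is at least 1, else 1 - margin."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def soft_margin_objective(w, b, examples, C=1.0):
    """0.5 * ||w||^2 plus C times the sum of slacks over all (x, y) examples."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    return regularizer + C * sum(slack(w, b, x, y) for x, y in examples)


examples = [
    ([2.0, 0.0], 1),    # margin 2: no penalty
    ([0.5, 0.0], 1),    # margin 0.5: penalty 0.5
    ([1.0, 0.0], -1),   # margin -1 (misclassified): penalty 2
]
objective = soft_margin_objective([1.0, 0.0], 0.0, examples, C=1.0)
```

A larger C punishes margin violations more heavily, while a smaller C favors a wider margin at the cost of more violations.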

Figure 19
