MapReduce Programming Paradigm

Last updated on 2024-07-26

Overview

Questions

  • What is the MapReduce programming paradigm, and what problems does it solve?

Objectives

  • Explain the map and reduce operations and how the MapReduce framework distributes computation across machines

1. The reality of working with big data (as outlined by the original MapReduce paper)


  • Challenges:
    • input data is usually large,
    • the computations have to be distributed across hundreds or thousands of machines, and
    • the job has to finish in a reasonable amount of time.
  • Addressing these challenges causes the original simple computation to become obscured by large amounts of supporting, complex code.
  • What MapReduce does:
    • provides a new abstraction that
    • expresses the simple core computations, and
    • hides the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.
  • Why MapReduce?
    • It is inspired by the _map_ and _reduce_ primitives present in Lisp and many other functional languages.
    • Most data computations involve:
      • applying a _map_ operation to each logical record in the input to compute a set of intermediate key/value pairs, and then
      • applying a _reduce_ operation to all the values that share the same key to combine the derived data appropriately.

2. MapReduce in a nutshell


  • What is map? A function/procedure that is applied to every individual element of a collection/list/array/… (see the Python sketch after this list)

    int square(int x) { return x * x; }
    map square [1,2,3,4] -> [1,4,9,16]

  • What is reduce? A function/procedure that performs an operation on a list, folding/reducing the list into a single value (or a smaller subset):

    reduce ([1,2,3,4]) using sum -> 10
    reduce ([1,2,3,4]) using multiply -> 24
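
These two primitives can be tried directly in Python, whose built-in map and functools.reduce mirror the Lisp-style operations above (a minimal sketch, independent of any MapReduce library):

```python
from functools import reduce

def square(x):
    return x * x

# map applies square to every element of the list
print(list(map(square, [1, 2, 3, 4])))           # [1, 4, 9, 16]

# reduce folds the list into a single value
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10 (sum)
print(reduce(lambda a, b: a * b, [1, 2, 3, 4]))  # 24 (multiply)
```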

3. Word Count: the “Hello, World” of big data


  • We have a large amount of text…
    • It could be stored in a single massive file.
    • It could be stored in multiple files.
  • We want to count the number of times each distinct word appears in the files.
  • Sample applications:
    • Analyze web server logs to find popular URLs/keywords.
    • Analyze security logs to find incidents.
  • Standard parallel programming approach:
    • Count the number of files, or set up seek distances within a file.
    • Set the number of processes.
    • Possibly set up dynamic workload assignment.
    • A lot of data transfer.
    • Significant coding effort (contrast this with the serial version sketched below).
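
For contrast, the serial core computation is only a few lines of Python (a sketch; the input file names are hypothetical). It is this simplicity that MapReduce preserves while the library handles the distribution:

```python
from collections import Counter

counts = Counter()
for filename in ["part-0001.txt", "part-0002.txt"]:  # hypothetical input files
    with open(filename) as f:
        for line in f:
            counts.update(line.split())

# Print the ten most frequent words and their counts.
for word, n in counts.most_common(10):
    print(word, n)
```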

4. Word Count: MapReduce workflow


[Figure: MapReduce word count workflow]

5. Word Count: MapReduce workflow - what you actually do


  • Input: a set of key-value pairs
  • Programmer specifies two methods (sketched in Python below):
    • Map(k, v) -> <(k', v')>:
      • Takes a key-value pair and outputs a set of intermediate key-value pairs.
        • E.g., the key is the file name and the value is a single line in the file.
      • There is one Map call for every (k, v) pair.
    • Reduce(k', <v'>) -> <(k', v'')>:
      • All intermediate values v' with the same key k' are reduced together and processed in v' order.
      • There is one Reduce call per unique key k'.
  • MapReduce environment takes care of:
    • Partitioning the input data
    • Scheduling the program’s execution across a set of machines
    • Performing the group-by-key step
    • Handling machine failures
    • Managing required inter-machine communication
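
As a concrete illustration, here is a single-machine Python sketch of the two methods the programmer writes, plus the group-by-key step that the environment normally performs. The function names and input lines are invented for the example:

```python
from collections import defaultdict

def map_fn(filename, line):
    # Map(k, v): key is the file name, value is one line of text.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce(k', <v'>): sum all counts emitted for one word.
    return (word, sum(counts))

lines = [("doc1.txt", "the quick brown fox"),
         ("doc1.txt", "the lazy dog"),
         ("doc2.txt", "the end")]

# Map phase: one map_fn call per (k, v) pair, followed by the
# group-by-key ("shuffle") step that the library performs.
intermediate = defaultdict(list)
for filename, line in lines:
    for word, count in map_fn(filename, line):
        intermediate[word].append(count)

# Reduce phase: one reduce_fn call per unique key k'.
results = [reduce_fn(word, counts) for word, counts in intermediate.items()]
print(sorted(results))  # [('brown', 1), ('dog', 1), ..., ('the', 3)]
```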

6. Word Count: MapReduce workflow at scale


[Figures: MapReduce word count workflow at scale]

7. Applications well suited to the MapReduce programming paradigm


  • Text tokenization, indexing, and search
    • Web access log stats
    • Inverted index construction (sketched below)
    • Term-vector per host
    • Distributed grep/sort
  • Graph creation
    • Web link-graph reversal (e.g., for Google’s PageRank)
  • Data mining and machine learning
    • Document clustering
    • Machine learning
    • Statistical machine translation
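
Many of these applications follow the same two-function shape as word count. For example, inverted index construction can be sketched in Python as follows (the document IDs and contents are invented for the example):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit (word, doc_id) once for every distinct word in the document.
    for word in set(text.split()):
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # The posting list for a word: all documents that contain it.
    return (word, sorted(doc_ids))

docs = [("d1", "big data on big clusters"),
        ("d2", "data mining at scale")]

# Map phase plus group-by-key, as in the word count example.
grouped = defaultdict(list)
for doc_id, text in docs:
    for word, d in map_fn(doc_id, text):
        grouped[word].append(d)

# Reduce phase: one call per unique word.
for word in sorted(grouped):
    print(reduce_fn(word, grouped[word]))  # e.g. ('data', ['d1', 'd2'])
```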

{% include links.md %}