MapReduce Programming Paradigm
Last updated on 2024-07-26
Overview
Questions
- What is the MapReduce programming paradigm, and what problems does it solve?
Objectives
- Explain the MapReduce programming paradigm and its map and reduce phases
1. The reality of working with big data
- Hundreds or thousands of machines are needed to support big data.
- Distribute data for storage (not within the scope of this class).
- Parallelize data computation (MapReduce).
- Handle failure (MapReduce).
- The paper: "MapReduce: Simplified Data Processing on Large Clusters" (Dean and Ghemawat, OSDI 2004).
2. The reality of working with big data (as outlined by the paper)
- Challenges:
- input data is usually large
- the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time.
- Addressing these challenges causes the original simple computation to become obscured by large amounts of complex supporting code.
- What MapReduce does:
- is a new abstraction
- expresses the simple core computations
- hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.
- Why MapReduce?
  - Inspired by the _map_ and _reduce_ primitives present in Lisp and many other functional languages.
  - Most data computations involve:
    - applying a _map_ operation to each logical record in the input in order to compute a set of intermediate key/value pairs, and then
    - applying a _reduce_ operation to all the values that share the same key in order to combine the derived data appropriately.
3. MapReduce in a nutshell
- What is `map`? A function/procedure that is applied to every individual element of a collection/list/array/…
  `int square(int x) { return x * x; }`
  `map square [1,2,3,4] -> [1,4,9,16]`
- What is `reduce`? A function/procedure that performs an operation on a list, folding/reducing the list into a single value (or a smaller subset).
  `reduce ([1,2,3,4]) using sum -> 10`
  `reduce ([1,2,3,4]) using multiply -> 24`
  (Both primitives are sketched in runnable form below.)
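As a minimal illustration, here is how the two primitives look in Python, using the built-in `map` and `functools.reduce`; the `square` helper mirrors the pseudocode above.

```python
from functools import reduce
from operator import add, mul

def square(x):
    return x * x

# map applies square to each element of the list independently
print(list(map(square, [1, 2, 3, 4])))  # [1, 4, 9, 16]

# reduce folds the list into a single value with a binary operation
print(reduce(add, [1, 2, 3, 4]))  # 10
print(reduce(mul, [1, 2, 3, 4]))  # 24
```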
4. Word Count: the “Hello, World” of big data
- We have a large amount of text …
- Could be stored in a single massive file.
- Could be stored in multiple files.
- We want to count the number of times each distinct word appears in the files
- Sample applications:
- Analyze web server logs to find popular URLs/keywords.
- Analyze security logs to find incidents.
- Standard parallel programming approach:
  - Count the number of files, or set up seek distances within a file.
  - Set the number of processes.
  - Possibly set up dynamic workload assignment.
  - A lot of data transfer.
  - Significant coding effort.
  (For contrast, the serial version of the core computation is sketched below.)
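Here is a minimal serial word count in Python; the core computation is only a few lines, which is exactly what gets obscured once the parallel bookkeeping above is added. The file name `input.txt` is a hypothetical stand-in for the real input.

```python
from collections import Counter

counts = Counter()
# "input.txt" is a hypothetical file name used for illustration;
# the real input could be one massive file or many files.
with open("input.txt") as f:
    for line in f:
        counts.update(line.split())

# Print each distinct word and how many times it appears.
for word, count in counts.most_common():
    print(word, count)
```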
5. Word Count: MapReduce workflow

6. Word Count: MapReduce workflow - what you really do
- Input: a set of key-value pairs
- Programmer specifies two methods:
  - `Map(k, v) -> <k', v'>`
    - Takes a key-value pair and outputs a set of intermediate key-value pairs.
      - E.g., key is the filename, value is a single line in the file.
    - There is one Map call for every (k, v) pair.
  - `Reduce(k', <v'>) -> <k', v''>`
    - All values `v'` with the same key `k'` are reduced together and processed in `v'` order.
    - There is one Reduce function call per unique key `k'`.
- MapReduce environment takes care of:
- Partitioning the input data
- Scheduling the program’s execution across a set of machines
- Performing the group-by-key step
- Handling machine failures
- Managing required inter-machine communication
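A minimal sketch of the two methods for word count, in plain Python, with the group-by-key shuffle simulated locally. The function names `map_fn` and `reduce_fn` and the tiny in-memory driver are illustrative only; in a real MapReduce deployment the framework performs the shuffle across machines.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # key: filename; value: a single line from that file.
    # Emit one intermediate (k', v') pair per word.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: every count emitted for that word.
    yield (key, sum(values))

# Tiny stand-in for the framework, run on two "files".
lines = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]

# Map phase: one Map call per (k, v) pair.
intermediate = [kv for k, v in lines for kv in map_fn(k, v)]

# Group-by-key (shuffle): done across machines by the real framework.
intermediate.sort(key=itemgetter(0))
for key, group in groupby(intermediate, key=itemgetter(0)):
    for word, total in reduce_fn(key, (v for _, v in group)):
        print(word, total)
```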
7. Word Count: MapReduce workflow at scale


8. Applications well suited to the MapReduce programming paradigm
- Text tokenization, indexing, and search
- Web access log stats
- Inverted index construction
- Term-vector per host
- Distributed grep/sort
- Graph creation
- Web link-graph reversal (Google’s PageRank)
- Data Mining and machine learning
- Document clustering
- Machine learning
- Statistical machine translation
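To make one item from this list concrete, here is a sketch of inverted index construction in the same style as the word-count example (the function names are illustrative, not part of any real framework API): the mapper emits (word, document ID) pairs, and the reducer collects each word's posting list.

```python
def map_fn(doc_id, text):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(text.split()):
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # The posting list: all documents containing this word.
    yield (word, sorted(set(doc_ids)))
```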
{% include links.md %}