MapReduce Programming Paradigm

Last updated on 2024-07-26

Overview

Questions

  • What is the MapReduce programming paradigm, and what problems does it solve?

Objectives

  • Explain the map and reduce operations and how the MapReduce framework distributes computation across machines

1. The reality of working with big data (as outlined by the original MapReduce paper)


  • Challenges:
    • input data is usually large,
    • the computations have to be distributed across hundreds or thousands of machines, and
    • the job has to finish in a reasonable amount of time.
  • Addressing these challenges causes the original simple computation to become obscured by large amounts of supporting, complex code.
  • What MapReduce does:
    • provides a new abstraction that
    • expresses the simple core computations, and
    • hides the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.
  • Why MapReduce?
    • It is inspired by the _map_ and _reduce_ primitives present in Lisp and many other functional languages.
    • Most data computations involve:
      • applying a _map_ operation to each logical record in the input to compute a set of intermediate key/value pairs, and then
      • applying a _reduce_ operation to all the values that share the same key to combine the derived data appropriately.

2. MapReduce in a nutshell


  • What is map? A function/procedure that is applied to every individual element of a collection/list/array/… (see the Python sketch after this list)

    int square(int x) { return x * x; }
    map square [1,2,3,4] -> [1,4,9,16]

  • What is reduce? A function/procedure that performs an operation on a list, folding/reducing the list into a single value (or a smaller subset):

    reduce ([1,2,3,4]) using sum -> 10
    reduce ([1,2,3,4]) using multiply -> 24
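
These two primitives can be tried directly in Python, whose built-in map and functools.reduce mirror the Lisp-style operations above (a minimal sketch, independent of any MapReduce library):

```python
from functools import reduce

def square(x):
    return x * x

# map applies square to every element of the list
print(list(map(square, [1, 2, 3, 4])))           # [1, 4, 9, 16]

# reduce folds the list into a single value
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10 (sum)
print(reduce(lambda a, b: a * b, [1, 2, 3, 4]))  # 24 (multiply)
```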

3. Word Count: the “Hello, World” of big data


  • We have a large amount of text…
    • It could be stored in a single massive file.
    • It could be stored in multiple files.
  • We want to count the number of times each distinct word appears in the files.
  • Sample applications:
    • Analyze web server logs to find popular URLs/keywords.
    • Analyze security logs to find incidents.
  • Standard parallel programming approach:
    • Count the number of files, or set up seek distances within a file.
    • Set the number of processes.
    • Possibly set up dynamic workload assignment.
    • A lot of data transfer.
    • Significant coding effort (contrast this with the serial version sketched below).
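
For contrast, the serial core computation is only a few lines of Python (a sketch; the input file names are hypothetical). It is this simplicity that MapReduce preserves while the library handles the distribution:

```python
from collections import Counter

counts = Counter()
for filename in ["part-0001.txt", "part-0002.txt"]:  # hypothetical input files
    with open(filename) as f:
        for line in f:
            counts.update(line.split())

# Print the ten most frequent words and their counts.
for word, n in counts.most_common(10):
    print(word, n)
```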

4. Word Count: MapReduce workflow


[Figure: MapReduce word count workflow]

5. Word Count: MapReduce workflow - what you actually do


  • Input: a set of key-value pairs
  • Programmer specifies two methods (sketched in Python below):
    • Map(k, v) -> <(k', v')>:
      • Takes a key-value pair and outputs a set of intermediate key-value pairs.
        • E.g., the key is the file name and the value is a single line in the file.
      • There is one Map call for every (k, v) pair.
    • Reduce(k', <v'>) -> <(k', v'')>:
      • All intermediate values v' with the same key k' are reduced together and processed in v' order.
      • There is one Reduce call per unique key k'.
  • MapReduce environment takes care of:
    • Partitioning the input data
    • Scheduling the program’s execution across a set of machines
    • Performing the group-by-key step
    • Handling machine failures
    • Managing required inter-machine communication
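
As a concrete illustration, here is a single-machine Python sketch of the two methods the programmer writes, plus the group-by-key step that the environment normally performs. The function names and input lines are invented for the example:

```python
from collections import defaultdict

def map_fn(filename, line):
    # Map(k, v): key is the file name, value is one line of text.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce(k', <v'>): sum all counts emitted for one word.
    return (word, sum(counts))

lines = [("doc1.txt", "the quick brown fox"),
         ("doc1.txt", "the lazy dog"),
         ("doc2.txt", "the end")]

# Map phase: one map_fn call per (k, v) pair, followed by the
# group-by-key ("shuffle") step that the library performs.
intermediate = defaultdict(list)
for filename, line in lines:
    for word, count in map_fn(filename, line):
        intermediate[word].append(count)

# Reduce phase: one reduce_fn call per unique key k'.
results = [reduce_fn(word, counts) for word, counts in intermediate.items()]
print(sorted(results))  # [('brown', 1), ('dog', 1), ..., ('the', 3)]
```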

6. Word Count: MapReduce workflow at scale


[Figures: MapReduce word count workflow at scale]

7. Applications well suited to the MapReduce programming paradigm


  • Text tokenization, indexing, and search
    • Web access log stats
    • Inverted index construction (sketched below)
    • Term-vector per host
    • Distributed grep/sort
  • Graph creation
    • Web link-graph reversal (e.g., for Google’s PageRank)
  • Data mining and machine learning
    • Document clustering
    • Machine learning
    • Statistical machine translation
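
Many of these applications follow the same two-function shape as word count. For example, inverted index construction can be sketched in Python as follows (the document IDs and contents are invented for the example):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit (word, doc_id) once for every distinct word in the document.
    for word in set(text.split()):
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # The posting list for a word: all documents that contain it.
    return (word, sorted(doc_ids))

docs = [("d1", "big data on big clusters"),
        ("d2", "data mining at scale")]

# Map phase plus group-by-key, as in the word count example.
grouped = defaultdict(list)
for doc_id, text in docs:
    for word, d in map_fn(doc_id, text):
        grouped[word].append(d)

# Reduce phase: one call per unique word.
for word in sorted(grouped):
    print(reduce_fn(word, grouped[word]))  # e.g. ('data', ['d1', 'd2'])
```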

{% include links.md %}