MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster (= shared-nothing networked computers)
Map-Reduce Computing Pattern:
A MapReduce program consists of:
A Map( ) procedure
A Reduce( ) procedure
The Map( ) procedure will:
Filter/sort the input into separate queues
There is one queue for each type/category/group of items
The Reduce( ) procedure will:
Perform a computation (e.g., a summary) on all items in each queue (group)
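The two procedures can be sketched in a few lines of Python. This is only a single-machine illustration of the pattern, not a real MapReduce implementation; the names map_items, reduce_queues and the parity example are assumptions made up for this sketch.

```python
from collections import defaultdict

def map_items(items, group_of):
    """Map step: place every input item into the queue for its group."""
    queues = defaultdict(list)
    for item in items:
        queues[group_of(item)].append(item)
    return dict(queues)

def reduce_queues(queues, summarize):
    """Reduce step: compute one summary value per queue (group)."""
    return {group: summarize(members) for group, members in queues.items()}

# Hypothetical example: group integers by parity, then sum each group.
numbers = [1, 2, 3, 4, 5, 6]
queues = map_items(numbers, lambda n: "even" if n % 2 == 0 else "odd")
totals = reduce_queues(queues, sum)

print(queues)  # {'odd': [1, 3, 5], 'even': [2, 4, 6]}
print(totals)  # {'odd': 9, 'even': 12}
```

Here the grouping rule and the summary are passed in as functions, which mirrors the idea that the pattern stays fixed while the Map and Reduce logic changes per program.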
Schematically:
Note:
The MapReduce program is executed in parallel by all (networked) processors
The input data is distributed over all processors
Each Map( ) result must be transmitted to the designated processor that takes care of its specific queue (group)
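The three points in this note can be simulated in one Python process. The sketch below is an assumption-laden illustration: the "processors" are just list positions, the cluster size NUM_PROCESSORS is invented, and choosing the designated processor with a CRC32 hash modulo the processor count is one plausible scheme, not necessarily what any particular framework does.

```python
import zlib
from collections import defaultdict

NUM_PROCESSORS = 3  # assumed cluster size, for illustration only

def designated_processor(group, num_processors=NUM_PROCESSORS):
    """Pick the processor that owns a group's queue.

    crc32 is used instead of the built-in hash() so that the choice is
    the same on every machine and on every run.
    """
    return zlib.crc32(group.encode("utf-8")) % num_processors

def split_input(items, num_processors=NUM_PROCESSORS):
    """Distribute the input round-robin over all processors."""
    return [items[i::num_processors] for i in range(num_processors)]

def shuffle(local_map_results, num_processors=NUM_PROCESSORS):
    """Send each (group, item) pair to the processor designated for its group."""
    inboxes = [defaultdict(list) for _ in range(num_processors)]
    for pairs in local_map_results:
        for group, item in pairs:
            target = designated_processor(group, num_processors)
            inboxes[target][group].append(item)
    return inboxes

words = ["apple", "avocado", "banana", "blueberry", "cherry", "apricot"]
chunks = split_input(words)
# Each simulated processor maps its own chunk; the group is the first letter.
local_map_results = [[(w[0], w) for w in chunk] for chunk in chunks]
inboxes = shuffle(local_map_results)

for processor_id, inbox in enumerate(inboxes):
    print(processor_id, dict(inbox))
```

Whatever processor a word starts on, all words of the same group end up in the same inbox, which is what lets the Reduce( ) step for that group run on a single processor.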
The Map-Reduce framework
Fact:
There exist software frameworks (implementations) that execute:
A user-provided Map( ) function
A user-provided Reduce( ) function
within the Map-Reduce environment:
Framework for execution of Map-Reduce programs:
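To make the division of labour concrete, here is a toy, single-process stand-in for such a framework. It is a sketch under stated assumptions: run_mapreduce and the two word-count functions are hypothetical names, and the function signatures are chosen for this example rather than taken from any specific framework.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy stand-in for a Map-Reduce framework (no parallelism, no network).

    map_fn(record)           -> iterable of (group, value) pairs
    reduce_fn(group, values) -> one result for that group
    """
    queues = defaultdict(list)
    for record in records:                      # map phase
        for group, value in map_fn(record):
            queues[group].append(value)
    # reduce phase: one call per queue (group)
    return {group: reduce_fn(group, values) for group, values in queues.items()}

# User-provided functions (hypothetical word-count example).
def count_words_map(line):
    return [(word, 1) for word in line.split()]

def count_words_reduce(word, counts):
    return sum(counts)

lines = ["to be or not to be", "to do"]
word_counts = run_mapreduce(lines, count_words_map, count_words_reduce)
print(word_counts)  # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```

The user writes only count_words_map and count_words_reduce; everything else (queueing, grouping, calling Reduce once per group) belongs to the framework.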
The input is structured as:
(key, value)
The map( ) function will perform the following transformation: