In the map-reduce programming model,
work is divided into two phases: a map phase and a reduce phase.
Both of these phases work on key-value pairs. What these pairs contain is
completely up to you: they could be URLs paired with counts of how many pages
link to them, or movie IDs paired with ratings. It all depends on how you write
and set up your map-reduce job.
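For instance, such pairs might look like the following (a sketch in Python; the model itself is language-agnostic, and these jobs are purely hypothetical):

```python
# Pairs from two hypothetical jobs; the key and value types are
# whatever the job author chooses.
url_in_links = [("https://example.com/a", 42), ("https://example.com/b", 7)]
movie_ratings = [("movie_123", 4.5), ("movie_456", 3.0)]
```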
A map-reduce program typically proceeds as follows (a concrete word-count sketch appears after this list):
- Input data, such as a long text file, is split into
key-value pairs. These key-value pairs are then fed to your mapper. (This
is the job of the map-reduce framework.)
- Your mapper processes each key-value pair individually
and outputs one or more intermediate key-value pairs.
- All intermediate key-value pairs are collected, sorted,
and grouped by key (again, the responsibility of the framework).
- For each unique key, your reducer receives the
key with a list of all the values associated with it. The reducer
aggregates these values in some way (adding them up, taking averages,
finding the maximum, etc.) and outputs one or more output key-value pairs.
- Output pairs are collected and stored in an output file
(by the framework).
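To make these steps concrete, here is a minimal single-process sketch in Python of the classic word-count job. The `mapper` and `reducer` are the two functions you would write yourself, while `run_job` plays the part of the framework; all names here are illustrative and not tied to any particular map-reduce implementation:

```python
from collections import defaultdict

def mapper(key, value):
    # Map: for each input pair (here: line number, line of text),
    # emit one intermediate (word, 1) pair per word.
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    # Reduce: aggregate all values collected for one key; for
    # word count, that simply means summing them.
    yield (key, sum(values))

def run_job(text):
    # Split the raw input into (line number, line) key-value pairs.
    input_pairs = enumerate(text.splitlines())

    # Map each input pair independently.
    intermediate = [out for k, v in input_pairs for out in mapper(k, v)]

    # Collect, sort, and group the intermediate pairs by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce each group and collect the output pairs.
    return [out for k in sorted(groups) for out in reducer(k, groups[k])]

print(run_job("the quick fox\nthe lazy dog"))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

Note that the mapper never sees more than one pair at a time, and the reducer never sees more than one key at a time; this is exactly what makes the next point about parallelism work.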
What makes this model so well suited to
parallel programming should be apparent from the figure above: each key-value
pair can be mapped or reduced independently. This means that many different
processors, or even machines, can each take a section of the data and process
it separately, a classic example of data parallelism. The only step that truly
requires synchronization is the collect-and-sort phase, which is handled by
the framework (and, when done carefully, even this can be parallelized).
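As a sketch of that data parallelism, assuming the word-count mapper from the example above, the map phase can be spread across worker processes with no coordination between them. Here `multiprocessing.Pool` stands in for the scheduling that a real framework would do:

```python
from multiprocessing import Pool

def mapper(key, value):
    # Same word-count mapper as in the sketch above.
    return [(word, 1) for word in value.split()]

def map_chunk(chunk):
    # Each worker maps its slice of input pairs completely independently.
    return [out for k, v in chunk for out in mapper(k, v)]

if __name__ == "__main__":
    text = "the quick fox\nthe lazy dog\nthe end"
    input_pairs = list(enumerate(text.splitlines()))

    # Hand each of n workers an independent slice of the input.
    n = 2
    chunks = [input_pairs[i::n] for i in range(n)]

    with Pool(n) as pool:
        # The only synchronization point is gathering the mapped pairs
        # back together for the sort-and-group step.
        mapped = [out for part in pool.map(map_chunk, chunks) for out in part]
    print(mapped)
```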
So, once you can fit a problem into
this model, parallelizing it becomes very easy. What may be less obvious
is how to fit a problem into the model in the first place.
Real-life MapReduce

Tree of Maps: