MapReduce: Simplified Data Processing on Large Clusters

Introduction

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Programs can be automatically parallelized and executed on a large cluster of commodity machines.

The problem this paper addresses is parallelizing and distributing the processing of large volumes of raw or derived data (often terabytes in size) produced by different upstream processes and services. To do so, the authors came up with the map and reduce programming model, inspired by the map and reduce primitives of Lisp, a functional programming language.

Programming Model

The user writes Map and Reduce functions.

The Map function takes a (k, v) pair as input and emits a set of intermediate (k, v) pairs.

The MapReduce (MR) library groups the intermediate (k, v) pairs by key k into a list per key, and passes each key with its list to the Reduce function.

The Reduce function takes a key and an iterator over its intermediate values, and applies whatever logic the user specifies to those values. The values are supplied via an iterator because they can be too numerous to fit entirely in memory. On processing, it emits an output value (typically zero or one output value per key).
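The grouping step the MR library performs between Map and Reduce can be sketched in Python as a sort followed by a group-by on the key. This is a local, in-memory stand-in for the real distributed shuffle; the variable names are illustrative only:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical intermediate (k, v) pairs emitted by several map calls.
pairs = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("c", 1)]

# Group all values sharing a key, as the MR library does before
# invoking the user's reduce function on each (key, values) pair.
pairs.sort(key=itemgetter(0))
grouped = {k: [v for _, v in group]
           for k, group in groupby(pairs, key=itemgetter(0))}
# grouped == {"a": [1, 1], "b": [1, 1], "c": [1]}
```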

Simple example of WordCount (counting the number of occurrences of each word in a large collection of documents) with MR

map(String key, String value): 
	// key: document name
	// value: document contents
	for each word w in value:
		EmitIntermediate(w, "1"); 

reduce(String key, Iterator values):
	// key: a word
	// values: a list of counts
	int result = 0;
	for each v in values:
		result += ParseInt(v);
	Emit(AsString(result));

map() takes a document, with the file name as the key and the file contents as the value. It iterates over the contents word by word and, for every word w, emits an intermediate (k, v) pair with the word w as the key and "1" as the value.

reduce() receives the intermediate kv pairs as a word (key) and an iterator over its counts (values), sums them up (result += ParseInt(v)), and then emits the output, which is the number of occurrences of that particular word.
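The pseudocode above can be mirrored as a small single-process Python sketch: map, an in-memory shuffle, then reduce. Names like run_wordcount are illustrative, not part of the MR library:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: an iterator of counts
    return sum(values)

def run_wordcount(documents):
    # Shuffle: group intermediate pairs by key, as the MR library would.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)
    return {word: reduce_fn(word, iter(counts))
            for word, counts in intermediate.items()}

docs = {"doc1": "the quick fox", "doc2": "the lazy dog the"}
print(run_wordcount(docs))  # {'the': 3, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In the real system the shuffle happens across machines and the reduce input arrives as an iterator over data on disk, but the data flow is the same.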

Apart from these two functions, the user fills in a MapReduce specification object with the names of the input and output files and optional tuning parameters.

This is a deliberately simple example; more interesting uses include building a reverse web-link graph and distributed sort, which relies on the library's ordering guarantees.
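The reverse web-link graph fits the same shape. A hedged sketch (not the paper's code, and again using a local in-memory shuffle): map emits a (target, source) pair for every link found in a page, and reduce collects the list of all sources that link to each target.

```python
from collections import defaultdict

def map_links(source, targets):
    # source: URL of a page; targets: URLs that page links to
    for target in targets:
        yield (target, source)

def reduce_links(target, sources):
    # Output the list of all pages linking to this target.
    return sorted(sources)

pages = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"]}
intermediate = defaultdict(list)
for src, tgts in pages.items():
    for tgt, s in map_links(src, tgts):
        intermediate[tgt].append(s)
result = {t: reduce_links(t, srcs) for t, srcs in intermediate.items()}
# result == {"b.com": ["a.com"], "c.com": ["a.com", "b.com"]}
```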

Implementation

This programming model was mainly targeted at clusters of thousands of commodity PCs (non-specialized, general-purpose hardware). The implementation of the model can vary with the underlying hardware.