Thursday, May 24, 2012

Building Hadoop Source Jars

A script to build the source jars for Hadoop core yourself:
  • Download a hadoop release from http://www.apache.org/dyn/closer.cgi/hadoop/common/
  • Unpack the tar.gz to a folder (in my case /opt/hadoop/)
  • Copy the script below into a file named mk-source.sh (amend the VERSION and HADOOP_SRC variables for your environment)
  • Optionally install the sources into your local maven repository
    mvn install:install-file \
        -Dfile=/opt/hadoop/hadoop-0.20.2/src/hadoop-core-0.20.2-sources.jar \
        -DgroupId=org.apache.hadoop \
        -DartifactId=hadoop-core \
        -Dversion=0.20.2 \
        -Dpackaging=jar \
        -Dclassifier=sources \
        -DgeneratePom=false

Tuesday, May 22, 2012

Hadoop Local Map Aggregation

The much-cited 'word-count' Hadoop map-reduce example has (in my opinion) some fundamental bad practices with regards to map-reduce processing. In it, the mapper outputs a single (word, 1) pair for each word processed in the input file. As a simple example on a small dataset this is fine to teach the principles, but too often I see people apply this method of element 'counting' to huge datasets.

There are a number of improvements that can be made, and in this post I outline a map-aggregation library I've put together (available on github) to address these problems.

Limited domain keys

Given a body of text, you can expect the vast majority of words to be contained within a small set, or domain, of words. Words such as 'a', 'the', 'and' etc. are used over and over again in bodies of text, and your mapper can be designed to account for this by using a concept of local map-aggregation.

Rather than outputting a (word, 1) pair for each word observed, you can maintain a local map of <Text, IntWritable> pairs and update the IntWritable value as each word is discovered. You can easily store some 60,000 word-frequency pairs in an in-memory map, and at the end of your mapper, dump the contents of the map to the output context. This will greatly reduce the volume of data sent between the mappers and reducers. Of course you could also implement this using a Combiner, but you're still hitting the local disk and paying the cost of a map-side sort of that data.
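The saving is easy to see with plain Java collections (a stand-alone sketch, not the library's code): local aggregation turns one output pair per token into one pair per distinct word.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalAggregationDemo {

    // Aggregate tokens into a word -> frequency map: one entry per
    // distinct word, instead of one (word, 1) pair per token.
    public static Map<String, Integer> aggregate(String[] tokens) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String t : tokens) {
            Integer c = counts.get(t);
            counts.put(t, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = "the cat sat on the mat and the dog sat too".split(" ");
        Map<String, Integer> counts = aggregate(tokens);
        System.out.println("pairs emitted without aggregation: " + tokens.length);
        System.out.println("pairs emitted with aggregation: " + counts.size());
        System.out.println("frequency of 'the': " + counts.get("the"));
    }
}
```

On real text the ratio is far more dramatic than in this toy input, since common words recur thousands of times per input split.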

map-aggregation-lib - amended word count example

Amending the word count mapper to use the map-aggregation-lib is easy - using an instance of the IntWritableAccumulateOutputAggregator class, we can aggregate the word counts in memory rather than delegating all the work to the combiner:
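The original listing is not reproduced here, and the exact IntWritableAccumulateOutputAggregator API lives in the library on github; as a sketch of the same technique against the stock Hadoop 0.20 mapper API (a plain HashMap flushed in cleanup(), standing in for the library's aggregator), the amended mapper might look like this:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: aggregates word counts in memory for the whole input
// split, then emits one (word, count) pair per distinct word.
public class AggregatingWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Accumulate locally instead of writing (word, 1) to the context.
        for (String word : value.toString().split("\\s+")) {
            Integer c = counts.get(word);
            counts.put(word, c == null ? 1 : c + 1);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Flush the in-memory map once, at the end of the split.
        Text word = new Text();
        IntWritable count = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            word.set(e.getKey());
            count.set(e.getValue());
            context.write(word, count);
        }
    }
}
```

Note the trade-off: the map must fit in the mapper's heap, so for unbounded key domains you would cap its size and flush early, which is the sort of housekeeping a dedicated aggregator class can take care of.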