There are a number of improvements that can be made, and in this post I outline a map-aggregation library I've put together (available on GitHub) to address these problems.
Limited domain keys
Given a body of text, you can expect the vast majority of words to fall within a small set, or domain, of words. Words such as 'a', 'the' and 'and' are used over and over again, and your mapper can be designed to account for this using local map-aggregation.
Rather than outputting a pair for each word observed, you can maintain a local map of words to counts and update the IntWritable value as each word is discovered. You can easily store some 60,000 word-frequency pairs in an in-memory map and, at the end of your mapper, dump the contents of the map to the output context. This will greatly reduce the volume of data sent between the mappers and reducers. Of course, you could also implement this using a Combiner, but you would still be hitting the local disk and paying the cost of a map-side sort of this data.
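To make the idea concrete, here is a minimal sketch of local map-aggregation in plain Java, with no Hadoop dependency. It is not the library's code: the class names LocalAggregationSketch, AggregatingMapper and Output are hypothetical, Output stands in for the mapper's output context, and plain Integer counts stand in for IntWritable values.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalAggregationSketch {

    /** Hypothetical stand-in for the mapper's output context. */
    static final class Output {
        int writes = 0;                                  // number of pairs emitted
        final Map<String, Integer> received = new HashMap<>();

        void write(String word, int count) {
            writes++;
            // Sum counts per word, as a reducer eventually would.
            received.merge(word, count, Integer::sum);
        }
    }

    /** Hypothetical mapper that aggregates counts in memory before emitting. */
    static final class AggregatingMapper {
        private final Map<String, Integer> counts = new HashMap<>();

        /** Called once per input line: update the local map, emit nothing yet. */
        void map(String line) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }

        /** Called once at the end of the mapper: dump the local map to the output. */
        void cleanup(Output out) {
            counts.forEach(out::write);
            counts.clear();
        }
    }

    public static void main(String[] args) {
        AggregatingMapper mapper = new AggregatingMapper();
        Output out = new Output();
        mapper.map("the cat and the hat");
        mapper.map("the end");
        mapper.cleanup(out);
        // Seven input tokens collapse into one pair per distinct word.
        System.out.println("pairs emitted: " + out.writes);
        System.out.println("count for 'the': " + out.received.get("the"));
    }
}
```

In a real mapper the same pattern would sit in the map and cleanup methods, with the in-memory map flushed to the context at the end of the task. A production version would probably also cap the size of the map and flush early if the key domain turns out to be larger than expected.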
map-aggregation-lib - amended word count example
Amending the word count mapper to use the map-aggregation-lib is easy: using an instance of the IntWritableAccumulateOutputAggregator class, we can aggregate the word counts in memory rather than delegating all the work to the combiner.
Hi,
Thanks for your info. I am unable to find the IntWritableAccumulateOutputAggregator lib; please let me know how I can get the details.
It's on github (the link in the page doesn't appear to render correctly, but if you hover over the word github you should see it): https://github.com/chriswhite199/map-aggregation-lib