Thursday, May 24, 2012

Building Hadoop Source Jars

A script to build yourself the source jars for Hadoop core:

- Download a Hadoop release from http://www.apache.org/dyn/closer.cgi/hadoop/common/
- Unpack the tar.gz to a folder (in my case /opt/hadoop/)
- Copy the script below into a file named mk-source.sh (amend the VERSION and HADOOP_SRC variables for your environment)
- Optionally, install the sources into your local Maven repository with the following command:

```
mvn install:install-file \
  -Dfile=/opt/hadoop/hadoop-0.20.2/src/hadoop-core-0.20.2-sources.jar \
  -DgroupId=org.apache.hadoop \
  -DartifactId=hadoop-core \
  -Dversion=0.20.2 \
  -Dpackaging=jar \
  -Dclassifier=sources \
  -DgeneratePom=false
```
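The mk-source.sh script itself doesn't appear on this page, so here is a rough sketch only of the kind of thing it needs to do, assuming the 0.20.2 source layout with core, hdfs and mapred trees under src/; the actual script may differ:

```bash
#!/bin/bash
# Hypothetical sketch of mk-source.sh -- the original script is not shown
# here, so treat this as an illustration of the approach, not the exact code.
VERSION=0.20.2                                # amend for your release
HADOOP_SRC=/opt/hadoop/hadoop-${VERSION}/src  # amend for your unpack location

# hadoop-core-${VERSION}.jar bundles classes from the core, hdfs and mapred
# trees, so package all three source roots into a single -sources jar.
cd "${HADOOP_SRC}"
jar cf "hadoop-core-${VERSION}-sources.jar" \
    -C core . -C hdfs . -C mapred .
```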
Tuesday, May 22, 2012
Hadoop Local Map Aggregation
The much-cited 'word-count' Hadoop map-reduce example has (in my opinion) some fundamentally bad practices with regard to map-reduce processing. In it, the mapper outputs a single <word, 1> pair for each word processed in the input file.
As a simple example on a small dataset this is fine for teaching the principles, but too often I see people apply this method of element 'counting' to huge datasets.
There are a number of improvements that can be made, and in this post I outline a map-aggregation library I've put together (available on github) to address these problems.
Limited domain keys
Given a body of text, you can expect the vast majority of words to fall within a small set, or domain, of words. Words such as 'a', 'the', and 'and' are used over and over again in any body of text, and your mapper can be designed to take advantage of this by using local map aggregation.
Rather than outputting a <word, 1> pair for each word observed, you can maintain a local map of <Text, IntWritable> and update the IntWritable value as each word is discovered. You can easily store some 60,000 word-frequency pairs in an in-memory map, and at the end of your mapper, dump the contents of the map to the output context. This will greatly reduce the volume of data sent between the mappers and reducers. Of course you could also implement this using a Combiner, but then you're still hitting the local disk and paying the cost of a map-side sort of that data.
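To make this concrete, here is a minimal sketch of such a mapper against the stock Hadoop 0.20 mapreduce API; the class and field names are my own, not taken from the library discussed below:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper that aggregates counts in a local in-memory map and
// flushes them once at the end of the task, instead of emitting a
// <word, 1> pair for every word observed.
public class AggregatingWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // Accumulate into the local map rather than writing to the context.
    for (String word : value.toString().split("\\s+")) {
      if (word.isEmpty()) {
        continue;
      }
      Integer current = counts.get(word);
      counts.put(word, current == null ? 1 : current + 1);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Dump the aggregated contents of the map to the output context in one pass.
    Text word = new Text();
    IntWritable count = new IntWritable();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      word.set(entry.getKey());
      count.set(entry.getValue());
      context.write(word, count);
    }
  }
}
```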
map-aggregation-lib - amended word count example
Amending the word-count mapper to use the map-aggregation-lib is easy: using an instance of the IntWritableAccumulateOutputAggregator class, we can aggregate the word counts in memory rather than delegating all the work to the combiner:
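The post's code sample appears to be missing from this page. As a purely illustrative sketch, the shape might look something like the following; the import path, constructor, and the aggregate/flush method names are invented for illustration, so check the library on github for its real API:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// NOTE: purely hypothetical -- the real map-aggregation-lib API may differ.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Hypothetical aggregator accumulating an IntWritable total per Text key.
  private final IntWritableAccumulateOutputAggregator aggregator =
      new IntWritableAccumulateOutputAggregator();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    for (String word : value.toString().split("\\s+")) {
      // Hypothetical call: add 1 to the in-memory total for this word.
      aggregator.aggregate(new Text(word), new IntWritable(1));
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Hypothetical call: write every aggregated <word, total> pair out.
    aggregator.flush(context);
  }
}
```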