Friday, June 1, 2012

Hadoop - Removing Empty Output Files

Sometimes you have a Map Reduce job that runs 100/1000's of mappers or reducers where a number of tasks don't output any K/V pairs. Unfortunately Hadoop still creates the part-m-xxxxx or part-r-xxxxx files, but the contents of those files are empty.

With a little code however, you can extend most FileOutputFormat sub-classes to avoid committing these files to HDFS when the tasks complete. You need to override two methods from the OutputFormat:

  • getRecordWriter - Wrap the writer to track that something was output via the Context.write(K, V) method
  • getOutputCommitter - Extend the OutputCommitter to override the needsTaskCommit(Context) method

Here's an example for extending SequenceFileOutputFormat:

10 comments:

  1. What version of Hadoop does this work for?

    I am running Amazon EMR, which I think is close to Hadoop 1.0.3, and I believe SequenceFileOutputFormat has no getOutputCommitter.

    ReplyDelete
  2. This works for the new mapreduce API - i think Amazon EMR is based upon the old mapred API, and hence doesn't have the method you mention.

    ReplyDelete
  3. The Information which you provided is very much useful for Hadoop Online Training Learners Thank You for Sharing Valuable Information

    ReplyDelete
  4. Good Info. Does it work for a map only job which uses MultiOutputs to write multiple outputs?. Speculative execution is ON by the way.

    ReplyDelete
  5. can we use LazyOutputFormat also for this same purpose?

    ReplyDelete
  6. @neoAnanden - yes LazyOutputFormat achieves the same results (albeit via a different method). I haven't tested with MultipleOutputs, but there's no reason it shouldn't work - but this is the default behaviour for MultipleOutputs anyway - the files are only created upon the first write call for the named output.

    ReplyDelete
  7. Really good piece of knowledge, I had come back to understand regarding your website from my friend Sumit, Hyderabad And it is very useful for who is looking for HADOOP.

    ReplyDelete