Friday, June 1, 2012

Hadoop - Removing Empty Output Files

Sometimes you have a MapReduce job that runs hundreds or thousands of mappers or reducers, and a number of those tasks don't output any K/V pairs. Unfortunately Hadoop still creates the part-m-xxxxx or part-r-xxxxx files for them, even though the files are empty.

With a little code, however, you can extend most FileOutputFormat sub-classes so that these empty files are never committed to HDFS when the tasks complete. You need to override two methods of the OutputFormat:

  • getRecordWriter - Wrap the returned writer so it records whether anything was actually output via the Context.write(K, V) method
  • getOutputCommitter - Extend the OutputCommitter to override the needsTaskCommit(TaskAttemptContext) method - this is what the framework consults when a task finishes to decide whether its temporary output should be promoted to the final output directory

Here's an example for extending SequenceFileOutputFormat:
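The sketch below illustrates the idea using the new mapreduce API; the class name NonEmptySequenceFileOutputFormat and the recordWritten field are just illustrative names, not anything built into Hadoop:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class NonEmptySequenceFileOutputFormat<K, V>
            extends SequenceFileOutputFormat<K, V> {

        // set to true the first time the wrapped writer receives a K/V pair
        private boolean recordWritten = false;

        @Override
        public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
                throws IOException, InterruptedException {
            // wrap the real writer so we can track whether anything is ever written
            final RecordWriter<K, V> delegate = super.getRecordWriter(context);
            return new RecordWriter<K, V>() {
                @Override
                public void write(K key, V value)
                        throws IOException, InterruptedException {
                    recordWritten = true;
                    delegate.write(key, value);
                }

                @Override
                public void close(TaskAttemptContext ctx)
                        throws IOException, InterruptedException {
                    delegate.close(ctx);
                }
            };
        }

        @Override
        public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext context)
                throws IOException {
            Path output = getOutputPath(context);
            // only promote the task's temporary output if at least one record was written
            return new FileOutputCommitter(output, context) {
                @Override
                public boolean needsTaskCommit(TaskAttemptContext ctx) throws IOException {
                    return recordWritten && super.needsTaskCommit(ctx);
                }
            };
        }
    }

You'd then select it in the driver with job.setOutputFormatClass(NonEmptySequenceFileOutputFormat.class). Note this relies on the framework calling getRecordWriter and getOutputCommitter on the same OutputFormat instance within a task, which, as far as I can tell, is what the new API map and reduce tasks do.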

9 comments:

  1. What version of Hadoop does this work for?

    I am running Amazon EMR, which I think is close to Hadoop 1.0.3, and I believe SequenceFileOutputFormat has no getOutputCommitter.

  2. This works for the new mapreduce API - I think Amazon EMR is based upon the old mapred API, and hence doesn't have the method you mention.

  3. Good info. Does it work for a map-only job which uses MultipleOutputs to write multiple outputs? Speculative execution is on, by the way.

  4. Can we use LazyOutputFormat for the same purpose as well?

  5. @neoAnanden - yes, LazyOutputFormat achieves the same result (albeit via a different mechanism) - see the short driver snippet below these comments. I haven't tested with MultipleOutputs, but there's no reason it shouldn't work; this is the default behaviour for MultipleOutputs anyway - the files are only created upon the first write call for the named output.

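For anyone wanting the LazyOutputFormat route mentioned above, the driver-side wiring looks roughly like this (the surrounding driver class, job name and output path are illustrative; the last setOutputFormatClass call is the relevant part):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class LazyOutputDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "lazy-output-example");
            // ... mapper / reducer / input setup omitted ...
            FileOutputFormat.setOutputPath(job, new Path(args[0]));

            // LazyOutputFormat wraps the real output format and only creates the
            // underlying part file when the first K/V pair is written, so tasks
            // that emit nothing leave no empty file behind
            LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }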