Friday, June 1, 2012

Hadoop - Removing Empty Output Files

Sometimes you have a MapReduce job that runs hundreds or thousands of mappers or reducers, where a number of the tasks don't output any K/V pairs. Unfortunately, Hadoop still creates the part-m-xxxxx or part-r-xxxxx files for those tasks, even though the files are empty.

With a little code, however, you can extend most FileOutputFormat subclasses to avoid committing these files to HDFS when their tasks complete. You need to override two methods of the OutputFormat:

  • getRecordWriter - Wrap the writer to track whether anything was output via the Context.write(K, V) method
  • getOutputCommitter - Extend the OutputCommitter to override the needsTaskCommit(TaskAttemptContext) method

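The pattern behind those two overrides can be sketched as follows. This is a minimal, hypothetical illustration using TextOutputFormat (the class name LazyTextOutputFormat and the recordWritten flag are my own, not from the original post): the wrapped RecordWriter flips a flag on the first write, and the committer's needsTaskCommit only returns true if that flag was set, so empty part files are left behind in the task's temporary attempt directory and never promoted to the job output.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical sketch: suppress empty part files for a TextOutputFormat job.
public class LazyTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

  // Set to true the first time this task writes a record.
  private boolean recordWritten = false;

  @Override
  public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
      throws IOException, InterruptedException {
    final RecordWriter<K, V> delegate = super.getRecordWriter(job);
    // Wrap the real writer so we can observe whether anything is emitted.
    return new RecordWriter<K, V>() {
      @Override
      public void write(K key, V value) throws IOException, InterruptedException {
        recordWritten = true;
        delegate.write(key, value);
      }

      @Override
      public void close(TaskAttemptContext context)
          throws IOException, InterruptedException {
        delegate.close(context);
      }
    };
  }

  @Override
  public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException {
    Path output = getOutputPath(context);
    return new FileOutputCommitter(output, context) {
      @Override
      public boolean needsTaskCommit(TaskAttemptContext taskContext)
          throws IOException {
        // Only promote this task's output if at least one record was written.
        return recordWritten && super.needsTaskCommit(taskContext);
      }
    };
  }
}
```

Because needsTaskCommit returns false for a task that wrote nothing, the framework simply skips the commit step for that attempt and the empty file is discarded along with the attempt directory.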
Here's an example for extending SequenceFileOutputFormat: