Friday, June 1, 2012

Hadoop - Removing Empty Output Files

Sometimes you have a MapReduce job that runs hundreds or thousands of mappers or reducers, and a number of those tasks don't output any K/V pairs at all. Unfortunately Hadoop still creates a part-m-xxxxx or part-r-xxxxx file for each of them, and those files are empty.

With a little code, however, you can extend most FileOutputFormat subclasses to avoid committing these empty files to HDFS when the tasks complete. You need to override two methods on the OutputFormat:

  • getRecordWriter - Wrap the returned writer to track whether anything was output via the Context.write(K, V) method
  • getOutputCommitter - Extend the OutputCommitter to override its needsTaskCommit(TaskAttemptContext) method

Here's an example for extending SequenceFileOutputFormat:
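The following is a minimal sketch of the two overrides described above, using the new org.apache.hadoop.mapreduce API. The class name NonEmptySequenceFileOutputFormat and the recordWritten flag are my own naming choices, not from any Hadoop API:

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

/**
 * A SequenceFileOutputFormat that skips committing the part file for
 * any task that never wrote a single K/V pair.
 */
public class NonEmptySequenceFileOutputFormat<K, V>
        extends SequenceFileOutputFormat<K, V> {

    // Flipped to true the first time the wrapped writer sees a record.
    private boolean recordWritten = false;

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        final RecordWriter<K, V> delegate = super.getRecordWriter(context);
        // Wrap the real writer so we can observe whether anything was written.
        return new RecordWriter<K, V>() {
            @Override
            public void write(K key, V value)
                    throws IOException, InterruptedException {
                recordWritten = true;
                delegate.write(key, value);
            }

            @Override
            public void close(TaskAttemptContext ctx)
                    throws IOException, InterruptedException {
                delegate.close(ctx);
            }
        };
    }

    @Override
    public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException {
        // Delegate to the usual FileOutputCommitter, but only commit the
        // task's output directory if at least one record was written.
        return new FileOutputCommitter(getOutputPath(context), context) {
            @Override
            public boolean needsTaskCommit(TaskAttemptContext ctx)
                    throws IOException {
                return recordWritten && super.needsTaskCommit(ctx);
            }
        };
    }
}
```

In the driver you would then use it in place of the stock format, e.g. job.setOutputFormatClass(NonEmptySequenceFileOutputFormat.class).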

17 comments:

  1. What version of Hadoop does this work for?

    I am running Amazon EMR, which I think is close to Hadoop 1.0.3, and I believe SequenceFileOutputFormat has no getOutputCommitter.

  2. This works for the new mapreduce API - I think Amazon EMR is based upon the old mapred API, and hence doesn't have the method you mention.

  5. Good info. Does it work for a map-only job which uses MultipleOutputs to write multiple outputs? Speculative execution is ON, by the way.

  6. Can we use LazyOutputFormat for this same purpose?

  7. @neoAnanden - yes, LazyOutputFormat achieves the same result (albeit via a different mechanism). I haven't tested with MultipleOutputs, but there's no reason it shouldn't work - this is the default behaviour for MultipleOutputs anyway: the files are only created upon the first write call for the named output.

    ReplyDelete
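For reference, the LazyOutputFormat approach mentioned in the comment above amounts to one line in the job driver. A sketch of a driver fragment (the job name and the conf variable are assumptions, not from the original post):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "skip-empty-outputs");
// Defer creating each part file until the first write call,
// so tasks that emit nothing produce no file at all.
LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);
```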