Friday, June 1, 2012

Hadoop - Removing Empty Output Files

Sometimes you have a MapReduce job that runs hundreds or thousands of mappers or reducers, and a number of those tasks don't output any K/V pairs. Unfortunately Hadoop still creates the part-m-xxxxx or part-r-xxxxx files for them, even though the files are empty.

With a little code, however, you can extend most FileOutputFormat sub-classes so that these empty files are never committed to HDFS when the tasks complete. You need to override two methods of the OutputFormat:

  • getRecordWriter - Wrap the returned writer so it records whether anything was actually output via the Context.write(K, V) method
  • getOutputCommitter - Extend the OutputCommitter to override the needsTaskCommit(TaskAttemptContext) method - this is what the framework consults when a task finishes to decide whether its temporary output should be promoted to the final output directory

Here's an example for extending SequenceFileOutputFormat:
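The sketch below illustrates the idea using the new mapreduce API; the class name NonEmptySequenceFileOutputFormat and the recordWritten field are just illustrative names, not anything built into Hadoop:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class NonEmptySequenceFileOutputFormat<K, V>
            extends SequenceFileOutputFormat<K, V> {

        // set to true the first time the wrapped writer receives a K/V pair
        private boolean recordWritten = false;

        @Override
        public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
                throws IOException, InterruptedException {
            // wrap the real writer so we can track whether anything is ever written
            final RecordWriter<K, V> delegate = super.getRecordWriter(context);
            return new RecordWriter<K, V>() {
                @Override
                public void write(K key, V value)
                        throws IOException, InterruptedException {
                    recordWritten = true;
                    delegate.write(key, value);
                }

                @Override
                public void close(TaskAttemptContext ctx)
                        throws IOException, InterruptedException {
                    delegate.close(ctx);
                }
            };
        }

        @Override
        public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext context)
                throws IOException {
            Path output = getOutputPath(context);
            // only promote the task's temporary output if at least one record was written
            return new FileOutputCommitter(output, context) {
                @Override
                public boolean needsTaskCommit(TaskAttemptContext ctx) throws IOException {
                    return recordWritten && super.needsTaskCommit(ctx);
                }
            };
        }
    }

You'd then select it in the driver with job.setOutputFormatClass(NonEmptySequenceFileOutputFormat.class). Note this relies on the framework calling getRecordWriter and getOutputCommitter on the same OutputFormat instance within a task, which, as far as I can tell, is what the new API map and reduce tasks do.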

9 comments:

  1. What version of Hadoop does this work for?

    I am running Amazon EMR, which I think is close to Hadoop 1.0.3, and I believe SequenceFileOutputFormat has no getOutputCommitter.

  2. This works for the new mapreduce API - I think Amazon EMR is based upon the old mapred API, and hence doesn't have the method you mention.

  3. Good info. Does it work for a map-only job which uses MultipleOutputs to write multiple outputs? Speculative execution is on, by the way.

  4. Can we use LazyOutputFormat for the same purpose as well?

  5. @neoAnanden - yes, LazyOutputFormat achieves the same result (albeit via a different mechanism) - see the short driver snippet below these comments. I haven't tested with MultipleOutputs, but there's no reason it shouldn't work; this is the default behaviour for MultipleOutputs anyway - the files are only created upon the first write call for the named output.

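For anyone wanting the LazyOutputFormat route mentioned above, the driver-side wiring looks roughly like this (the surrounding driver class, job name and output path are illustrative; the last setOutputFormatClass call is the relevant part):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class LazyOutputDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "lazy-output-example");
            // ... mapper / reducer / input setup omitted ...
            FileOutputFormat.setOutputPath(job, new Path(args[0]));

            // LazyOutputFormat wraps the real output format and only creates the
            // underlying part file when the first K/V pair is written, so tasks
            // that emit nothing leave no empty file behind
            LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }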