With a little code, however, you can extend most FileOutputFormat sub-classes to avoid committing these files to HDFS when the tasks complete. You need to override two methods from the OutputFormat:

- getRecordWriter - wrap the returned writer to track that something was output via the Context.write(K, V) method
- getOutputCommitter - extend the OutputCommitter to override the needsTaskCommit(Context) method
Here's an example for extending SequenceFileOutputFormat:
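(The original gist isn't reproduced here, so below is a minimal sketch of the idea against the new mapreduce API. The class name NonEmptySequenceFileOutputFormat, the recordWritten flag, and the static-field workaround are illustrative choices, not necessarily how the original example was written.)

import java.io.IOException;

import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Sketch only: a SequenceFileOutputFormat that only commits its task
// output file if at least one record was actually written.
public class NonEmptySequenceFileOutputFormat<K, V>
        extends SequenceFileOutputFormat<K, V> {

    // Static because Hadoop may instantiate the OutputFormat more than
    // once in the task JVM (once for the writer, once for the committer),
    // so a plain instance field might not be visible to both. This
    // assumes one task attempt per JVM (i.e. no JVM reuse).
    private static volatile boolean recordWritten = false;

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // wrap the real writer so we can observe write calls
        final RecordWriter<K, V> writer = super.getRecordWriter(context);
        return new RecordWriter<K, V>() {
            @Override
            public void write(K key, V value)
                    throws IOException, InterruptedException {
                recordWritten = true;
                writer.write(key, value);
            }

            @Override
            public void close(TaskAttemptContext context)
                    throws IOException, InterruptedException {
                writer.close(context);
            }
        };
    }

    @Override
    public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException {
        // standard FileOutputCommitter, except the task output is only
        // promoted to the job output directory if something was written
        return new FileOutputCommitter(
                FileOutputFormat.getOutputPath(context), context) {
            @Override
            public boolean needsTaskCommit(TaskAttemptContext context)
                    throws IOException {
                return recordWritten && super.needsTaskCommit(context);
            }
        };
    }
}

In the driver you would then use job.setOutputFormatClass(NonEmptySequenceFileOutputFormat.class) in place of the plain SequenceFileOutputFormat.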
What version of Hadoop does this work for? I am running Amazon EMR, which I think is close to Hadoop 1.0.3, and I believe SequenceFileOutputFormat has no getOutputCommitter.

This works for the new mapreduce API - I think Amazon EMR is based upon the old mapred API, and hence doesn't have the method you mention.
Good info. Does it work for a map-only job which uses MultipleOutputs to write multiple outputs? Speculative execution is ON, by the way.
Can we use LazyOutputFormat for this same purpose?
@neoAnanden - yes, LazyOutputFormat achieves the same results (albeit via a different method). I haven't tested with MultipleOutputs, but there's no reason it shouldn't work - this is the default behaviour for MultipleOutputs anyway: the files are only created upon the first write call for the named output.
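(For reference, a minimal sketch of the LazyOutputFormat wiring in a driver, assuming a Hadoop 2.x-style Job; the class and job names here are illustrative.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class LazyOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lazy-output-example");
        // instead of job.setOutputFormatClass(SequenceFileOutputFormat.class):
        // LazyOutputFormat defers creating each part file until the first write,
        // so tasks that emit nothing leave no empty files behind
        LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);
        // ... remaining job setup (mapper, reducer, input/output paths) as usual
    }
}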