Using the output of one MapReduce program as input to another MapReduce program - KeyValueTextInputFormat

In the WordCount (HelloWorld) MapReduce program post, I blogged about how to create a MapReduce program that takes a text file as input and generates output that tells you the frequency of each word in the input file. I wanted to take that a step further by reading the output generated by the first MapReduce program and figuring out which word is used most frequently and how many times it is used. So I developed this HadoopWordCountProcessor program to do that.
  1. First, take a look at the output generated by the HadoopWordCount program. In that program I used TextOutputFormat as the output format class. This class writes one key-value pair per line, with the key and the value separated by a tab character, so the output looks like this:
       XXX	3
       YYY	3
       ZZZ	3
       aaa	10
       bbb	5
       ccc	5
       ddd	5
       eee	5
       fff	5
       ggg	5
       hhh	5
       iii	5
  2. Next, create a Mapper class. This class receives Text as both the key (the word) and the value (its frequency). The only thing I am doing here is converting the Text value into an IntWritable and then writing the pair to the output.
  3. The Reducer class is where I receive every word as the key and its frequency as the value. In this class I keep track of the word with the highest frequency. You have to copy the key and value of the most frequent word into local variables for this to work, because Hadoop reuses the key and value objects it passes to the reducer.
  4. The last class to create is the Driver. Note one thing about it: I call job.setInputFormatClass(KeyValueTextInputFormat.class); to set KeyValueTextInputFormat as the input format class. Once I do that, Hadoop takes care of reading the input, splitting each line into a key and a value at the tab character, and passing them to my Mapper class.
  5. The next step is to run the Driver class with the output of the first MapReduce program as input, by passing a couple of arguments like this:
       file:///Users/gpzpati/hadoop/output/wordcount file:///Users/gpzpati/hadoop/output/wordcount2
     It generates output like this, which says that aaa is the most frequently used word and that it appeared 10 times:
       aaa	10
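
The logic of the steps above can be tried without a Hadoop cluster. What follows is only a rough plain-Java sketch, not the actual HadoopWordCountProcessor code: the class name WordCountProcessorSketch and the methods splitKeyValue, mapRecord and mostFrequent are hypothetical names made up for illustration. It assumes each input line is a word and a count separated by a tab, which is how KeyValueTextInputFormat splits a line by default.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free illustration of the mapper and reducer logic.
class WordCountProcessorSketch {

    // Mimics KeyValueTextInputFormat: the text before the first tab is the
    // key, the text after it is the value (both are still text here).
    static Map.Entry<String, String> splitKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No separator: the whole line is the key, the value is empty.
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    // Mimics the mapper step: keep the word, parse the text count into an int.
    static Map.Entry<String, Integer> mapRecord(String word, String count) {
        return new SimpleEntry<>(word, Integer.parseInt(count.trim()));
    }

    // Mimics the reducer step: remember the highest count seen so far.
    // In a real reducer the key and value would have to be copied, because
    // Hadoop reuses those objects; Java strings and ints need no such copy.
    static Map.Entry<String, Integer> mostFrequent(List<String> lines) {
        String bestWord = null;
        int bestCount = Integer.MIN_VALUE;
        for (String line : lines) {
            Map.Entry<String, String> kv = splitKeyValue(line);
            Map.Entry<String, Integer> rec = mapRecord(kv.getKey(), kv.getValue());
            if (rec.getValue() > bestCount) {
                bestWord = rec.getKey();
                bestCount = rec.getValue();
            }
        }
        return bestWord == null ? null : new SimpleEntry<>(bestWord, bestCount);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("XXX\t3", "aaa\t10", "bbb\t5");
        Map.Entry<String, Integer> top = mostFrequent(lines);
        System.out.println(top.getKey() + "\t" + top.getValue());
    }
}
```

Because the comparison is a strict greater-than, this sketch keeps the first word that reaches the highest count and ignores later words with the same count.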


Anonymous said...

And what if there are joint contenders for the top position? E.g. fox 30 times and dog 30 times?
