First, take a look at the output generated by the HadoopWordCount program, which looks like this. In the HadoopWordCount program I used TextOutputFormat as the output format class; this class generates output with one key-value pair per line, separated by a tab character
XXX	3
YYY	3
ZZZ	3
aaa	10
bbb	5
ccc	5
ddd	5
eee	5
fff	5
ggg	5
hhh	5
iii	5
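This tab-separated layout is exactly what KeyValueTextInputFormat knows how to read back: by default it splits each line on the first tab into a key and a value. A minimal plain-Java sketch of that splitting rule (the class and method names here are mine, for illustration only):

```java
// Illustrative sketch of how KeyValueTextInputFormat derives a key and a
// value from each line: split on the first tab; if there is no tab, the
// whole line becomes the key and the value is empty.
public class KeyValueSplitDemo {

    public static String[] splitKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab == -1) {
            // No separator: the entire line is the key, value is empty
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("aaa\t10");
        System.out.println("key=" + kv[0] + " value=" + kv[1]); // key=aaa value=10
    }
}
```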
- First, create a WordCountProcessorMapper.java program like this. This class receives Text as both key and value; the only thing I am doing here is converting the Text value (the word's count) into an IntWritable and then writing it to the output.
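  The mapper step above might look like the sketch below. This is a reconstruction, not the post's actual code: the class name WordCountProcessorMapper comes from the post, but the body and type parameters are my assumptions. It requires Hadoop on the classpath, so it is not standalone-runnable.

  ```java
  // Sketch of the mapper, assuming the org.apache.hadoop.mapreduce API.
  // With KeyValueTextInputFormat, key = the word and value = its count,
  // both arriving as Text.
  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class WordCountProcessorMapper
          extends Mapper<Text, Text, Text, IntWritable> {

      @Override
      protected void map(Text key, Text value, Context context)
              throws IOException, InterruptedException {
          // Convert the textual count into an IntWritable and pass the
          // (word, count) pair on to the reducer
          context.write(key, new IntWritable(Integer.parseInt(value.toString())));
      }
  }
  ```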
- The reducer class is the place where I get all the words as keys and their frequencies as values. In this class I keep track of the highest-frequency word. (You have to copy the key and value of the highest-occurring word into a local variable for this to work, because Hadoop reuses the key and value objects it passes to the reducer.)
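  A sketch of such a reducer is below; the class name WordCountProcessorReducer and the field names are my assumptions, and it needs Hadoop on the classpath to compile. The important detail matches the warning above: maxWord.set(key) copies the key's bytes, whereas storing the key reference itself would break, because Hadoop reuses the same Text object across reduce() calls.

  ```java
  // Sketch of the reducer that tracks the highest-frequency word and
  // emits only that one word in cleanup(). Relies on the job running
  // with a single reducer, so one instance sees every word.
  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCountProcessorReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

      private final Text maxWord = new Text(); // local copy of the best key so far
      private int maxCount = Integer.MIN_VALUE;

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          for (IntWritable value : values) {
              if (value.get() > maxCount) {
                  maxCount = value.get();
                  maxWord.set(key); // copy the bytes, don't keep the reused object
              }
          }
      }

      @Override
      protected void cleanup(Context context)
              throws IOException, InterruptedException {
          // Emit the single highest-frequency word once all input is consumed
          context.write(maxWord, new IntWritable(maxCount));
      }
  }
  ```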
The last step is to create a Driver class. Note one thing about the Driver class: I am setting job.setInputFormatClass(KeyValueTextInputFormat.class);, i.e. KeyValueTextInputFormat as the input format class. Once I do that, Hadoop takes care of reading the input, breaking each line into a key and a value, and passing them to my Mapper class
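The driver might be wired up as in the sketch below. The class name WordCountProcessor and the KeyValueTextInputFormat setting come from the post; the reducer class name, the job name, and the single-reducer setting are my assumptions (one reducer is needed so a single instance sees every word when picking the maximum). This requires Hadoop on the classpath and a cluster or local runtime to execute.

```java
// Sketch of the driver class, assuming the org.apache.hadoop.mapreduce
// (new) API. Input and output paths come from the command-line arguments.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountProcessor {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcountprocessor");
        job.setJarByClass(WordCountProcessor.class);

        // Read the first job's output: each line is split on the first tab
        // into a Text key and Text value before reaching the mapper
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        job.setMapperClass(WordCountProcessorMapper.class);
        job.setReducerClass(WordCountProcessorReducer.class);
        job.setNumReduceTasks(1); // one reducer must see every word

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```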
The next step is to execute the WordCountProcessor.java class with the output of the first MapReduce program as input, by passing a couple of arguments like this:
file:///Users/gpzpati/hadoop/output/wordcount file:///Users/gpzpati/hadoop/output/wordcount2
It will generate output which says that aaa is the most frequently used word and that it appeared 10 times
Using the output of one MapReduce program as input to another MapReduce program - KeyValueTextInputFormat
In the WordCount (HelloWorld) MapReduce program I blogged about how to create a MapReduce program that takes a text file as input and generates output telling you the frequency of each word in the input file. I wanted to take that a step further by reading the output generated by the first MapReduce job and figuring out which word is used most frequently and how many times that word is used. So I developed this HadoopWordCountProcessor program to do that.