MapReduce program that reads input files from S3 and writes output to S3

In the WordCount(HelloWorld) MapReduce program entry I talked about how to create a simple WordCount MapReduce program with Hadoop. I wanted to change it so that it reads input files from an Amazon S3 bucket and writes output back to an Amazon S3 bucket, so I built the S3MapReduce program, which you can download from here. I followed these steps
  1. First create 2 buckets, one for storing input and the other for storing output, in your Amazon S3 account. The most important thing here is to make sure that you create your buckets in the US Standard region; if you don't, additional steps might be required for Hadoop to be able to access your buckets. The name of the input bucket in my case is com.spnotes.hadoop.wordcount.books
    The name of the output bucket is com.spnotes.hadoop.wordcount.output
  2. Upload a few .txt files that you want to use as input into your input bucket like this
  3. The next step is to create the MapReduce program like this. In my case one Java class has the code for the Mapper, Reducer and driver. Most of the code in the MapReduce program is the same; the only difference for working with S3 is that you have to add a few S3-specific properties like this. Basically you need to set your accessKey and secretAccessKey, which you can get from the AWS Security console, and paste them here. You will also have to tell Hadoop to use s3n as the file system (see the driver sketch after this list).
    
    //Replace this value
    job.getConfiguration().set("fs.s3n.awsAccessKeyId", "awsaccesskey");
    //Replace this value
    job.getConfiguration().set("fs.s3n.awsSecretAccessKey","awssecretaccesskey");
    job.getConfiguration().set("fs.default.name","s3n://com.spnotes.hadoop.input.books");
    
  4. The last step is to execute this program. It takes 2 inputs; you can just right-click on your S3MapReduce program and run it with the following 2 parameters
    
    s3n://com.spnotes.hadoop.wordcount.books s3n://com.spnotes.hadoop.wordcount.output/output3
    
  5. Once the MapReduce job has executed you can check the output by going to the S3 console and looking at the content of com.spnotes.hadoop.wordcount.output like this
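For reference, here is a minimal sketch of how the S3 properties and the s3n:// paths might fit together in the driver's main() method. The class name and job wiring are illustrative (the Mapper and Reducer are the same as in the plain WordCount program), so the downloadable S3MapReduce project may differ in details.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3MapReduce {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "s3wordcount");
        job.setJarByClass(S3MapReduce.class);
        // Replace these two values with the keys from your AWS Security console
        job.getConfiguration().set("fs.s3n.awsAccessKeyId", "awsaccesskey");
        job.getConfiguration().set("fs.s3n.awsSecretAccessKey", "awssecretaccesskey");
        // Tell Hadoop to use s3n, backed by the input bucket, as the default file system
        job.getConfiguration().set("fs.default.name", "s3n://com.spnotes.hadoop.wordcount.books");
        // Set the Mapper, Reducer and output key/value classes here,
        // exactly as in the plain WordCount driver
        // The 2 program arguments are the s3n:// input and output locations
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}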

Hadoop MapReduce HTTP Notification

Normally MapReduce programs tend to run for a long time, and you might want to set up a way to get a notification when the job finishes so you can find out whether it executed successfully or not. Hadoop provides a mechanism through which you can get a notification about the progress of your MapReduce job. This is a two step process
  1. First set the job.end.notification.url property to the URL of a web application that should get invoked with the progress of the job: job.getConfiguration().set("job.end.notification.url", "http://localhost:8080/mrnotification/MRNotification/$jobId?status=$jobStatus");
  2. Next create a web application that will receive the notification and print out the job id and job status like this
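A minimal sketch of such a notification receiver might look like the servlet below. The servlet class name and the /MRNotification/* mapping are assumptions for illustration; the mapping just has to match the path used in the job.end.notification.url property.

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/MRNotification/*")
public class MRNotificationServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Hadoop substitutes $jobId into the URL path and $jobStatus into the query string
        String jobId = request.getPathInfo();
        String jobStatus = request.getParameter("status");
        System.out.println("Job " + jobId + " finished with status " + jobStatus);
        response.setStatus(HttpServletResponse.SC_OK);
    }
}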

Using output of the MapReduce program as input in another MapReduce program - KeyValueTextInputFormat

In the WordCount(HelloWorld) MapReduce program entry I blogged about how to create a MapReduce program that takes a text file as input and generates output that tells you the frequency of each word in the input file. I wanted to take that a step further by reading the output generated by the first MapReduce job and figuring out which word is used most frequently and how many times that word is used. So I developed this HadoopWordCountProcessor program to do that.
  1. First take a look at the output generated by the HadoopWordCount program. In the HadoopWordCount program I used TextOutputFormat as the output format class; this class generates output with one key value pair on every line, separated by a tab character, like this

    XXX     3
    YYY     3
    ZZZ     3
    aaa     10
    bbb     5
    ccc     5
    ddd     5
    eee     5
    fff     5
    ggg     5
    hhh     5
    iii     5

  2. Next create a WordCountProcessorMapper.java program like this. This class receives Text as both key and value; the only thing I am doing here is converting the Text value (the count) into an IntWritable and then writing it to the output (see the sketch after this list).
  3. The reducer class is the place where I am getting all the words as keys and their frequencies as values. In this class I am keeping track of the highest frequency word (you will have to copy the key and value of the highest occurring word into local variables for it to work, because Hadoop reuses the key and value objects sent to the reducer)
  4. The next step is to create a Driver class. Note one thing about the Driver class: I am calling job.setInputFormatClass(KeyValueTextInputFormat.class); to set KeyValueTextInputFormat as the input format class. Once I do that, Hadoop takes care of reading the input, breaking it into key and value, and passing them to my Mapper class
  5. The last step is to execute the WordCountProcessor.java class with the output of the first MapReduce program as input, by passing a couple of arguments like this: file:///Users/gpzpati/hadoop/output/wordcount file:///Users/gpzpati/hadoop/output/wordcount2 It will generate output like this, which says aaa is the most frequently used word and it appeared 10 times: aaa 10
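Here is a minimal sketch of the processor Mapper and Reducer described in steps 2 and 3 (shown together for brevity; normally they would live in separate files). The class names follow the post, but the code in the downloadable HadoopWordCountProcessor project may differ.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountProcessorMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // KeyValueTextInputFormat already split each line on the tab character,
        // so the word arrives as the key and its count (still text) as the value
        context.write(key, new IntWritable(Integer.parseInt(value.toString().trim())));
    }
}

class WordCountProcessorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final Text topWord = new Text();
    private int topCount = -1;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        for (IntWritable value : values) {
            if (value.get() > topCount) {
                // Copy the key and value; Hadoop reuses these objects between calls
                topWord.set(key);
                topCount = value.get();
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit only the most frequently used word once all keys have been seen
        context.write(topWord, new IntWritable(topCount));
    }
}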

Use MRUnit for testing your MapReduce program

In the WordCount(HelloWorld) MapReduce program entry I blogged about how you can create your WordCount (HelloWorld) MapReduce program. You can use Apache MRUnit, which is a unit testing framework for MapReduce programs. I used MRUnit to develop unit tests for my WordCount program.
  1. First I created a unit test for my WordCountMapper class like this. The basic idea is that you set the input and the expected output on the test driver and then execute the test by calling mapDriver.runTest() (see the sketch after this list).
  2. Then I created a unit test for my WordCountReducer class like this
  3. The last part was to develop an end to end test; in this case you set up both the Mapper and Reducer classes that you want to test and then run them end to end
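A minimal sketch of such a Mapper test might look like this; the WordCountMapper class and its LongWritable/Text input types are assumed to match the WordCount program, so adjust them if your signatures differ. The ReduceDriver and MapReduceDriver classes work the same way for the reducer and end to end tests.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void testMapper() throws Exception {
        // Set the input and the expected output, then let MRUnit run the Mapper and compare
        mapDriver.withInput(new LongWritable(0), new Text("aaa bbb aaa"))
                 .withOutput(new Text("aaa"), new IntWritable(1))
                 .withOutput(new Text("bbb"), new IntWritable(1))
                 .withOutput(new Text("aaa"), new IntWritable(1))
                 .runTest();
    }
}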

Maven script for running Hadoop programs

In the WordCount(HelloWorld) MapReduce program blog I talked about how to create a WordCount MapReduce program. While developing a MapReduce program I follow a pattern in which I first develop it using MRUnit test driven development, then execute it locally using the driver. But the last step for me is always to copy the program to my Cloudera VM and execute it there. I built this Maven script to help me with that last part, which is to scp the built .jar file to the Cloudera Hadoop VM and then execute it. This is the script I use. When I create a new MapReduce program I have to make a couple of changes in it, but I can always reuse most of it
  1. Change the value of scp.host to point to my Hadoop VM; if you changed the username and password on your VM you will have to change those too
  2. Next I have to change the value of the mainClass attribute to point to the correct driver class for the MapReduce program that I am developing. In this case the name of the driver class is com.spnotes.hadoop.WordCountDriver
  3. Then I have to change the value of the command attribute in the sshexec element. The command is made up of different parts
    hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs
    in this command ${scp.dirCopyTo}/${project.build.finalName}.jar points to the .jar file that was scp'd to the VM. books/dickens.txt is the path of the input text file; in this case I am using HDFS as the input location, which points to hdfs://localhost/user/cloudera/books/dickens.txt, and the output of the MapReduce job will get generated in hdfs://localhost/user/cloudera/wordcount/outputs
You can run the mvn antrun:run command to execute the Maven task that deploys the MapReduce jar to the Cloudera VM and executes it. You can download the full project from here
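The relevant part of the script might look roughly like the maven-antrun-plugin sketch below, which uses the Ant scp and sshexec tasks from ant-jsch. The scp.user and scp.password property names and the dependency version are assumptions for illustration; check the downloadable project for the exact script.

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-antrun-plugin</artifactId>
    <configuration>
        <target>
            <!-- Copy the built jar to the Cloudera VM -->
            <scp file="${project.build.directory}/${project.build.finalName}.jar"
                 todir="${scp.user}:${scp.password}@${scp.host}:${scp.dirCopyTo}"
                 trust="true"/>
            <!-- Run the MapReduce job on the VM -->
            <sshexec host="${scp.host}" username="${scp.user}" password="${scp.password}"
                     trust="true"
                     command="hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs"/>
        </target>
    </configuration>
    <dependencies>
        <!-- Provides the scp and sshexec Ant tasks -->
        <dependency>
            <groupId>org.apache.ant</groupId>
            <artifactId>ant-jsch</artifactId>
            <version>1.9.4</version>
        </dependency>
    </dependencies>
</plugin>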

WordCount(HelloWorld) MapReduce program

I am learning about MapReduce, and in order to experiment with it I created this simple program which takes a text file as input and then generates output that shows how frequently each word appeared in the text file. You can download the source code for the program from here
  1. First I started by creating a simple Mapper which receives the content of the text file one line at a time. The Mapper takes care of splitting the content into words; then it writes every word into the output and sets the frequency count for that word to 1 by calling context.write(word,one). In this case the word becomes the key and the count becomes the value (see the sketch at the end of this entry).
  2. Next I had to develop a Reducer class, which receives a word as the key and a list of all the counts as the value. For example, if your input file is simple text like aaa bbb ccc aaa, then the reduce class will get called with aaa - [1, 1], bbb - [1] and ccc - [1] as input. The Hadoop framework takes care of collecting the output of the Mapper and converting it into key - [value, value] format. In the Reducer the only thing I had to do was iterate through all the values and come up with a count. Once I have that I write it as the output of the Reducer by calling context.write(key, new IntWritable(sum));
  3. The next part is creating WordCountDriver.java, which is a Java program that sets up the Hadoop framework by setting up inputs, defining outputs and specifying the names of the Mapper and Reducer classes. After initializing Hadoop it calls job.waitForCompletion(true); this method takes care of passing control to the Hadoop framework and waiting for the job to complete
  4. Now you can either use an existing .txt file on your machine or create a text file like this
    
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa
    XXX YYY ZZZ
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa 
    XXX YYY ZZZ
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa 
    XXX YYY ZZZ
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa 
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa
    
  5. The next step is to run your Hadoop program. If you used Eclipse or some other IDE for developing your code you can run your program directly by running WordCountDriver.java. This program takes 2 parameters; in my case, since the input file is on the local file system and I want the output to get stored on the local file system too, I pass the following 2 parameters
    
    file:///Users/sunil/hadoop/sorttest.txt file:///Users/sunil/hadoop/output/wordcount
    
  6. Once the program has finished successfully you will be able to see a part-r-00000 file created on your local machine at /Users/sunil/hadoop/output/wordcount; if you open it you should see output like this
    
    XXX     3
    YYY     3
    ZZZ     3
    aaa     10
    bbb     5
    ccc     5
    ddd     5
    eee     5
    fff     5
    ggg     5
    hhh     5
    iii     5
    
If you want to run this program with a bigger text file you can download a few good classical books from the Algorithms site data section
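For reference, a minimal sketch of the Mapper, Reducer and Driver described in steps 1 to 3 might look like this; it follows the standard WordCount pattern (with the Mapper and Reducer as nested classes for brevity), so the downloadable source may differ in details.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line into words and emit (word, 1) for each of them
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Add up all the 1s emitted by the Mapper for this word
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The 2 program arguments are the input and output locations
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}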

Using MDC for setting context variables in Log4J

Sometimes you might want to configure your logs so that they add some context-specific attributes to every line. For example, you might want to print the name of the logged-in user in every log statement, so that if you want to see what went wrong for, say, user John, you can find all the log statements for John with grep and analyze the problem. Apache Log4j has the MDC.java class that can be used for this type of use case. The basic idea is that you set a map of parameters on the current thread and it becomes available to all the methods downstream. I wanted to try this feature out, so I used the following steps.
  1. First call the MDC.put("USER","Sunil") method to set the USER context variable at the current thread level
    
    package com.test.mq;
    
    import org.apache.log4j.Logger;
    import org.apache.log4j.MDC;
    public class HelloMDC {
        public static void main(String[] argv){
            Logger logger = Logger.getLogger(HelloMDC.class);
            MDC.put("USER","Sunil");
            logger.debug("Sample debug message");
        }
    }
    
  2. Then you can configure the pattern layout to include the USER variable via %X{USER} at the start of the message
    
    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
    <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
        <appender name="console" class="org.apache.log4j.ConsoleAppender">
            <param name="Target" value="System.out"/>
            <layout class="org.apache.log4j.PatternLayout">
                <param name="ConversionPattern" value="%X{USER} - %-5p %c{1} - %m%n"/>
            </layout>
        </appender>
        <root>
            <priority value ="debug" />
            <appender-ref ref="console" />
        </root>
    </log4j:configuration>
    
Now when the code executes you can see the user name printed at the start of the message, like this: Sunil - DEBUG HelloMDC - Sample debug message

Debugging Cordova/PhoneGap applications running on Android

Being able to debug your code is always a big help. I wanted to debug a PhoneGap/Cordova application, and I used the Google Chrome Developer Tools, which make it really easy to debug web applications running on Android
  • First I started my Android Emulator with my application running on it
  • Once my application was launched, I opened my Chrome browser on the desktop and clicked on Tools -> Inspect Devices like this
  • On the next screen it should display the list of devices available, and here you should see the name of the Android emulator like this; click on the inspect link
  • It opens a new window pointing to your PhoneGap page; you can use that screen for debugging, looking at console messages, etc.

Problems with Android LogCat

When I was working with an Android application, every now and then I ran into a problem where logcat would not work, i.e. if I ran adb logcat it would just sit there saying - waiting for device - like this
The first few times I restarted my machine to get it working. But then I found this solution, in which you restart adb by executing the following 2 commands

adb kill-server
adb start-server
Make sure that you get the daemon started successfully message; if not, try stopping and starting adb again. Once adb has started, start the emulator, and then you should be able to execute adb devices and it should show the name of your emulator.
Now if you run the adb logcat command it should work

Changes required in serveResource to serve special characters

It seems that returning UTF-8 characters from the serveResource method requires some additional steps.

I created this sample portlet which returns some Chinese characters from both the doView() and serveResource() methods. The content returned from doView() is fine, but the content returned by serveResource() comes back as garbled characters


public void doView(RenderRequest request, RenderResponse response)
        throws PortletException, IOException {
    // Set the MIME type for the render response
    response.setContentType("text/html; charset=UTF-8");
    response.getWriter().println("这是世界您好");
}

public void serveResource(ResourceRequest request, ResourceResponse response)
        throws PortletException, IOException {
    response.setContentType("text/html; charset=UTF-8");
    response.getWriter().println("这是世界您好");
}


It seems that in order to return a valid response from the serveResource method we have to set some additional properties on the response

public void serveResource(ResourceRequest request, ResourceResponse response)
        throws PortletException, IOException {
    response.setContentType("text/html; charset=UTF-8");
    response.setCharacterEncoding("UTF-8");
    response.setProperty("Content-Type", "text/html; charset=UTF-8");
    response.getWriter().println("这是世界您好");
}