Maven script for running Hadoop programs

In the WordCount (HelloWorld) MapReduce program post, I talked about how to create a WordCount MapReduce program. When developing a MapReduce program I follow a pattern: first I develop it test-first using MRUnit, then I execute it locally using a driver. The last step for me is always to copy the program to my Cloudera VM and execute it there. I built this Maven script to help with that last part: it copies the deployment .jar file to the Cloudera Hadoop VM with scp and then executes it. This is the script I use. When I create a new MapReduce program I have to make a couple of changes to it, but I can always reuse most of it:
  1. Change the value of scp.host to point to my Hadoop VM. If you changed the username and password on your VM, you will have to change those too.
  2. Next I change the value of the mainClass attribute to point to the correct driver class for the MapReduce program I am developing. In this case the driver class is com.spnotes.hadoop.WordCountDriver.
  3. Then I change the value of the command attribute in the sshexec element. The command is made up of several parts:
    hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs
    Here ${scp.dirCopyTo}/${project.build.finalName}.jar points to the .jar file that is copied to the VM with scp. books/dickens.txt is the path of the input text file. Because I am using HDFS as the input location, it resolves to hdfs://localhost/user/cloudera/books/dickens.txt, and the output of the MapReduce job is generated in hdfs://localhost/user/cloudera/wordcount/outputs.
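As a rough illustration of the three changes above, here is a minimal sketch of what the relevant parts of such a pom.xml could look like. It is not necessarily the exact script from this post. The property names scp.user and scp.password, the placeholder host name, the default cloudera/cloudera credentials, the target directory, and all version numbers are assumptions, and placing mainClass in the maven-jar-plugin manifest is one common way to do it, assumed here because the hadoop jar command does not name a class.

```xml
<!-- Sketch only: other pom.xml elements (groupId, dependencies, ...) are omitted. -->
<project>
  <properties>
    <!-- Step 1: host and credentials of the Hadoop VM (placeholder values) -->
    <scp.host>my-cloudera-vm</scp.host>
    <scp.user>cloudera</scp.user>          <!-- assumed property name -->
    <scp.password>cloudera</scp.password>  <!-- assumed property name -->
    <scp.dirCopyTo>/home/cloudera</scp.dirCopyTo>
  </properties>

  <build>
    <plugins>
      <!-- Step 2: record the driver class in the jar manifest (assumed placement) -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.spnotes.hadoop.WordCountDriver</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>

      <!-- Step 3: copy the jar to the VM, then run it there -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-antrun-plugin</artifactId>
        <version>3.1.0</version> <!-- assumed version -->
        <configuration>
          <target>
            <!-- trust="true" accepts the VM's host key without a known_hosts entry -->
            <scp file="${project.build.directory}/${project.build.finalName}.jar"
                 todir="${scp.user}:${scp.password}@${scp.host}:${scp.dirCopyTo}"
                 trust="true"/>
            <sshexec host="${scp.host}"
                     username="${scp.user}"
                     password="${scp.password}"
                     trust="true"
                     command="hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs"/>
          </target>
        </configuration>
        <dependencies>
          <!-- The Ant scp and sshexec tasks need ant-jsch and JSch (assumed versions) -->
          <dependency>
            <groupId>org.apache.ant</groupId>
            <artifactId>ant-jsch</artifactId>
            <version>1.10.12</version>
          </dependency>
          <dependency>
            <groupId>com.jcraft</groupId>
            <artifactId>jsch</artifactId>
            <version>0.1.55</version>
          </dependency>
        </dependencies>
      </plugin>
    </plugins>
  </build>
</project>
```

In this sketch the antrun configuration sits at the plugin level rather than inside an execution, so that it is picked up when the antrun:run goal is invoked directly from the command line. The jar has to be built first (for example with mvn package), since the scp task only copies whatever is already in the target directory.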
You can run the mvn antrun:run command to execute the Maven script task that deploys the MapReduce jar to the Cloudera VM and executes it. You can get the full project from here.
