Sunil's Notes: Maven script for running Hadoop programs

In the WordCount (HelloWorld) MapReduce program post I talked about how to create a WordCount MapReduce program. While developing a MapReduce program I follow the same pattern: first I develop it test-first using MRUnit, then I execute it locally using the driver. The last step for me is always to copy the program to my Cloudera VM and execute it there. I built this Maven script to help with that last part, which is to scp the deployment .jar file to the Cloudera Hadoop VM and then execute it. This is the script I use. When I create a new MapReduce program I have to make a couple of changes in it, but I can always reuse most of it:

Change the value of scp.host to point to my Hadoop VM. If you changed the username and password on your VM, you will have to change those too.
Next I have to change the value of the mainClass attribute to point to the correct class for the MapReduce program that I am developing. In this case the name of the driver class is com.spnotes.hadoop.WordCountDriver.
Then I have to change the value of the command attribute in the sshexec element. The command is made up of different parts:
```
hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs
```
In this command, ${scp.dirCopyTo}/${project.build.finalName}.jar points to the .jar file that was copied to the VM with scp. books/dickens.txt is the path of the input text file. Because it is a relative path, it is resolved in HDFS against the user's home directory, so here it points to hdfs://localhost/user/cloudera/books/dickens.txt. The output of the MapReduce job is generated in hdfs://localhost/user/cloudera/wordcount/outputs.
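To make the three changes concrete, here is a minimal sketch of what the relevant part of such a pom.xml might look like. It is not the exact script from this project. The property names scp.host and scp.dirCopyTo, the driver class, and the hadoop command come from the text above; scp.user, scp.password, the host and directory values, the plugin and dependency versions, and the idea that mainClass is set through the maven-jar-plugin manifest are all assumptions made for illustration.

```xml
<properties>
  <!-- Placeholder host: point this at your own Hadoop VM -->
  <scp.host>your-vm-host</scp.host>
  <!-- Hypothetical property names and values for the VM login -->
  <scp.user>cloudera</scp.user>
  <scp.password>cloudera</scp.password>
  <!-- Assumed directory on the VM that receives the jar -->
  <scp.dirCopyTo>/home/cloudera</scp.dirCopyTo>
</properties>

<build>
  <plugins>
    <!-- Assumption: mainClass is written into the jar manifest, so that
         "hadoop jar" can run the jar without naming the driver class -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>com.spnotes.hadoop.WordCountDriver</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-antrun-plugin</artifactId>
      <version>1.7</version>
      <configuration>
        <target>
          <!-- Copy the built jar to the VM; trust="true" accepts an unknown host key -->
          <scp file="${project.build.directory}/${project.build.finalName}.jar"
               todir="${scp.user}:${scp.password}@${scp.host}:${scp.dirCopyTo}"
               trust="true"/>
          <!-- Run the job on the VM over ssh -->
          <sshexec host="${scp.host}"
                   username="${scp.user}"
                   password="${scp.password}"
                   trust="true"
                   command="hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs"/>
        </target>
      </configuration>
      <dependencies>
        <!-- The Ant scp and sshexec tasks need ant-jsch and jsch on the plugin classpath;
             the versions here are assumptions -->
        <dependency>
          <groupId>org.apache.ant</groupId>
          <artifactId>ant-jsch</artifactId>
          <version>1.9.4</version>
        </dependency>
        <dependency>
          <groupId>com.jcraft</groupId>
          <artifactId>jsch</artifactId>
          <version>0.1.51</version>
        </dependency>
      </dependencies>
    </plugin>
  </plugins>
</build>
```

In a layout like this, only scp.host, mainClass and the command attribute would normally change from one MapReduce program to the next, which is why most of the script can be reused.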

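You can run the mvn antrun:run command to execute the Maven script task that deploys the MapReduce jar to the Cloudera VM and executes it. The jar has to be built first (for example with mvn package), since antrun:run on its own does not run the build lifecycle. You can get the full project from here.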

