In the WordCount(HelloWorld) MapReduce program blog post I talked about how to create a WordCount MapReduce program. While developing a MapReduce program I follow a pattern: first I develop it using MRUnit test-driven development, then I execute it locally using a driver. The last step for me is always to copy the program to my Cloudera VM and execute it there. I built this Maven script to help with that last part, which is to scp the deployment .jar file to the Cloudera Hadoop VM and then execute it. This is the script I use.
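In outline, the script is a maven-antrun-plugin configuration that drives Ant's scp and sshexec tasks. A minimal sketch of that shape follows; the plugin and dependency versions are placeholders, and the scp.* properties are the ones discussed below, so treat this as an illustration rather than my exact script:

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-antrun-plugin</artifactId>
        <version>1.7</version>
        <configuration>
            <target>
                <!-- Copy the built jar to the Hadoop VM -->
                <scp file="${project.build.directory}/${project.build.finalName}.jar"
                     todir="${scp.user}:${scp.password}@${scp.host}:${scp.dirCopyTo}"
                     trust="true"/>
                <!-- Run the MapReduce job on the VM over ssh -->
                <sshexec host="${scp.host}" username="${scp.user}"
                         password="${scp.password}" trust="true"
                         command="hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs"/>
            </target>
        </configuration>
        <dependencies>
            <!-- Ant's scp/sshexec tasks live in ant-jsch and need jsch at runtime -->
            <dependency>
                <groupId>org.apache.ant</groupId>
                <artifactId>ant-jsch</artifactId>
                <version>1.9.4</version>
            </dependency>
            <dependency>
                <groupId>com.jcraft</groupId>
                <artifactId>jsch</artifactId>
                <version>0.1.53</version>
            </dependency>
        </dependencies>
    </plugin>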
When I create a new MapReduce program, I have to make a couple of changes to the script, but I can always reuse most of it:
- Change the value of scp.host to point to my Hadoop VM (this and the other connection properties are sketched after this list). If you changed the username and password on your VM, you will have to change those too.
- Next I have to change the value of the mainClass attribute to point to the correct class for the MapReduce program I am developing. In this case the name of the driver class is com.spnotes.hadoop.WordCountDriver; the sketch after this list shows where this attribute lives.
- Then I have to change the value of the command attribute in the sshexec element. The command is made up of different parts:

hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs

Here ${scp.dirCopyTo}/${project.build.finalName}.jar points to the .jar file that was scp'd to the VM. books/dickens.txt is the path of the input text file; in this case I am using HDFS as the input location, so it points to hdfs://localhost/user/cloudera/books/dickens.txt, and the output of the MapReduce job will be generated in hdfs://localhost/user/cloudera/wordcount/outputs.
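To make those changes concrete: the connection settings are ordinary Maven properties, and mainClass sits in the maven-jar-plugin manifest so that hadoop jar can locate the driver. A sketch with placeholder values (cloudera/cloudera is the default login on the Cloudera QuickStart VM; your host address and credentials may differ):

    <properties>
        <scp.host>192.168.1.100</scp.host>
        <scp.user>cloudera</scp.user>
        <scp.password>cloudera</scp.password>
        <scp.dirCopyTo>/home/cloudera</scp.dirCopyTo>
    </properties>

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
            <archive>
                <manifest>
                    <!-- The driver class that hadoop jar should run -->
                    <mainClass>com.spnotes.hadoop.WordCountDriver</mainClass>
                </manifest>
            </archive>
        </configuration>
    </plugin>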
You can run the mvn antrun:run command to execute the Maven task that deploys the MapReduce jar to the Cloudera VM and executes it. You can get the full project from here.
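For example, a full build-and-deploy cycle looks something like this (the last command, run on the VM, just prints the word counts; part-* matches the standard reducer output file names):

    mvn clean package antrun:run
    # then, on the VM, inspect the job output in HDFS:
    hadoop fs -cat wordcount/outputs/part-*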