Using Apache Oozie to execute MapReduce jobs

I wanted to learn about how to automate MapReduce job using Oozie, so i decide to create Oozie workflow to invoke WordCount(HelloWorld) MapReduce program. I had to follow these steps
  1. FIrst thing that i did was to download the WordCount program source code by executing
    
    git clone https://github.com/sdpatil/HadoopWordCount3
    
    This program does have maven script for building executable jar, so i used mvn clean package command to build Hadoop jar.
  2. After that i tried executing the program manually by using following following command
    
    hadoop jar target/HadoopWordCount.jar sorttest.txt output/wordcount
    
  3. Now in order to use Oozie workflow you will have to create a particular folder structure on your machine
    
    wordcount
       -- job.properties
       -- workflow.xml
       -- lib
             -- HadoopWordCount.jar  
    
  4. In the workcount folder create job.properties file like this, This file lets you pass parameters to your oozie workflow. Value of nameNode and jobTracker represent the name node and job tracker location. In my case i am using cloudera vm with single ndoe so both these properties point to localhost. The value of oozie.wf.application.path is equal to HDFS path where you uploaded the wordcount folder created in step 3
  5. Next define your Apache oozie workflow.xml file like this. In my case the workflow has single step which is to execute mapreduce job. I am
    • mapred.mapper.new-api & mapred.reducer.new-api: Set this property to true if your using the new MapReduce API based on org.apache.hadoop.mapreduce.* classes
    • mapreduce.map.class: The fully qualified name of your mapper class
    • mapreduce.reduce.class: The fully qualified name of your reducer class
    • mapred.output.key.class: Fully qualified name of the output key class. This is same as parameter to job.setOutputKeyClass() in your driver class
    • mapred.output.value.class: Fully qualified name of the output value class. This is same as parameter to job.setOutputValueClass() in your driver class
    • mapred.input.dir: Location of your input file in my case i have sorttext.txt in hdfs://localhost/user/cloudera directory
    • mapred.output.dir:Location of output file that will get generated. In my case i want output to go to hdfs://localhost/user/cloudera/output/wordcount directory
  6. Once your oozie workflow is ready upload the wordcount folder in HDFS by executing following command
    
    hdfs dfs -put oozie wordcount
    
  7. 
    Now run your oozie workflow by executing following command from your wordcount directory
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run
    
    If it runs successfully you should see output generated in hdfs://localhost/user/cloudera/output/wordcount directory

1 comment: