Using Apache Oozie to execute MapReduce jobs

I wanted to learn how to automate a MapReduce job using Oozie, so I decided to create an Oozie workflow that invokes the WordCount (HelloWorld) MapReduce program. I followed these steps:
  1. The first thing I did was download the WordCount program source code by executing
    
    git clone https://github.com/sdpatil/HadoopWordCount3
    
    The project includes a Maven build script, so I ran the mvn clean package command to build the Hadoop jar.
  2. After that I tried executing the program manually by using the following command
    
    hadoop jar target/HadoopWordCount.jar sorttest.txt output/wordcount
    
  3. To use an Oozie workflow you have to create a particular folder structure on your machine
    
    wordcount
       -- job.properties
       -- workflow.xml
       -- lib
             -- HadoopWordCount.jar  
    
  4. In the wordcount folder create a job.properties file. This file lets you pass parameters to your Oozie workflow. The values of nameNode and jobTracker give the NameNode and JobTracker locations. In my case I am using the Cloudera VM with a single node, so both of these properties point to localhost. The value of oozie.wf.application.path is the HDFS path where you upload the wordcount folder created in step 3 (the upload itself happens in step 6).
  5. Next define your Oozie workflow in a workflow.xml file. In my case the workflow has a single step, which is to execute the MapReduce job. I am setting the following properties
    • mapred.mapper.new-api & mapred.reducer.new-api: Set these properties to true if you are using the new MapReduce API based on the org.apache.hadoop.mapreduce.* classes
    • mapreduce.map.class: The fully qualified name of your mapper class
    • mapreduce.reduce.class: The fully qualified name of your reducer class
    • mapred.output.key.class: The fully qualified name of the output key class. This is the same value you pass to job.setOutputKeyClass() in your driver class
    • mapred.output.value.class: The fully qualified name of the output value class. This is the same value you pass to job.setOutputValueClass() in your driver class
    • mapred.input.dir: The location of your input file. In my case I have sorttest.txt in the hdfs://localhost/user/cloudera directory
    • mapred.output.dir: The location of the output that will get generated. In my case I want the output to go to the hdfs://localhost/user/cloudera/output/wordcount directory
  6. Once your Oozie workflow is ready, upload the wordcount folder to HDFS by executing the following command from the folder's parent directory
    
    hdfs dfs -put wordcount wordcount
    
  7. Now run your Oozie workflow by executing the following command from your wordcount directory
    
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run
    
    If it runs successfully, you should see the output generated in the hdfs://localhost/user/cloudera/output/wordcount directory
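
Step 4 describes job.properties without showing it, so here is a minimal sketch of what the file could look like for the single-node setup above. The ports (8020 for the NameNode, 8032 for the YARN ResourceManager) and the /user/cloudera/wordcount application path are assumptions rather than values taken from the original setup; adjust them for your cluster.

```shell
# Write a minimal job.properties into the current (wordcount) folder.
# The quoted 'EOF' keeps the shell from expanding ${nameNode};
# Oozie resolves that variable itself when the job is submitted.
cat > job.properties <<'EOF'
# Assumed NameNode address: localhost with the common default port 8020
nameNode=hdfs://localhost:8020
# Assumed ResourceManager port 8032; an MRv1 JobTracker usually listens on 8021
jobTracker=localhost:8032
# Assumed HDFS location of the uploaded wordcount folder
oozie.wf.application.path=${nameNode}/user/cloudera/wordcount
EOF
```

The nameNode and jobTracker values defined here are the ones a workflow.xml can refer to as ${nameNode} and ${jobTracker}.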

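Step 5 lists the properties but not the workflow file, so the following is a sketch of a workflow.xml with a single map-reduce action that sets them. The class names com.example.WordCountMapper and com.example.WordCountReducer are hypothetical placeholders (use the classes from your own jar), Text and IntWritable are the usual WordCount output types but are assumed here, and the workflow name, the schema version 0.2 and the kill node are illustrative choices.

```shell
# Write a sketch of workflow.xml into the current (wordcount) folder.
# The quoted 'EOF' leaves the ${...} expressions untouched for Oozie to resolve.
cat > workflow.xml <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.2" name="wordcount-wf">
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- Use the new API (org.apache.hadoop.mapreduce.*) -->
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <!-- Hypothetical class names; replace with your own -->
                <property>
                    <name>mapreduce.map.class</name>
                    <value>com.example.WordCountMapper</value>
                </property>
                <property>
                    <name>mapreduce.reduce.class</name>
                    <value>com.example.WordCountReducer</value>
                </property>
                <!-- Assumed output types; match your driver class -->
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${nameNode}/user/cloudera/sorttest.txt</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${nameNode}/user/cloudera/output/wordcount</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>WordCount failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```

MapReduce refuses to write into an output directory that already exists, so remove output/wordcount from HDFS before running the workflow a second time.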