Creating an Oozie workflow for a MapReduce job that uses the distributed cache

In the Using third party jars and files in your MapReduce application (Distributed cache) entry, I blogged about how to create a MapReduce job that uses the distributed cache both for required jar files and for data files. I wanted to figure out how to automate this MapReduce job using Apache Oozie, so I followed these steps:
  1. First I created an apachelog directory, and inside it I created a job.properties file like this:
    nameNode=hdfs://localhost.localdomain:8020
    jobTracker=localhost.localdomain:8021
    queueName=default
    oozie.wf.application.path=${nameNode}/user/${user.name}/apachelog
    outputDir=apachelog
    logFile=apache.log
  2. Then I created a workflow.xml file that looks like this. One thing to notice is <file>GeoLite.mmdb#GeoLite2-City.mmdb</file>: I have the file GeoLite.mmdb on disk, but I want to refer to it as GeoLite2-City.mmdb in my program, so the file element takes care of creating a symlink with that name. A sketch of the mapper it references follows the XML.
    <workflow-app xmlns="uri:oozie:workflow:0.2" name="apachelog-wf">
      <start to="mr-node"/>
      <action name="mr-node">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <prepare>
            <delete path="${nameNode}/user/${wf:user()}/output/${outputDir}"/>
          </prepare>
          <configuration>
            <property>
              <name>mapred.mapper.new-api</name>
              <value>true</value>
            </property>
            <property>
              <name>mapred.reducer.new-api</name>
              <value>true</value>
            </property>
            <property>
              <name>mapred.job.queue.name</name>
              <value>${queueName}</value>
            </property>
            <property>
              <name>mapreduce.map.class</name>
              <value>com.spnotes.hadoop.logs.ApacheLogMapper</value>
            </property>
            <property>
              <name>mapreduce.reduce.class</name>
              <value>com.spnotes.hadoop.logs.ApacheLogReducer</value>
            </property>
            <property>
              <name>mapred.output.key.class</name>
              <value>org.apache.hadoop.io.Text</value>
            </property>
            <property>
              <name>mapred.output.value.class</name>
              <value>org.apache.hadoop.io.IntWritable</value>
            </property>
            <property>
              <name>mapred.map.tasks</name>
              <value>1</value>
            </property>
            <property>
              <name>mapred.input.dir</name>
              <value>/user/${wf:user()}/${logFile}</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/user/${wf:user()}/output/${outputDir}</value>
            </property>
          </configuration>
          <file>GeoLite.mmdb#GeoLite2-City.mmdb</file>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>
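
    Before moving on, here is a minimal sketch of what the ApacheLogMapper referenced above could look like. This is an assumption for illustration: the real log parsing and GeoIP lookup logic is in the earlier post, so only the distributed-cache wiring is shown.

      package com.spnotes.hadoop.logs;

      import java.io.File;
      import java.io.IOException;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class ApacheLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

          private File geoDatabase;

          @Override
          protected void setup(Context context) throws IOException, InterruptedException {
              // Because of <file>GeoLite.mmdb#GeoLite2-City.mmdb</file> in workflow.xml,
              // Oozie ships GeoLite.mmdb with the job and symlinks it into the task's
              // working directory under the name after the '#'.
              geoDatabase = new File("GeoLite2-City.mmdb");
              if (!geoDatabase.exists()) {
                  throw new IOException("GeoLite2-City.mmdb not found in working directory");
              }
              // Initialize the GeoIP reader from geoDatabase here (see the earlier post).
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              // Parse the Apache log line, resolve the client IP to a city using the
              // GeoIP database, and emit (city, 1); details are in the earlier post.
          }
      }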
  3. Then I copied all the required jar files into the lib folder; the resulting directory structure is sketched below.
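
    For reference, the local apachelog directory ends up looking roughly like this; the jar names are placeholders that depend on your build, and GeoLite.mmdb sits next to workflow.xml so that the relative path in the <file> element resolves:

      apachelog/
      ├── job.properties
      ├── workflow.xml
      ├── GeoLite.mmdb
      └── lib/
          ├── <your-mapreduce-job>.jar
          └── <third-party jars, e.g. the GeoIP2 libraries>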
  4. I used the following command to copy the apachelog directory, which contains everything my Oozie job needs, to HDFS
    
    hdfs dfs -put apachelog apachelog
    
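    Note that because the <file> element uses a relative path, Oozie resolves it against the workflow application directory in HDFS, so GeoLite.mmdb must land there along with everything else. A quick listing works as a sanity check:

      hdfs dfs -ls apachelog
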
  5. The last step is to invoke the Oozie job by executing the following command
    
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run
    
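    Oozie prints a job id on submission. To check on the job afterwards, you can run something like this, where <job-id> is a placeholder for the id that was printed:

      oozie job -oozie http://localhost:11000/oozie -info <job-id>
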

1 comment:

Anonymous said...

Where is the location for the distributed cache file? I mean, should it be on HDFS? Can you please help?
