Creating Oozie workflow for mapreduce job that uses distributed cache

In the Using third party jars and files in your MapReduce application (Distributed cache) entry, I blogged about how to create a MapReduce job that uses the distributed cache for both the required jar files and the data files the job needs. I wanted to figure out how to automate that MapReduce job using Apache Oozie, so I followed these steps.
First I created an apachelog directory, and in it I created a job.properties file that looks like this:
nameNode=hdfs://localhost.localdomain:8020
jobTracker=localhost.localdomain:8021
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/apachelog
outputDir=apachelog
logFile=apache.log
Then I created a workflow.xml file that looks like the listing below. One thing to notice in it is the

<file>GeoLite.mmdb#GeoLite2-City.mmdb</file>

element: I have a file called GeoLite.mmdb on disk, but I want to refer to it as GeoLite2-City.mmdb in my program, so the file element takes care of creating a symlink with that name in the task's working directory.

<workflow-app xmlns="uri:oozie:workflow:0.2" name="apachelog-wf">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/output/${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.map.class</name>
                    <value>com.spnotes.hadoop.logs.ApacheLogMapper</value>
                </property>
                <property>
                    <name>mapreduce.reduce.class</name>
                    <value>com.spnotes.hadoop.logs.ApacheLogReducer</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.map.tasks</name>
                    <value>1</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/${wf:user()}/${logFile}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/${wf:user()}/output/${outputDir}</value>
                </property>
            </configuration>
            <file>GeoLite.mmdb#GeoLite2-City.mmdb</file>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Then I copied all the required jar files into the lib folder, so the apachelog directory ended up with job.properties and workflow.xml at the top level, the jars under lib, and GeoLite.mmdb next to workflow.xml (a relative path in the file element is resolved against the workflow application directory).
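The post doesn't show the mapper code, but here is a minimal sketch of how a mapper could pick up the cached file by its symlink name. This is only an illustration, not the actual ApacheLogMapper: the class name GeoLookupMapper is hypothetical, and it assumes the MaxMind GeoIP2 Java library is in the lib folder and that the client IP is the first token of each log line.

package com.spnotes.hadoop.logs;

import java.io.File;
import java.io.IOException;
import java.net.InetAddress;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.GeoIp2Exception;

public class GeoLookupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private DatabaseReader reader;

    @Override
    protected void setup(Context context) throws IOException {
        // Because of the <file> element, a symlink named GeoLite2-City.mmdb
        // is created in the task's working directory, so a relative path works
        reader = new DatabaseReader.Builder(new File("GeoLite2-City.mmdb")).build();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: the client IP is the first whitespace-separated token
        String ip = value.toString().split("\\s+")[0];
        try {
            String city = reader.city(InetAddress.getByName(ip)).getCity().getName();
            if (city != null) {
                context.write(new Text(city), ONE);
            }
        } catch (GeoIp2Exception e) {
            // Skip addresses that are not in the GeoIP database
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        reader.close();
    }
}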
I used the following command to copy the apachelog directory, which has everything my Oozie job needs, to HDFS:
hdfs dfs -put apachelog apachelog
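If you want to verify the upload, listing the directory should show job.properties, workflow.xml, GeoLite.mmdb and the lib folder:

hdfs dfs -ls apachelog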
The last step is to invoke the Oozie job by executing the following command:
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
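Once the job is submitted, Oozie prints a job id; you can pass that id back to the oozie client to check the workflow's progress (replace the placeholder with the id you get back):

oozie job -oozie http://localhost:11000/oozie -info <job-id>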