Creating an Oozie workflow for a MapReduce job that uses the distributed cache

In the Using third party jars and files in your MapReduce application (Distributed cache) entry, I blogged about how to create a MapReduce job that uses the distributed cache both for required jar files and for data files. I wanted to figure out how to automate this MapReduce job using Apache Oozie, so I followed these steps:
  1. First I created an apachelog directory, and inside it I created a job.properties file like this:
    nameNode=hdfs://localhost.localdomain:8020
    jobTracker=localhost.localdomain:8021
    queueName=default
    oozie.wf.application.path=${nameNode}/user/${user.name}/apachelog
    outputDir=apachelog
    logFile=apache.log
  2. Then I created a workflow.xml file that looks like this. One thing to notice is <file>GeoLite.mmdb#GeoLite2-City.mmdb</file>: I have the file GeoLite.mmdb on disk, but I want to refer to it as GeoLite2-City.mmdb in my program, so the file element takes care of creating a symlink with that name. A sketch of the mapper it references follows the XML.
    <workflow-app xmlns="uri:oozie:workflow:0.2" name="apachelog-wf">
      <start to="mr-node"/>
      <action name="mr-node">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <prepare>
            <delete path="${nameNode}/user/${wf:user()}/output/${outputDir}"/>
          </prepare>
          <configuration>
            <property>
              <name>mapred.mapper.new-api</name>
              <value>true</value>
            </property>
            <property>
              <name>mapred.reducer.new-api</name>
              <value>true</value>
            </property>
            <property>
              <name>mapred.job.queue.name</name>
              <value>${queueName}</value>
            </property>
            <property>
              <name>mapreduce.map.class</name>
              <value>com.spnotes.hadoop.logs.ApacheLogMapper</value>
            </property>
            <property>
              <name>mapreduce.reduce.class</name>
              <value>com.spnotes.hadoop.logs.ApacheLogReducer</value>
            </property>
            <property>
              <name>mapred.output.key.class</name>
              <value>org.apache.hadoop.io.Text</value>
            </property>
            <property>
              <name>mapred.output.value.class</name>
              <value>org.apache.hadoop.io.IntWritable</value>
            </property>
            <property>
              <name>mapred.map.tasks</name>
              <value>1</value>
            </property>
            <property>
              <name>mapred.input.dir</name>
              <value>/user/${wf:user()}/${logFile}</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/user/${wf:user()}/output/${outputDir}</value>
            </property>
          </configuration>
          <file>GeoLite.mmdb#GeoLite2-City.mmdb</file>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>
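
    Before moving on, here is a minimal sketch of what the ApacheLogMapper referenced above could look like. This is an assumption for illustration: the real log parsing and GeoIP lookup logic is in the earlier post, so only the distributed-cache wiring is shown.

      package com.spnotes.hadoop.logs;

      import java.io.File;
      import java.io.IOException;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class ApacheLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

          private File geoDatabase;

          @Override
          protected void setup(Context context) throws IOException, InterruptedException {
              // Because of <file>GeoLite.mmdb#GeoLite2-City.mmdb</file> in workflow.xml,
              // Oozie ships GeoLite.mmdb with the job and symlinks it into the task's
              // working directory under the name after the '#'.
              geoDatabase = new File("GeoLite2-City.mmdb");
              if (!geoDatabase.exists()) {
                  throw new IOException("GeoLite2-City.mmdb not found in working directory");
              }
              // Initialize the GeoIP reader from geoDatabase here (see the earlier post).
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              // Parse the Apache log line, resolve the client IP to a city using the
              // GeoIP database, and emit (city, 1); details are in the earlier post.
          }
      }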
  3. Then I copied all the required jar files into the lib folder; the resulting directory structure is sketched below.
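
    For reference, the local apachelog directory ends up looking roughly like this; the jar names are placeholders that depend on your build, and GeoLite.mmdb sits next to workflow.xml so that the relative path in the <file> element resolves:

      apachelog/
      ├── job.properties
      ├── workflow.xml
      ├── GeoLite.mmdb
      └── lib/
          ├── <your-mapreduce-job>.jar
          └── <third-party jars, e.g. the GeoIP2 libraries>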
  4. I used the following command to copy the apachelog directory, which contains everything my Oozie job needs, to HDFS
    
    hdfs dfs -put apachelog apachelog
    
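    Note that because the <file> element uses a relative path, Oozie resolves it against the workflow application directory in HDFS, so GeoLite.mmdb must land there along with everything else. A quick listing works as a sanity check:

      hdfs dfs -ls apachelog
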
  5. The last step is to invoke the Oozie job by executing the following command
    
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run
    
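    Oozie prints a job id on submission. To check on the job afterwards, you can run something like this, where <job-id> is a placeholder for the id that was printed:

      oozie job -oozie http://localhost:11000/oozie -info <job-id>
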

1 comment:

Anonymous said...

Where is the location for the distributed cache file? I mean, should it be on HDFS? Can you please help?
