Creating an Oozie workflow for a MapReduce job that uses the distributed cache

In the "Using third part jars and files in your MapReduce application(Distributed cache)" entry I blogged about how to create a MapReduce job that uses the distributed cache for both the jar files it depends on and the data files it reads at run time. I wanted to automate that MapReduce job with Apache Oozie, so I followed these steps:
  1. First I created an apachelog directory and, inside it, a job.properties file.
  2. Then I created a workflow.xml file. One thing to notice in it is the <file>GeoLite.mmdb#GeoLite2-City.mmdb</file> element: the file is stored as GeoLite.mmdb, but my program refers to it as GeoLite2-City.mmdb, so the file element takes care of creating a symlink with that name.
  3. Then I copied all the required jar files into the lib folder inside the apachelog directory.
  4. I used the following command to copy the apachelog directory, which has everything my Oozie job needs, to HDFS:
    
    hdfs dfs -put apachelog apachelog
    
  5. The last step is to invoke the Oozie job by executing the following command:
    
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run
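
The contents of job.properties and workflow.xml are not reproduced in the steps above, so here is a minimal sketch of what steps 1 to 3 could look like when scripted. Only the apachelog directory, the lib folder, and the <file> element come from this post. The host names and ports, the inputDir and outputDir parameters, the workflow and action names, the mapper class com.example.ApacheLogMapper, and the workflow schema version are assumptions chosen for illustration, so adjust them to your own cluster and job.

```shell
#!/bin/sh
# Sketch only: builds the local apachelog directory described in steps 1-3.
set -eu

APP_DIR=apachelog          # directory name used in the post
mkdir -p "$APP_DIR/lib"    # lib/ is where the required jars are copied

# job.properties -- host names, ports and paths below are ASSUMED values.
# The quoted heredoc keeps ${...} literal so Oozie, not the shell, resolves it.
cat > "$APP_DIR/job.properties" <<'EOF'
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
inputDir=${nameNode}/user/${user.name}/apachelog-input
outputDir=${nameNode}/user/${user.name}/apachelog-output
oozie.wf.application.path=${nameNode}/user/${user.name}/apachelog
EOF

# workflow.xml -- a single map-reduce action. The mapper class and the
# property names are HYPOTHETICAL; use the ones your own job needs.
cat > "$APP_DIR/workflow.xml" <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.4" name="apachelog-wf">
    <start to="apachelog-mr"/>
    <action name="apachelog-mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.ApacheLogMapper</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
            <!-- Stored as GeoLite.mmdb, visible to tasks as GeoLite2-City.mmdb -->
            <file>GeoLite.mmdb#GeoLite2-City.mmdb</file>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>MapReduce failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF

# Show the resulting layout. Copy your jars into lib/ and GeoLite.mmdb
# next to workflow.xml before running the hdfs and oozie commands.
find "$APP_DIR" | sort
```

Because the path in the <file> element is relative, Oozie looks for GeoLite.mmdb inside the workflow application directory in HDFS. In this sketch that means placing the file next to workflow.xml in apachelog before running the hdfs dfs -put command from step 4.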
    

2 comments:

Anonymous said...

Where is the location of the distributed cache file? I mean, should it be in HDFS? Can you please help?

Abhi said...

Thanks for info....