Using DistributedCache with a MapReduce job

In the Using third party jars and files in your MapReduce application (Distributed cache) entry I blogged about how to use the distributed cache in Hadoop using the command line option. But you also have the option of using the DistributedCache API programmatically; the main change is a job.addCacheFile() call in your MapReduce Driver class. You will have to use the following steps:
  1. In order to use a file with the DistributedCache API, it has to be available at an hdfs:// or http:// URL that is accessible to all the cluster members. So the first step is to upload the file you are interested in into HDFS; in my case I used the following command to copy the GeoLite2-City.mmdb file to HDFS.
    
    hdfs dfs -copyFromLocal GeoLite2-City.mmdb /GeoLite2-City.mmdb
    
  2. The next step is to change the Driver class and add a job.addCacheFile(new URI("hdfs://localhost:9000/GeoLite2-City.mmdb#GeoLite2-City.mmdb")); call. This call takes the HDFS URL of the file that you just uploaded and passes it to the DistributedCache machinery. The #GeoLite2-City.mmdb fragment tells Hadoop to create a symbolic link with that name to the file in the task's working directory (see the sketch after this list).
  3. Now in your Mapper class you can read the GeoLite2-City.mmdb file using the normal java.io.File API.
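Putting steps 2 and 3 together, here is a minimal sketch of what the Driver and Mapper changes might look like. GeoLookupDriver, GeoLookupMapper, and the input/output arguments are hypothetical names used for illustration; only the job.addCacheFile() call and the GeoLite2-City.mmdb symlink come from the steps above.

    import java.io.File;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GeoLookupDriver {

      public static class GeoLookupMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {

        private File geoDb;

        @Override
        protected void setup(Context context) {
          // Step 3: the #GeoLite2-City.mmdb fragment makes Hadoop create a
          // symlink in the task's working directory, so the cached file can
          // be opened with the normal java.io.File API.
          geoDb = new File("GeoLite2-City.mmdb");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // In a real job you would look up value against geoDb (e.g. with
          // the MaxMind reader); emitting the raw line keeps this sketch
          // self-contained.
          context.write(value, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "geo lookup");
        job.setJarByClass(GeoLookupDriver.class);
        job.setMapperClass(GeoLookupMapper.class);
        job.setNumReduceTasks(0); // map-only job for this sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Step 2: register the HDFS file with the distributed cache.
        job.addCacheFile(
            new URI("hdfs://localhost:9000/GeoLite2-City.mmdb#GeoLite2-City.mmdb"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }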
When you use the distributed cache, Hadoop first copies the file registered via the DistributedCache API onto the machine executing the task. You can see the localized copy by looking at the MapReduce temp directory, like this:
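On a default single-node YARN setup the localized copy typically lands under the NodeManager's local directory. The exact path depends on the yarn.nodemanager.local-dirs setting and the numeric cache id, so treat this as a sketch:

    ls -l /tmp/hadoop-$USER/nm-local-dir/usercache/$USER/filecache/*/GeoLite2-City.mmdb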

2 comments:

Anonymous said...

Hi

I was trying to add files to the distributed cache and retrieve them in my Map class.

The issue that I faced was with the URL.

The MR job errored out with a file-not-found exception.

What I saw was weird.

The file that I passed had the following name (for example):
hdfs://localhost:9000/abc/xyz.txt

I printed it with System.out.println and it came out correctly in the Mapper class. But when I used the same file name in my FileReader object's constructor, it was resolved as:

hdfs:/localhost:9000/abc/xyz.txt

I tried adding an additional / but it didn't work out.

Can someone please help?

Anonymous said...

Hey, thanks a lot for such an informative article... you saved me many hours... cheers!!!
