In the "Using third part jars and files in your MapReduce application(Distributed cache)" entry I blogged about how to use the Distributed Cache in Hadoop through a command line option. You also have the option of using the DistributedCache API. Follow these steps to use DistributedCache programmatically.
To use it, first change your MapReduce Driver class to add a job.addCacheFile() call.
- To use a file with the DistributedCache API, it has to be available at an hdfs:// or http:// URL that is accessible to all the cluster members. So the first step is to upload the file you are interested in into HDFS. In my case I used the following command to copy the GeoLite2-City.mmdb file to HDFS.
hdfs dfs -copyFromLocal GeoLite2-City.mmdb /GeoLite2-City.mmdb
- The next step is to change the Driver class and add the following call:
job.addCacheFile(new URI("hdfs://localhost:9000/GeoLite2-City.mmdb#GeoLite2-City.mmdb"));
This call takes the HDFS URL of the file you just uploaded and passes it to the DistributedCache class. The #GeoLite2-City.mmdb part at the end tells Hadoop to create a symbolic link with that name pointing to the file.
- Now in your Mapper class you can read GeoLite2-City.mmdb using the normal File API.
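The #GeoLite2-City.mmdb suffix in the Driver step above is an ordinary URI fragment. As a minimal sketch using only java.net.URI (the CacheUriDemo class and the cacheUri helper are hypothetical names for illustration, not part of Hadoop), this shows which part of the string is the HDFS path and which part becomes the link name:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CacheUriDemo {

    // Hypothetical helper: joins the HDFS location and the link name with '#'.
    static URI cacheUri(String hdfsLocation, String linkName) throws URISyntaxException {
        return new URI(hdfsLocation + "#" + linkName);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = cacheUri("hdfs://localhost:9000/GeoLite2-City.mmdb", "GeoLite2-City.mmdb");

        // Print the individual pieces of the cache URI.
        System.out.println("scheme:   " + uri.getScheme());
        System.out.println("host:     " + uri.getHost());
        System.out.println("port:     " + uri.getPort());
        System.out.println("path:     " + uri.getPath());
        System.out.println("fragment: " + uri.getFragment());

        // In the Driver class this URI is what gets handed to the job:
        // job.addCacheFile(uri);
    }
}
```

The name after # does not have to match the file name in HDFS. Whatever you put there is the name the task should look for.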
When you use the distributed cache, Hadoop first copies the file specified through the DistributedCache API onto the machine executing the task. You can see the copy by looking in the MapReduce temp directory on that machine.
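Because the file ends up on the local disk of the task machine, reading it in the Mapper needs nothing Hadoop-specific. Here is a minimal sketch, assuming the symbolic link appears in the task's working directory under the name given after #. The CachedFileCheck class and the openCached method are hypothetical names, and in a real job you would probably call something like this from the Mapper's setup() method:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

public class CachedFileCheck {

    // Opens a localized cache file by its link name. A relative name is
    // resolved against the current working directory. Fails early with the
    // absolute path in the message so a missing link is easy to diagnose.
    static InputStream openCached(String linkName) throws IOException {
        File file = new File(linkName);
        if (!file.isFile()) {
            throw new FileNotFoundException(
                "Cache file not found at " + file.getAbsolutePath());
        }
        return new FileInputStream(file);
    }

    public static void main(String[] args) throws IOException {
        // Throws FileNotFoundException when run outside a task that has the link.
        try (InputStream in = openCached("GeoLite2-City.mmdb")) {
            System.out.println("Opened cache file, bytes available: " + in.available());
        }
    }
}
```

Note that the name passed in is the plain link name, not the hdfs:// URL. The File API only knows about local paths.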
Hi
I was trying to add files to the distributed cache and retrieve them in my Map class.
The issue I faced was with the URL.
The MR job errored out with a file not found exception.
What I saw was weird.
The file that I passed had the following name [for ex]
hdfs://localhost:9000/abc/xyz.txt
I printed it with System.out.println and it was printed correctly in the mapper class. But when I used the same file name in my FileReader object's constructor, it was resolved like
hdfs:/localhost:9000/abc/xyz.txt
I tried adding an additional / but it didn't work.
Can someone please help?
Hey, thanks a lot for such an informative article. You saved me many hours. Cheers!
Very nice post. Thank you for the info.