If you want to use a third-party jar in your MapReduce program, you have two options: build a single jar that contains all the dependencies, or use the Hadoop distributed cache.
I wanted to try both options, so I built a simple application that reads a standard Apache HTTP Server log and parses each line to extract the IP address the request came from. I then use the IP address to invoke a GeoIP lookup and find out which city and country the request came from. Once I have the city, I use it to count the number of requests that came from each city. You can download the source code for this project from
here. In the mapper class of my MapReduce application, the setup() method creates a com.maxmind.geoip2.DatabaseReader object, and I pass GeoLite2-City.mmdb, the GeoIP database file, to that reader.
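The flow described above (take the client address from each log line, resolve it to a city, count requests per city) can be sketched in plain Java without Hadoop. This is only an illustration, not the project's mapper: the class name CityRequestCount, the lookup function, and the sample addresses and city names are all made up, and the stub lookup stands in for the real DatabaseReader call. It assumes the client address is the first whitespace-separated field of each line, as in Apache's Common Log Format.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical stand-alone sketch; not the mapper from the project.
class CityRequestCount {

    // Assumption: the client address is the first whitespace-separated
    // field of the log line (Common Log Format). Blank lines yield empty.
    static Optional<String> extractClientIp(String logLine) {
        if (logLine == null) {
            return Optional.empty();
        }
        String trimmed = logLine.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(trimmed.split("\\s+", 2)[0]);
    }

    // Resolves each line's address to a city through the supplied lookup
    // and counts requests per city. The lookup is a placeholder for the
    // GeoIP database query.
    static Map<String, Integer> countByCity(List<String> logLines,
                                            Function<String, String> cityLookup) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            extractClientIp(line)
                .map(cityLookup)
                .ifPresent(city -> counts.merge(city, 1, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up lookup table standing in for GeoLite2-City.mmdb.
        Map<String, String> fakeGeoDb = new HashMap<>();
        fakeGeoDb.put("192.0.2.10", "Pune");
        fakeGeoDb.put("198.51.100.7", "London");

        List<String> lines = Arrays.asList(
            "192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 200 2326",
            "198.51.100.7 - - [10/Oct/2000:13:56:01 -0700] \"GET /about.html HTTP/1.0\" 200 512",
            "192.0.2.10 - - [10/Oct/2000:13:57:12 -0700] \"GET /logo.png HTTP/1.0\" 200 4096");

        Map<String, Integer> counts =
            countByCity(lines, ip -> fakeGeoDb.getOrDefault(ip, "Unknown"));
        counts.forEach((city, n) -> System.out.println(city + "\t" + n));
    }
}
```

In a MapReduce job the same work would typically be split in two: the mapper emits a (city, 1) pair for each line, and the reducer sums the values for each city.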
For this to work, my MapReduce program needs access to com.maxmind.geoip2.geoip2 along with its dependencies. It also needs access to GeoLite2-City.mmdb.
Creating a single jar with all dependencies
For this method to work, add the maven-assembly-plugin to your Maven build file, using the jar-with-dependencies descriptor. In this case the jar is created with com.spnotes.hadoop.logs.ApacheLogMRDriver as its main class. You can then build a single jar file by executing
mvn clean compile assembly:single
Copy GeoLite2-City.mmdb to the Hadoop cluster (you could package it inside the jar, but that might not be a good idea). Now you can run the job on the cluster with the following command
hadoop jar ApacheLogsMR-jar-with-dependencies.jar -files GeoLite2-City.mmdb /apache.log /output/logs/apachelogs
Using Hadoop distributed cache
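The second option is to create a jar that contains only your MapReduce program and pass the dependencies to Hadoop through the distributed cache. With this option, make sure that your Driver class extends Configured (typically together with implementing Tool and launching through ToolRunner), so that Hadoop parses the -libjars and -files options before your own arguments reach the driver.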
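As a rough illustration of why the driver has to go through Hadoop's option handling: generic options such as -libjars and -files are consumed first, and only the remaining arguments (here the input and output paths) are left for the job itself. The stand-alone sketch below mimics that split in plain Java. GenericArgs and its members are hypothetical names, it only knows the two options used in this post, and it is not Hadoop's own parser, which supports more options and does the actual shipping of files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for generic option handling.
class GenericArgs {

    // Only the two generic options that appear in this post.
    private static final List<String> OPTIONS_WITH_VALUE =
        Arrays.asList("-libjars", "-files");

    // Option name -> comma-separated values split into a list.
    final Map<String, List<String>> options = new LinkedHashMap<>();
    // Everything after the generic options, e.g. input and output paths.
    final List<String> remaining = new ArrayList<>();

    static GenericArgs parse(String[] args) {
        GenericArgs parsed = new GenericArgs();
        int i = 0;
        // The commands in this post put generic options first, so this
        // sketch stops at the first argument that is not a known option.
        while (i + 1 < args.length && OPTIONS_WITH_VALUE.contains(args[i])) {
            parsed.options.put(args[i], Arrays.asList(args[i + 1].split(",")));
            i += 2;
        }
        parsed.remaining.addAll(Arrays.asList(args).subList(i, args.length));
        return parsed;
    }

    public static void main(String[] args) {
        String[] commandLine = {
            "-libjars", "geoip2-0.7.1.jar,maxmind-db-0.3.2.jar",
            "-files", "GeoLite2-City.mmdb",
            "/apache.log", "/output/logs/apachelogs"
        };
        GenericArgs parsed = parse(commandLine);
        System.out.println("libjars   = " + parsed.options.get("-libjars"));
        System.out.println("files     = " + parsed.options.get("-files"));
        System.out.println("remaining = " + parsed.remaining);
    }
}
```

If the driver skips this step, the option names would reach the job as if they were the input and output paths.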
Copy the dependency jars, as well as the GeoLite2-City.mmdb file, onto the machine where you run the hadoop command. Then execute the following command to pass both the jars and the file to Hadoop.
hadoop jar ApacheLogsMR.jar -libjars geoip2-0.7.1.jar,maxmind-db-0.3.2.jar,jackson-databind-2.2.3.jar,jackson-core-2.2.3.jar,jackson-annotations-2.2.3.jar -files GeoLite2-City.mmdb /apache.log /output/logs/apachelogs
Where will I get sample logs for this sample?
Can't thank you enough for this. I was looking for an option to share jar resources across reducers without explicitly copying it on hdfs or using DistributedCache. Didn't know about the "-files" option. Thanks again.
Thanks for info....