Using Apache Oozie for automating streaming map-reduce job

In the WordCount MapReduce program using Hadoop streaming and python entry I talked about how to create a streaming MapReduce job using Python. I wanted to figure out how to automate that program using an Oozie workflow, so I followed these steps

- The first step was to create a folder called streaming on my local machine and copy mapper.py and reducer.py into it. I also created placeholders for job.properties and workflow.xml.
- Next I created a job.properties file (a sketch of it is shown after these steps). This job.properties is quite similar to the job.properties for a Java MapReduce job; the only difference is that you must set
oozie.use.system.libpath=true
By default the streaming-related jars are not included in the classpath, so unless you set that value to true you will get the following error:
2014-07-23 06:15:13,170 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1649)
    at org.apache.hadoop.mapred.JobConf.getMapRunnerClass(JobConf.java:1010)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:413)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1617)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1641)
    ... 8 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1523)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1615)
    ... 9 more
2014-07-23 06:15:13,175 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
- The next step in the process is to create a workflow.xml file (a sketch is shown after these steps). Make sure to add a
<file>mapper.py#mapper.py</file>
element for each script in the workflow.xml; it takes care of shipping mapper.py and reducer.py to the task nodes (via the distributed cache) and creating symbolic links to these two files.
- Upload the streaming folder with all your changes to HDFS by executing the following command
hdfs dfs -put streaming streaming
- You can trigger the Oozie workflow by executing the following command
oozie job -oozie http://localhost:11000/oozie -config streaming/job.properties -run
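The gist that originally showed the job.properties is not reproduced above, so here is a minimal sketch of what it might look like. The NameNode/JobTracker addresses and the application path are assumptions you would adjust for your own cluster; the key line is oozie.use.system.libpath=true.

nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
queueName=default

# streaming jars are not on the classpath unless the system libpath is enabled
oozie.use.system.libpath=true

# assumes the streaming folder was uploaded to the user's home directory in HDFS
oozie.wf.application.path=${nameNode}/user/${user.name}/streaming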
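The workflow.xml referenced in the steps is also not reproduced above; the following is a rough sketch of a streaming action. The node names, input/output paths, and the "python mapper.py" invocation are illustrative assumptions rather than the exact file from the original post, while the <streaming> and <file> elements are the parts the steps above depend on.

<workflow-app xmlns="uri:oozie:workflow:0.4" name="streaming-wordcount-wf">
    <start to="streaming-node"/>
    <action name="streaming-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <!-- clean up the output directory so reruns do not fail -->
                <delete path="${nameNode}/user/${wf:user()}/output/wordcount"/>
            </prepare>
            <streaming>
                <!-- assumes python is on the path of the task nodes -->
                <mapper>python mapper.py</mapper>
                <reducer>python reducer.py</reducer>
            </streaming>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/${wf:user()}/aesop.txt</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/${wf:user()}/output/wordcount</value>
                </property>
            </configuration>
            <!-- ship the scripts with the job and create symlinks named mapper.py / reducer.py -->
            <file>mapper.py#mapper.py</file>
            <file>reducer.py#reducer.py</file>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <fail name="fail">
        <message>Streaming workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </fail>
    <end name="end"/>
</workflow-app>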
WordCount MapReduce program using Pydoop (MapReduce framework for Python)
In the WordCount MapReduce program using Hadoop streaming and python entry I used Hadoop Streaming for creating a MapReduce program, but that program is quite verbose and has limitations, such as not being able to use counters, etc.
So I decided to develop the same program using Pydoop, which is a framework that makes developing Hadoop MapReduce programs, as well as working with HDFS, easier. I followed these steps
- First I followed the instructions on the pydoop installation page to install Pydoop on my machine. I ran into some issues during that process but eventually got Pydoop installed.
- Next I created a HelloPydoop.py file which contains a mapper function and a reducer function (sketched after these steps). The mapper function gets a line number and a line at a time; in that function I take care of breaking the line into words and then writing them to the output with writer.emit(). The reducer method gets the word and its counts in (key, [value, value1]) format, which is different from Hadoop Streaming, where I have to take care of detecting the change in key myself, so this code is much more compact.
- Once my HelloPydoop.py file is ready I can invoke it by passing it to the pydoop script command. Here aesop.txt is the name of the input file in HDFS, and I want the output to get generated in the output/pydoop directory in HDFS.
pydoop script /home/user/PycharmProjects/HelloWorld/Pydoop/HelloPydoop.py aesop.txt output/pydoop
- After the MapReduce job is done executing I can look at its output by executing the following command
hdfs dfs -cat output/pydoop/part-00000
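The embedded HelloPydoop.py is not reproduced above; the following is a minimal sketch of what it could look like for the pydoop script runner, reconstructed from the description in the steps (the function bodies are assumptions, not the original code).

# HelloPydoop.py - word count for the "pydoop script" runner (sketch)

def mapper(line_number, line, writer):
    # pydoop script calls this once per input line; the key is the position of the line
    for word in line.split():
        writer.emit(word, 1)

def reducer(word, icounts, writer):
    # icounts is an iterable holding every count emitted for this word
    writer.emit(word, sum(map(int, icounts)))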
WordCount MapReduce program using Hadoop streaming and python
I wanted to learn how to use Hadoop Streaming, which allows us to use scripting languages such as Python, Ruby, etc. for developing MapReduce programs. The idea is that instead of writing Java classes for the Mapper and Reducer, you develop two script files (something that can be executed from the command line), one for the mapper and the other for the reducer, and pass them to Hadoop. Hadoop communicates with the script files using standard input/output: for both the mapper and the reducer, Hadoop passes input on standard input and your script reads it from there. Once your script is done processing the data, in either the mapper or the reducer, it writes its output to standard output, and that output gets sent back to Hadoop.
I decided to create a Word Count program that takes a file as input, counts the occurrence of every word in the file, and writes the counts to the output. I followed these steps
- I started by creating a mapper.py file (sketched after these steps). In the mapper I read one line from the input at a time, split it into words, and write each one to the output in (word, 1) format. Whatever the mapper writes to standard output gets passed back to Hadoop, so I could not use standard output for writing debug statements. Instead I configured a file logger that generates debug.log in the current directory.
- Next I created a reducer.py program (also sketched after these steps) that reads one line at a time and splits it on the tab character; in the split, the first part is the word and the second is the count. One difference between a Java reducer and a streaming reducer is that in Java your reduce method gets input like this
(key, [value1, value2, value3]), (key1, [value1, value2, value3])
In streaming it gets called with one key and value at a time, like this: (key, value1), (key, value2), (key, value3), (key1, value1), (key1, value2), (key1, value3), so you have to remember which key you are processing and handle the change in key yourself. In my reducer I keep track of the current key, and for every value of the current key I keep accumulating the count; when the key changes I use that opportunity to emit the old key and its count.
- One good part of developing with a scripting language is that you can test your code without Hadoop as well. Once my mapper and reducer are ready I can test them on the command line using the
data | mapper | sort | reducer
pattern. In my case the mapper and reducer files are in the /home/user/workspace/HadoopPython/streaming/ directory, and I have a sample file in my home directory, so I could test my program by executing it like this
cat ~/sample.txt | /home/user/workspace/HadoopPython/streaming/mapper.py | sort | /home/user/workspace/HadoopPython/streaming/reducer.py
- After working through the bugs I copied aesop.txt into the root of my HDFS and then I could use the following command to execute my MapReduce program.
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -input aesop.txt -output output/wordcount -mapper /home/user/workspace/HadoopPython/streaming/mapper.py -reducer /home/user/workspace/HadoopPython/streaming/reducer.py
- Once the program is done executing I could see the output generated by it using the following command
hdfs dfs -cat output/wordcount/part-00000
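The embedded mapper.py is not reproduced above; the following is a minimal sketch reconstructed from the description in the steps. The debug.log file logger is included as described, while the log format and exact output formatting are assumptions.

#!/usr/bin/env python
# mapper.py - Hadoop Streaming mapper (sketch): reads lines from stdin, emits "word<TAB>1" on stdout
import sys
import logging

# stdout is reserved for Hadoop, so debug statements go to a file instead
logging.basicConfig(filename='debug.log', level=logging.DEBUG)

for line in sys.stdin:
    logging.debug('processing line: %s', line.rstrip())
    for word in line.split():
        # Hadoop Streaming expects the key and value separated by a tab
        print('%s\t%s' % (word, 1))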
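Here is a sketch of reducer.py that keeps track of the current key and emits a count when the key changes, as described in the steps above; again this is a reconstruction, not the original code.

#!/usr/bin/env python
# reducer.py - Hadoop Streaming reducer (sketch): input lines are "word<TAB>count", sorted by word
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        # still on the same key, keep accumulating
        current_count += int(count)
    else:
        # key changed, emit the previous key and its total
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = int(count)

# emit the last key
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))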