Using Apache Oozie for automating streaming map-reduce job

In the WordCount MapReduce program using Hadoop streaming and python entry I talked about how to create a streaming MapReduce job using Python. I wanted to figure out how to automate that program using an Oozie workflow, so I followed these steps

- The first step was to create a folder called streaming on my local machine and copy mapper.py and reducer.py into it. I also created placeholders for job.properties and workflow.xml.
- Next I created a job.properties file (a sketch of it is shown after these steps). This job.properties is quite similar to the job.properties for a Java MapReduce job; the only difference is that you must set
oozie.use.system.libpath=true
By default the streaming-related jars are not included in the classpath, so unless you set that value to true you will get the following error:
2014-07-23 06:15:13,170 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1649)
    at org.apache.hadoop.mapred.JobConf.getMapRunnerClass(JobConf.java:1010)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:413)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1617)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1641)
    ... 8 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1523)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1615)
    ... 9 more
2014-07-23 06:15:13,175 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
- The next step in the process is to create a workflow.xml file (a sketch is shown after these steps). Make sure to add a
<file>mapper.py#mapper.py</file>
element for each script in the workflow.xml; it takes care of shipping mapper.py and reducer.py to the task nodes (via the distributed cache) and creating symbolic links to these two files.
- Upload the streaming folder with all your changes to HDFS by executing the following command
hdfs dfs -put streaming streaming
- You can trigger the Oozie workflow by executing the following command
oozie job -oozie http://localhost:11000/oozie -config streaming/job.properties -run
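The gist that originally showed the job.properties is not reproduced above, so here is a minimal sketch of what it might look like. The NameNode/JobTracker addresses and the application path are assumptions you would adjust for your own cluster; the key line is oozie.use.system.libpath=true.

nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
queueName=default

# streaming jars are not on the classpath unless the system libpath is enabled
oozie.use.system.libpath=true

# assumes the streaming folder was uploaded to the user's home directory in HDFS
oozie.wf.application.path=${nameNode}/user/${user.name}/streaming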
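The workflow.xml referenced in the steps is also not reproduced above; the following is a rough sketch of a streaming action. The node names, input/output paths, and the "python mapper.py" invocation are illustrative assumptions rather than the exact file from the original post, while the <streaming> and <file> elements are the parts the steps above depend on.

<workflow-app xmlns="uri:oozie:workflow:0.4" name="streaming-wordcount-wf">
    <start to="streaming-node"/>
    <action name="streaming-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <!-- clean up the output directory so reruns do not fail -->
                <delete path="${nameNode}/user/${wf:user()}/output/wordcount"/>
            </prepare>
            <streaming>
                <!-- assumes python is on the path of the task nodes -->
                <mapper>python mapper.py</mapper>
                <reducer>python reducer.py</reducer>
            </streaming>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/${wf:user()}/aesop.txt</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/${wf:user()}/output/wordcount</value>
                </property>
            </configuration>
            <!-- ship the scripts with the job and create symlinks named mapper.py / reducer.py -->
            <file>mapper.py#mapper.py</file>
            <file>reducer.py#reducer.py</file>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <fail name="fail">
        <message>Streaming workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </fail>
    <end name="end"/>
</workflow-app>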
WordCount MapReduce program using Pydoop (MapReduce framework for Python)
In the WordCount MapReduce program using Hadoop streaming and python entry I used Hadoop Streaming for creating a MapReduce program, but that program is quite verbose and has limitations, such as not being able to use counters, etc.
So I decided to develop the same program using Pydoop, which is a framework that makes developing Hadoop MapReduce programs, as well as working with HDFS, easier. I followed these steps
- First I followed the instructions on the pydoop installation page to install Pydoop on my machine. I ran into some issues during that process but eventually got Pydoop installed.
- Next I created a HelloPydoop.py file which contains a mapper function and a reducer function (sketched after these steps). The mapper function gets a line number and a line at a time; in that function I take care of breaking the line into words and then writing them to the output with writer.emit(). The reducer method gets the word and its counts in (key, [value, value1]) format, which is different from Hadoop Streaming, where I have to take care of detecting the change in key myself, so this code is much more compact.
- Once my HelloPydoop.py file is ready I can invoke it by passing it to the pydoop script command. Here aesop.txt is the name of the input file in HDFS, and I want the output to get generated in the output/pydoop directory in HDFS.
pydoop script /home/user/PycharmProjects/HelloWorld/Pydoop/HelloPydoop.py aesop.txt output/pydoop
- After the MapReduce job is done executing I can look at its output by executing the following command
hdfs dfs -cat output/pydoop/part-00000
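The embedded HelloPydoop.py is not reproduced above; the following is a minimal sketch of what it could look like for the pydoop script runner, reconstructed from the description in the steps (the function bodies are assumptions, not the original code).

# HelloPydoop.py - word count for the "pydoop script" runner (sketch)

def mapper(line_number, line, writer):
    # pydoop script calls this once per input line; the key is the position of the line
    for word in line.split():
        writer.emit(word, 1)

def reducer(word, icounts, writer):
    # icounts is an iterable holding every count emitted for this word
    writer.emit(word, sum(map(int, icounts)))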
WordCount MapReduce program using Hadoop streaming and python
I wanted to learn how to use Hadoop Streaming, which allows us to use scripting languages such as Python, Ruby, etc. for developing MapReduce programs. The idea is that instead of writing Java classes for the Mapper and Reducer, you develop two script files (something that can be executed from the command line), one for the mapper and the other for the reducer, and pass them to Hadoop. Hadoop communicates with the script files using standard input/output: for both the mapper and the reducer, Hadoop passes input on standard input and your script reads it from there. Once your script is done processing the data, in either the mapper or the reducer, it writes its output to standard output, and that output gets sent back to Hadoop.
I decided to create a Word Count program that takes a file as input, counts the occurrence of every word in the file, and writes the counts to the output. I followed these steps
- I started by creating a mapper.py file (sketched after these steps). In the mapper I read one line from the input at a time, split it into words, and write each one to the output in (word, 1) format. Whatever the mapper writes to standard output gets passed back to Hadoop, so I could not use standard output for writing debug statements. Instead I configured a file logger that generates debug.log in the current directory.
- Next I created a reducer.py program (also sketched after these steps) that reads one line at a time and splits it on the tab character; in the split, the first part is the word and the second is the count. One difference between a Java reducer and a streaming reducer is that in Java your reduce method gets input like this
(key, [value1, value2, value3]), (key1, [value1, value2, value3])
In streaming it gets called with one key and value at a time, like this: (key, value1), (key, value2), (key, value3), (key1, value1), (key1, value2), (key1, value3), so you have to remember which key you are processing and handle the change in key yourself. In my reducer I keep track of the current key, and for every value of the current key I keep accumulating the count; when the key changes I use that opportunity to emit the old key and its count.
- One good part of developing with a scripting language is that you can test your code without Hadoop as well. Once my mapper and reducer are ready I can test them on the command line using the
data | mapper | sort | reducer
pattern. In my case the mapper and reducer files are in the /home/user/workspace/HadoopPython/streaming/ directory, and I have a sample file in my home directory, so I could test my program by executing it like this
cat ~/sample.txt | /home/user/workspace/HadoopPython/streaming/mapper.py | sort | /home/user/workspace/HadoopPython/streaming/reducer.py
- After working through the bugs I copied aesop.txt into the root of my HDFS and then I could use the following command to execute my MapReduce program.
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -input aesop.txt -output output/wordcount -mapper /home/user/workspace/HadoopPython/streaming/mapper.py -reducer /home/user/workspace/HadoopPython/streaming/reducer.py
- Once the program is done executing I could see the output generated by it using the following command
hdfs dfs -cat output/wordcount/part-00000
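The embedded mapper.py is not reproduced above; the following is a minimal sketch reconstructed from the description in the steps. The debug.log file logger is included as described, while the log format and exact output formatting are assumptions.

#!/usr/bin/env python
# mapper.py - Hadoop Streaming mapper (sketch): reads lines from stdin, emits "word<TAB>1" on stdout
import sys
import logging

# stdout is reserved for Hadoop, so debug statements go to a file instead
logging.basicConfig(filename='debug.log', level=logging.DEBUG)

for line in sys.stdin:
    logging.debug('processing line: %s', line.rstrip())
    for word in line.split():
        # Hadoop Streaming expects the key and value separated by a tab
        print('%s\t%s' % (word, 1))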
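Here is a sketch of reducer.py that keeps track of the current key and emits a count when the key changes, as described in the steps above; again this is a reconstruction, not the original code.

#!/usr/bin/env python
# reducer.py - Hadoop Streaming reducer (sketch): input lines are "word<TAB>count", sorted by word
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        # still on the same key, keep accumulating
        current_count += int(count)
    else:
        # key changed, emit the previous key and its total
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = int(count)

# emit the last key
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))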