WordCount MapReduce program using Hadoop Streaming and Python

I wanted to learn how to use Hadoop Streaming, which lets you write MapReduce programs in a scripting language such as Python or Ruby. Instead of writing Java classes for the Mapper and Reducer, you develop two scripts (anything that can be executed from the command line), one for the mapper and one for the reducer, and pass them to Hadoop. Hadoop communicates with the scripts through standard input and output: for both the mapper and the reducer, Hadoop writes the input records to the script's standard input, and whatever the script writes to its standard output is sent back to Hadoop. I decided to create a Word Count program that takes a file as input, counts the occurrences of every word in it, and writes the counts as output. I followed these steps:
  1. I started by creating a mapper.py file. The mapper reads one line of input at a time, splits it into words, and writes each word to the output in (word, 1) format, that is, the word, a tab character, and the number 1. Everything the mapper writes to standard output is passed back to Hadoop, so I could not use standard output for debug statements. Instead I configured a file logger that writes debug.log in the current directory.
  2. Next I created a reducer.py program that reads one line at a time and splits it on the tab character. The first part of the split is the word and the second is the count. There is one difference between a Java reducer and a streaming reducer. In Java, your reduce method is called once per key with all of its values, like this: (key1, [value1, value2, value3]), (key2, [value1, value2, value3]). In streaming, the reducer receives one key and one value per line, like this: (key1, value1), (key1, value2), (key1, value3), (key2, value1), (key2, value2), (key2, value3). The lines arrive sorted by key, so all values for a key are adjacent, but you have to remember which key you're processing and handle the change of key yourself. In my reducer I keep track of the current key and accumulate every value that arrives for it. When the key changes, I write out the old key and its count and start over with the new key. The last key has to be written out when the input ends, because no key change follows it.
  3. One good part of developing with scripts is that you can test your code without Hadoop. Once the mapper and reducer are ready, I can test them on the command line using the data | mapper | sort | reducer pattern, where sort stands in for the sorting Hadoop does between the map and reduce phases. For the scripts to run directly like this, they need a shebang line and execute permission. In my case the mapper and reducer files are in the /home/user/workspace/HadoopPython/streaming/ directory and I have a sample file in my home directory, so I could test my program like this:
         cat ~/sample.txt | /home/user/workspace/HadoopPython/streaming/mapper.py | sort | /home/user/workspace/HadoopPython/streaming/reducer.py
  4. After working through the bugs, I copied aesop.txt into my HDFS home directory (the relative path aesop.txt in the command resolves there) and then ran my MapReduce program with the following command:
         hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -input aesop.txt -output output/wordcount -mapper /home/user/workspace/HadoopPython/streaming/mapper.py -reducer /home/user/workspace/HadoopPython/streaming/reducer.py
  5. Once the program finished executing, I could see the output it generated with the following command:
         hdfs dfs -cat output/wordcount/part-00000
Note: My mapper and reducer code is not as compact as it could be, because I am new to Python.
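
The mapper and reducer described in steps 1 and 2 can be sketched roughly as follows. This is a minimal sketch, not the exact scripts from this post. It assumes Python 3 and, for brevity, folds both roles into one hypothetical file (call it wordcount_sketch.py) selected by a command-line argument, whereas the steps above use two separate files. The function names map_lines and reduce_lines are made up for illustration.

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: a sketch, not the post's original scripts."""
import logging
import sys


def map_lines(lines):
    """Mapper: yield 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    Assumes the input is sorted by key, as Hadoop (or `sort`) provides it.
    """
    current_word = None
    current_count = 0
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep or not count.strip().isdigit():
            logging.debug("skipping malformed line: %r", line)
            continue
        if word != current_word:
            # Key changed: emit the finished key before starting the new one.
            if current_word is not None:
                yield f"{current_word}\t{current_count}"
            current_word = word
            current_count = 0
        current_count += int(count)
    # Emit the last key, which no later key change will flush.
    if current_word is not None:
        yield f"{current_word}\t{current_count}"


def main(argv):
    role = argv[1] if len(argv) > 1 else "demo"
    if role in ("map", "reduce"):
        # Debug output goes to a file because stdout belongs to Hadoop.
        logging.basicConfig(filename="debug.log", level=logging.DEBUG)
        step = map_lines if role == "map" else reduce_lines
        for out in step(sys.stdin):
            print(out)
    else:
        # No Hadoop needed: imitate `mapper | sort | reducer` in memory.
        for out in reduce_lines(sorted(map_lines(["the cat", "the dog"]))):
            print(out)


if __name__ == "__main__":
    main(sys.argv)
```

Run with the argument map or reduce, the sketch reads standard input and writes tab-separated lines to standard output, so it fits the same data | mapper | sort | reducer test as step 3. Run with no argument, it performs a small in-memory version of that pipeline.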

