WordCount MapReduce program using Hadoop Streaming and Python

I wanted to learn how to use Hadoop Streaming, which lets you write MapReduce programs in a scripting language such as Python or Ruby. The idea is that instead of writing Java classes for the Mapper and Reducer, you develop two scripts (anything that can be executed from the command line), one for the mapper and one for the reducer, and pass them to Hadoop. Hadoop communicates with the scripts through standard input and output: for both the mapper and the reducer, Hadoop passes the input on standard input and your script reads it from there. Once your script has processed the data, it writes its output to standard output, which gets sent back to Hadoop. I decided to create a word count program that takes a file as input, counts the occurrences of every word in the file and writes the counts to the output. I followed these steps:
  1. I started by creating a mapper.py file. In the mapper I read one line of input at a time, split it into words and write each word to the output in (word, 1) format. Whatever the mapper writes to standard output gets passed back to Hadoop, so I could not use standard output for debug statements. Instead I configured a file logger that writes debug.log in the current directory.
  2. Next I created a reducer.py program that reads one line at a time and splits it on the tab character. The first part of the split is the word and the second is the count. One difference between a Java reducer and a streaming reducer is that in Java your reduce method receives each key together with all of its values, like this: (key, [value1, value2, value3]), (key1, [value1, value2, value3]). In streaming, the reducer receives one key and one value per line, like this: (key, value1), (key, value2), (key, value3), (key1, value1), (key1, value2), (key1, value3), so you have to remember which key you are processing and handle the change of key yourself. In my reducer I keep track of the current key and accumulate every value that arrives for it; when the key changes, I write out the old key and its count.
  3. One good part of developing with scripts is that you can test your code without Hadoop. Once the mapper and reducer are ready, I can test them on the command line in the form data | mapper | sort | reducer. In my case the mapper and reducer files are in the /home/user/workspace/HadoopPython/streaming/ directory and I have a sample file in my home directory, so I could test my program by executing it like this:
     cat ~/sample.txt | /home/user/workspace/HadoopPython/streaming/mapper.py | sort | /home/user/workspace/HadoopPython/streaming/reducer.py
  4. After working through the bugs, I copied aesop.txt into the root of my HDFS and then used the following command to execute my MapReduce program:
     hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -input aesop.txt -output output/wordcount -mapper /home/user/workspace/HadoopPython/streaming/mapper.py -reducer /home/user/workspace/HadoopPython/streaming/reducer.py
  5. Once the program finished executing, I could see the output it generated using the following command:
     hdfs dfs -cat output/wordcount/part-00000
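
The mapper and reducer logic described in the steps above can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the original mapper.py and reducer.py: the function names map_lines and reduce_records and the sample sentences are made up for the example, words are split on whitespace only, and the file logger is left out. Both functions take any iterable of lines so the data | mapper | sort | reducer pipeline can be imitated in memory.

```python
def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_records(records):
    """Reducer: sum the counts of consecutive records that share a word.

    Assumes the records arrive sorted by word, which is what Hadoop
    Streaming (or `sort` on the command line) provides.
    """
    current_word = None
    current_count = 0
    for record in records:
        word, _, count = record.rstrip("\n").partition("\t")
        if word != current_word:
            # The key changed: write out the previous word and start over.
            if current_word is not None:
                yield f"{current_word}\t{current_count}"
            current_word = word
            current_count = 0
        current_count += int(count)
    # Write out the last word once the input is exhausted.
    if current_word is not None:
        yield f"{current_word}\t{current_count}"


# In-memory stand-in for: cat sample.txt | mapper.py | sort | reducer.py
# (hypothetical sample data, not the aesop.txt file from the post)
sample = ["the quick fox", "the lazy dog", "the fox"]
for out in reduce_records(sorted(map_lines(sample))):
    print(out)
```

In real streaming scripts, each function would live in its own executable file and be fed sys.stdin, with every yielded record printed to standard output.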
Note: My mapper and reducer code is not as compact as it could be, because I am new to Python.
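
On the point about compactness, one possible way to shorten the key-change bookkeeping in a streaming reducer is itertools.groupby from the Python standard library. This is a sketch of an alternative, not the code used in this post; the function name reduce_sorted and the sample records are hypothetical, and it assumes, as before, that the records are already sorted by word.

```python
from itertools import groupby


def reduce_sorted(records):
    """Yield (word, total) pairs from sorted 'word<TAB>count' records."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    # groupby collects consecutive pairs with the same word, which mirrors
    # tracking the current key by hand.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


# Hypothetical sorted mapper output
records = ["a\t1", "a\t1", "b\t1"]
print(list(reduce_sorted(records)))
```

The grouping only works on adjacent records, so it relies on the same sorted input that the hand-written key tracking does.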
