Sunil's Notes: WordCount MapReduce program using Pydoop (MapReduce framework for Python)

In the "WordCount MapReduce program using Hadoop streaming and python" entry I used Hadoop Streaming to create a MapReduce program, but that program is quite verbose and the approach has limitations (for example, features such as counters are not convenient to use). So I decided to develop the same program using Pydoop, a framework that makes it easier to develop Hadoop MapReduce programs in Python and to work with HDFS. I followed these steps:

First, I followed the instructions on the Pydoop installation page to install Pydoop on my machine. I ran into some issues during that process, but eventually had Pydoop installed.
Next, I created a HelloPydoop.py file that contains a mapper function and a reducer function. The mapper function receives a line number and one line at a time; in that function I break the line into words and write each word to the output with writer.emit(). The reducer function receives a word and its counts in (key, [value, value1]) format. This is different from Hadoop Streaming, where I had to detect the change of key myself, so this code is much more compact.
Once my HelloPydoop.py file was ready, I could invoke it by passing it to pydoop script. In the command below, aesop.txt is the name of the input file in HDFS, and output/pydoop is the HDFS directory where I want the output to be generated:
    pydoop script /home/user/PycharmProjects/HelloWorld/Pydoop/HelloPydoop.py aesop.txt output/pydoop
After the MapReduce job has finished executing, I can look at its output by running the hdfs dfs -cat output/pydoop/part-00000 command.
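
For illustration, here is a minimal sketch of what a HelloPydoop.py along these lines might look like. It assumes the pydoop script convention of a mapper(key, value, writer) function and a reducer(key, values, writer) function; the ListWriter class and the run_locally() helper are hypothetical additions that only imitate the writer and the shuffle step so the logic can be tried without a cluster. They are not part of Pydoop.

```python
from collections import defaultdict


def mapper(key, line, writer):
    # key identifies the position of the line in the input; it is not used.
    for word in line.split():
        writer.emit(word, 1)


def reducer(word, counts, writer):
    # counts holds every value emitted for this word, already grouped,
    # so there is no need to watch for a change of key.
    writer.emit(word, sum(int(c) for c in counts))


class ListWriter:
    """Hypothetical stand-in for the writer, for local testing only."""

    def __init__(self):
        self.records = []

    def emit(self, key, value):
        self.records.append((key, value))


def run_locally(lines):
    """Hypothetical helper: imitate map, group-by-key, and reduce in-process."""
    mapped = ListWriter()
    for position, line in enumerate(lines):
        mapper(position, line, mapped)

    grouped = defaultdict(list)
    for word, count in mapped.records:
        grouped[word].append(count)

    reduced = ListWriter()
    for word in sorted(grouped):
        reducer(word, grouped[word], reduced)
    return reduced.records


if __name__ == "__main__":
    for word, total in run_locally(["the fox and the crow", "the crow"]):
        print(word, total)
```

Only mapper and reducer would matter to pydoop script; the rest exists so the file can also be run directly with python as a quick check before submitting the job.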
