- First, I followed the instructions on the Pydoop installation page to install Pydoop on my machine. I ran into some issues during that process, but eventually I had Pydoop installed.
- Next, I created a HelloPydoop.py file that contains a mapper function and a reducer function. The mapper function receives a line number and a line at a time; in that function I split the line into words and write each word to the output with writer.emit(). The reducer function receives each word together with its counts in (key, [value, value1, ...]) form. This differs from Hadoop Streaming, where I had to detect the change of key myself, so this code is much more compact.
- Once my HelloPydoop.py file was ready, I could run it by passing it to pydoop script. In the command below, aesop.txt is the name of the input file in HDFS, and the output is generated in the output/pydoop directory in HDFS.
pydoop script /home/user/PycharmProjects/HelloWorld/Pydoop/HelloPydoop.py aesop.txt output/pydoop
- After the MapReduce job finishes, I can look at its output by running this command:
hdfs dfs -cat output/pydoop/part-00000
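The HelloPydoop.py file described in the steps above can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact file I ran: the mapper(key, value, writer) and reducer(key, values, writer) signatures are my reading of the Pydoop Script documentation, and the ListWriter class and run_locally function are hypothetical helpers (not part of Pydoop) that only exist so the logic can be checked without a Hadoop cluster.

```python
# HelloPydoop.py: a word-count sketch for `pydoop script`.
# Assumption: the mapper/reducer signatures below follow the
# Pydoop Script documentation; verify against your Pydoop version.


def mapper(_, line, writer):
    # The key (the line's position in the input) is ignored.
    # Each word on the line is emitted with a count of "1".
    for word in line.split():
        writer.emit(word, "1")


def reducer(word, counts, writer):
    # counts iterates over every value emitted for this word,
    # so there is no need to watch for a change of key.
    writer.emit(word, sum(int(c) for c in counts))


class ListWriter:
    """Hypothetical stand-in for Pydoop's writer, for local testing only."""

    def __init__(self):
        self.records = []

    def emit(self, key, value):
        self.records.append((key, value))


def run_locally(lines):
    """Simulate map, shuffle and reduce in memory (not part of Pydoop)."""
    map_out = ListWriter()
    for number, line in enumerate(lines):
        mapper(number, line, map_out)

    # Group the mapper output by key, as the shuffle phase would.
    grouped = {}
    for word, count in map_out.records:
        grouped.setdefault(word, []).append(count)

    reduce_out = ListWriter()
    for word in sorted(grouped):
        reducer(word, grouped[word], reduce_out)
    return reduce_out.records
```

Calling run_locally(["the fox and the crow", "the crow"]) exercises both functions on a laptop. When the file is passed to pydoop script, only mapper and reducer should matter; the two helpers are just there for a quick sanity check.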
WordCount MapReduce program using Pydoop (MapReduce framework for Python)
In the "WordCount MapReduce program using Hadoop streaming and python" entry, I used Hadoop Streaming to create a MapReduce program, but that program is quite verbose and the approach has limitations, such as having no direct API for counters.
So I decided to develop the same program using Pydoop, a framework that makes it easier to develop Hadoop MapReduce programs and to work with HDFS. I followed these steps: