WordCount program written using the Spark framework in Python

In the WordCount(HelloWorld) MapReduce program entry, I talked about how to build a simple WordCount program using MapReduce. I wanted to try developing the same program using Apache Spark, but in Python, so I followed these steps:
  1. Download the version of Spark that is appropriate for your Hadoop distribution from the Spark Download page. In my case I am using the Cloudera CDH4 VM image for development, so I downloaded the CDH4 version
  2. I extracted spark-1.0.0-bin-cdh4.tgz into the /home/cloudera/software folder
  3. The next step is to build a WordCount.py program. The program defines 3 functions
    • flatMap: takes a line as input, splits it on spaces, and emits the resulting words
    • map: takes a word as input and emits a tuple in (word, 1) format
    • reduce: adds two counts together, so that reduceByKey can sum all the counts for each word
    The line counts = distFile.flatMap(flatMap).map(map).reduceByKey(reduce) ties everything together
  4. Once WordCount.py is ready, you can execute it with spark-submit, passing the path of WordCount.py followed by the input and output paths
    
     ./bin/spark-submit --master local[4] /home/cloudera/workspace/spark/HelloSpark/WordCount.py \
    file:///home/cloudera/sorttext.txt file:///home/cloudera/output/wordcount
    
  5. Once the program has finished executing, you can look at the output by running the following command
    
    more /home/cloudera/output/wordcount/part-00000
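
The WordCount.py script described in step 3 is not reproduced in this post, so here is a minimal sketch of what it could look like. It assumes the three function names from step 3 (flatMap, map, reduce) and the two command-line arguments from step 4. The application name, the run helper, and the usage message text are my own placeholders, not taken from the original script.

```python
import sys


def flatMap(line):
    # Split a line on single spaces and return the resulting words.
    return line.split(" ")


def map(word):
    # Pair each word with an initial count of 1.
    # (The name mirrors the post; it shadows the builtin map in this module.)
    return (word, 1)


def reduce(a, b):
    # Combine two partial counts for the same word.
    return a + b


def run(input_path, output_path):
    # Imported here so the helpers above can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")  # app name is an assumption
    distFile = sc.textFile(input_path)
    counts = distFile.flatMap(flatMap).map(map).reduceByKey(reduce)
    counts.saveAsTextFile(output_path)
    sc.stop()


if __name__ == "__main__":
    if len(sys.argv) == 3:
        run(sys.argv[1], sys.argv[2])
    else:
        # Hypothetical usage text; the original message is not shown in the post.
        sys.stderr.write("Usage: WordCount.py <input> <output>\n")
```

Note that saveAsTextFile refuses to write into a directory that already exists, so the output path from step 4 has to be removed before the job is run a second time.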
    

5 comments:

  1. Hi There,

    When I use your example without the code specifying the output file, the output can be printed into terminal. But when I added the output address, there is no output, and terminal has a response: "Usage: wordcount ".

    Can you help me with this?

  2. This article is so useful for users. Thanks for sharing this news with us!

  3. I wish more authors of this type of content would take the time you did to research and write so well. I am very impressed with your vision and insight.
