WordCount program written in Python using the Spark framework

In the WordCount(HelloWorld) MapReduce program entry I talked about how to build a simple WordCount program using MapReduce. I wanted to try developing the same program with Apache Spark, this time in Python, so I followed these steps:
  1. Download the version of Spark that matches your Hadoop distribution from the Spark Download page. I am using the Cloudera CDH4 VM image for development, so I downloaded the CDH4 build.
  2. I extracted spark-1.0.0-bin-cdh4.tgz into the /home/cloudera/software folder.
  3. The next step is to write a WordCount.py program. The program defines three functions:
    • flatMap: takes a line as input, splits it on spaces, and emits the resulting words
    • map: takes a word as input and emits a tuple in (word, 1) format
    • reduce: takes two counts for the same word and adds them together
    The line counts = distFile.flatMap(flatMap).map(map).reduceByKey(reduce) ties everything together: reduceByKey groups the tuples by word and applies reduce to the counts of each group.
  4. Once WordCount.py is ready, you can run it with spark-submit by passing the path of WordCount.py followed by the input and output paths
    
     ./bin/spark-submit --master local[4] /home/cloudera/workspace/spark/HelloSpark/WordCount.py \
         file:///home/cloudera/sorttext.txt file:///home/cloudera/output/wordcount
    
  5. Once the program has finished, you can look at the output by running the following command
    
    more /home/cloudera/output/wordcount/part-00000
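
The WordCount.py described in step 3 could look roughly like the sketch below. This is only an illustration under a few assumptions, not necessarily the exact program I ran: the helper names split_line, to_pair and add_counts are made up here (they stand in for flatMap, map and reduce so that Python built-ins are not shadowed), the app name "WordCount" is arbitrary, and the line is split on any whitespace rather than on a single space.

```python
import sys


def split_line(line):
    """flatMap step: split a line on whitespace and emit each word."""
    return line.split()


def to_pair(word):
    """map step: emit a (word, 1) tuple for the given word."""
    return (word, 1)


def add_counts(a, b):
    """reduce step: add two partial counts for the same word."""
    return a + b


def main(input_path, output_path):
    # Imported here so the helper functions above can be tried out
    # in a plain Python shell without a Spark installation.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")  # app name is an arbitrary choice
    dist_file = sc.textFile(input_path)
    counts = dist_file.flatMap(split_line).map(to_pair).reduceByKey(add_counts)
    counts.saveAsTextFile(output_path)
    sc.stop()


# Run the job only when both the input and the output path are supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Note that saveAsTextFile expects the output directory not to exist yet, so remove /home/cloudera/output/wordcount before running the job a second time.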
    

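Depending on how Spark partitions the input, the result may be spread across several part-* files rather than just part-00000. The small script below is one possible way to merge them back into a single dictionary. It is a sketch that assumes each output line is the printed form of a Python tuple, such as ('word', 3) or (u'word', 3); the function name read_counts is hypothetical.

```python
import ast
import glob
import os


def read_counts(output_dir):
    """Collect (word, count) pairs from every part-* file in output_dir."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as part:
            for line in part:
                line = line.strip()
                if not line:
                    continue
                # Each line is assumed to be the repr of a (word, count) tuple.
                word, count = ast.literal_eval(line)
                counts[word] = counts.get(word, 0) + count
    return counts


if __name__ == "__main__":
    # Output directory used in step 4; prints nothing if it does not exist.
    result = read_counts("/home/cloudera/output/wordcount")
    top_ten = sorted(result.items(), key=lambda kv: kv[1], reverse=True)[:10]
    for word, count in top_ten:
        print(word, count)
```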