- Download the version of Spark that matches your Hadoop distribution from the Spark download page. In my case I am using the Cloudera CDH4 VM image for development, so I downloaded the CDH4 build.
- I extracted spark-1.0.0-bin-cdh4.tgz into the /home/cloudera/software folder.
- The next step is to build a WordCount.py program like this. The program defines 3 methods:
  - flatMap: takes a line as input, splits it on the separator (a comma in the code below), and emits the resulting words
  - map: takes a word as input and emits a tuple in (word, 1) format
  - reduce: adds the counters for the same word together
  The line counts = distFile.flatMap(flatMap).map(map).reduceByKey(reduce) takes care of tying everything together.

      __author__ = 'cloudera'

      from pyspark import SparkContext
      import sys

      print(sys.argv)

      # Exit if fewer than 3 arguments (script name, input path, output path) are given
      if len(sys.argv) < 3:
          print("Usage: WordCount.py inputPath outputPath")
          sys.exit(1)

      sc = SparkContext("local", "WordCount")

      # Read input and output path
      inputPath = sys.argv[1]
      print('Path of input file ->' + inputPath)
      outputPath = sys.argv[2]
      print('Path of output file ->' + outputPath)

      distFile = sc.textFile(inputPath)

      # Split one line into words
      def flatMap(line):
          return line.split(",")

      # Pair each word with an initial count of 1
      def map(word):
          return (word, 1)

      # Add two counts that belong to the same word
      def reduce(a, b):
          return a + b

      counts = distFile.flatMap(flatMap).map(map).reduceByKey(reduce)
      counts.saveAsTextFile(outputPath)
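To see what the flatMap, map and reduceByKey chain computes without starting Spark, here is a minimal plain-Python sketch of the same three stages. It is only an illustration, not how Spark executes the job: the helper names (split_line, to_pair, add, local_word_count) and the sample lines are my own assumptions, and everything runs in a single process on an in-memory list.

```python
def split_line(line):
    # Same role as flatMap() in WordCount.py: one line in, many words out
    return line.split(",")


def to_pair(word):
    # Same role as map() in WordCount.py: word -> (word, 1)
    return (word, 1)


def add(a, b):
    # Same role as reduce() in WordCount.py: combine two counts
    return a + b


def local_word_count(lines):
    # flatMap stage: flatten the per-line word lists into one list
    words = [word for line in lines for word in split_line(line)]
    # map stage: turn every word into a (word, 1) pair
    pairs = [to_pair(word) for word in words]
    # reduceByKey stage: merge the values of pairs that share a key
    totals = {}
    for key, value in pairs:
        totals[key] = add(totals[key], value) if key in totals else value
    return totals


if __name__ == "__main__":
    # Hypothetical comma-separated input lines
    sample = ["spark,hadoop,spark", "hadoop,python"]
    print(local_word_count(sample))
```

Spark does the same grouping, but spreads the pairs across partitions and merges the counts per key in parallel.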
- Once WordCount.py is ready, you can execute it with spark-submit by passing the path of WordCount.py followed by the input and output paths:
./bin/spark-submit --master local[4] /home/cloudera/workspace/spark/HelloSpark/WordCount.py file:///home/cloudera/sorttext.txt file:///home/cloudera/output/wordcount
- Once the program has finished executing, you can take a look at the output by running the following command:
more /home/cloudera/output/wordcount/part-00000
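If the job runs with more than one partition, the output folder can contain several part-* files. The sketch below collects them into one dictionary. It assumes each output line is the string form of a (word, count) tuple, such as (u'spark', 2), which is what saveAsTextFile should write for this program; the read_counts helper is a name I made up for illustration.

```python
import ast
import glob
import os


def read_counts(output_dir):
    """Collect (word, count) pairs from every part-* file in output_dir."""
    counts = {}
    # Marker files such as _SUCCESS do not match the part-* pattern
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                # Assumption: each line looks like (u'word', 3) or ('word', 3)
                word, count = ast.literal_eval(line)
                counts[word] = count
    return counts


if __name__ == "__main__":
    # Output path taken from the spark-submit command above
    result = read_counts("/home/cloudera/output/wordcount")
    for word in sorted(result):
        print("%s\t%d" % (word, result[word]))
```

Using ast.literal_eval instead of eval keeps the parsing limited to plain Python literals.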
WordCount program written using the Spark framework in Python
In the WordCount (HelloWorld) MapReduce program entry I talked about how to build a simple WordCount program using MapReduce. I wanted to try developing the same program using Apache Spark, but in Python, so I followed these steps.
Comments:
Hi there,
When I use your example without the code that specifies the output file, the output is printed to the terminal. But when I add the output path, there is no output, and the terminal responds with: "Usage: wordcount ".
Can you help me with this?