WordCount program using Spark DataFrame

I wanted to figure out how to write Word Count Program using Spark DataFrame API, so i followed these steps. Import org.apache.spark.sql.functions._, it includes UDF's that i need to use import org.apache.spark.sql.functions._ Create a data frame by reading README.md. When you read the file, spark will create a data frame with single column value, the content of the value column would be the line in the file

val df = sqlContext.read.text("README.md")
df.show(10,truncate=false)
Next split each of the line into words using split function. This will create a new DataFrame with words column, each words column would have array of words for that line

val wordsDF = df.select(split(df("value")," ").alias("words"))
wordsDF.show(10,truncate=false)
Next use explode transformation to convert the words array into a dataframe with word column. This is equivalent of using flatMap() method on RDD

val wordDF = wordsDF.select(explode(wordsDF("words")).alias("word"))
wordsDF.show(10,truncate=false)
Now you have data frame with each line containing single word in the file. So group the data frame based on word and count the occurrence of each word

val wordCountDF = wordDF.groupBy("word").count
wordCountDF.show(truncate=false)
This is the code you need if you want to figure out 20 top most words in the file

wordCountDF.orderBy(desc("count")).show(truncate=false)

4 comments:

App Development Company said...

Word count program using spark Data Frame has explained in a very convenient way so that every visitor will easily understand.

Sathya G said...

Thank you for sharing such a nice and interesting blog with us. i have seen that all will say the same thing repeatedly. But in your blog, I had a chance to get some useful and unique information. I would like to suggest your blog in my dude circle. please keep on updates. hope it might be much useful for us. keep on updating...
Software Testing Training

Sathya G said...

All are saying the same thing repeatedly, but in your blog I had a chance to get some useful and unique information, I love your writing style very much, I would like to suggest your blog in my dude circle, so keep on updates. Thanks for sharing and keep update more info for us. eagarly waiting for you tech service.
Software Testing Training

Amit Shrivastava said...

Let's say after explode

you had data like

word - Count
Module, 1
Module 2
Module:3
Module- 1

So though word here is only module, you are counting without stripping special characters. In this case this solution doesn't seems complete no?