WordCount program using Spark DataFrame

I wanted to figure out how to write a word count program using the Spark DataFrame API, so I followed these steps. First, import org.apache.spark.sql.functions._, which provides the built-in functions (split, explode, desc) used below

import org.apache.spark.sql.functions._
Create a DataFrame by reading README.md. When you read a text file, Spark creates a DataFrame with a single column named value, and each row of that column holds one line of the file

val df = sqlContext.read.text("README.md")
df.show(10,truncate=false)
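The sqlContext above is the one the Spark shell provides. As a rough sketch, assuming Spark 2.0 or later and a standalone program rather than the shell, the same read could go through a SparkSession. The object name `ReadLines`, the helper `readLines`, the app name, and the `local[*]` master are all illustrative choices of mine, not part of the steps above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadLines {
  // Hypothetical helper: reads a text file into a DataFrame
  // with a single string column named "value", one row per line.
  def readLines(spark: SparkSession, path: String): DataFrame =
    spark.read.text(path)

  def main(args: Array[String]): Unit = {
    // Assumption: running locally, outside spark-shell.
    val spark = SparkSession.builder()
      .appName("WordCountDF")
      .master("local[*]")
      .getOrCreate()

    val df = readLines(spark, "README.md")
    df.printSchema()
    df.show(10, truncate = false)

    spark.stop()
  }
}
```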
Next, split each line into words using the split function. This creates a new DataFrame with a words column, where each row holds the array of words for that line

val wordsDF = df.select(split(df("value")," ").alias("words"))
wordsDF.show(10,truncate=false)
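One thing to keep in mind is that the second argument of split is a regular expression, so splitting on a single space yields empty strings wherever a line has two spaces in a row. Here is a minimal sketch of splitting on any run of whitespace instead. The object name `SplitOnWhitespace`, the helper `splitWords`, and the in-memory sample lines are stand-ins I made up for illustration, and it assumes Spark 2.0 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, split, trim}

object SplitOnWhitespace {
  // Hypothetical helper: expects a string column named "value".
  // trim() removes leading and trailing spaces, which would otherwise produce empty tokens.
  // "\\s+" is a regex matching one or more whitespace characters.
  def splitWords(lines: DataFrame): DataFrame =
    lines.select(split(trim(col("value")), "\\s+").alias("words"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SplitOnWhitespace")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the lines of README.md; note the double space and the tab.
    val df = Seq("Apache  Spark", "fast\tengine").toDF("value")
    splitWords(df).show(truncate = false)

    spark.stop()
  }
}
```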
Next, use the explode function to turn each element of the words array into its own row, producing a DataFrame with a single word column. This is the equivalent of calling flatMap() on an RDD

val wordDF = wordsDF.select(explode(wordsDF("words")).alias("word"))
wordDF.show(10,truncate=false)
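To make the flatMap comparison concrete, here is a small sketch of the same step written against the typed Dataset API, assuming Spark 2.0 or later. The object name `FlatMapWords` and the helper `toWords` are hypothetical names for illustration, and the sample lines stand in for the file contents.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object FlatMapWords {
  // Hypothetical helper: the Dataset counterpart of split + explode.
  // Each input line is mapped to zero or more words, which are flattened into rows.
  def toWords(spark: SparkSession, lines: Dataset[String]): Dataset[String] = {
    import spark.implicits._ // provides the Encoder[String] that flatMap needs
    lines.flatMap(line => line.split(" "))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FlatMapWords")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for spark.read.text("README.md").as[String]
    val lines = Seq("Apache Spark", "is a fast engine").toDS()
    toWords(spark, lines).show(10, truncate = false)

    spark.stop()
  }
}
```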
Now you have a DataFrame in which each row contains a single word from the file. Group it by word and count the occurrences of each word

val wordCountDF = wordDF.groupBy("word").count
wordCountDF.show(truncate=false)
This is the code you need to find the 20 most frequent words in the file. show() displays the first 20 rows by default

wordCountDF.orderBy(desc("count")).show(truncate=false)
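Splitting on a single space treats tokens such as `Module,` and `Module:` as different words. The following is only a sketch of one way to normalize the text before counting, assuming Spark 2.0 or later and that lowercasing and keeping only ASCII letters and digits is acceptable for your data. The object name `NormalizedWordCount` and the helper `topWords` are hypothetical, and the sample lines are made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc, explode, lower, regexp_replace, split}

object NormalizedWordCount {
  // Hypothetical helper: expects a string column named "value" and returns
  // the n most frequent words after lowercasing and stripping punctuation.
  def topWords(lines: DataFrame, n: Int): DataFrame =
    lines
      // Replace anything that is not a-z, 0-9, or whitespace with a space (ASCII-only assumption).
      .select(regexp_replace(lower(col("value")), "[^a-z0-9\\s]", " ").alias("clean"))
      .select(explode(split(col("clean"), "\\s+")).alias("word"))
      // Drop the empty tokens left by leading or trailing whitespace.
      .filter(col("word") =!= "")
      .groupBy("word")
      .count()
      .orderBy(desc("count"))
      .limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NormalizedWordCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for spark.read.text("README.md")
    val df = Seq("Module, one", "Module: two", "module- three").toDF("value")
    topWords(df, 20).show(truncate = false)

    spark.stop()
  }
}
```

Using limit(n) makes the cutoff explicit rather than relying on the default row count of show().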

12 comments:

  1. The word count program using Spark DataFrame has been explained in a very convenient way, so every visitor will easily understand it.

  2. Let's say after explode

    you had data like

    word - Count
    Module, 1
    Module 2
    Module:3
    Module- 1

    So although the word here is only Module, you are counting without stripping special characters. In this case, the solution doesn't seem complete, no?
