I wanted to figure out how to write a word count program using the Spark DataFrame API, so I followed these steps.
Import org.apache.spark.sql.functions._, which provides the built-in column functions (split, explode, desc) that I need to use:
import org.apache.spark.sql.functions._
Create a DataFrame by reading README.md. When you read the file, Spark creates a DataFrame with a single column named value; each row of that column holds one line of the file.
val df = sqlContext.read.text("README.md")
df.show(10,truncate=false)
Next, split each line into words using the split function. This creates a new DataFrame with a words column; each row holds the array of words for that line.
val wordsDF = df.select(split(df("value")," ").alias("words"))
wordsDF.show(10,truncate=false)
Next, use the explode function to convert the words array into a DataFrame with a word column, one row per word. This is the equivalent of using the flatMap() method on an RDD.
val wordDF = wordsDF.select(explode(wordsDF("words")).alias("word"))
wordDF.show(10,truncate=false)
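For comparison, the RDD version of this step might look like the sketch below. It assumes you are in the spark-shell, where `sc` (the SparkContext) is predefined; the names `linesRDD` and `wordsRDD` are only illustrative.

```scala
// Assumption: running in the spark-shell, so `sc` already exists.
val linesRDD = sc.textFile("README.md")

// flatMap splits each line into words and flattens the per-line arrays
// into a single RDD with one element per word, like split + explode above.
val wordsRDD = linesRDD.flatMap(line => line.split(" "))

wordsRDD.take(10).foreach(println)
```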
Now you have a DataFrame in which each row contains a single word from the file. Group the DataFrame by word and count the occurrences of each word.
val wordCountDF = wordDF.groupBy("word").count
wordCountDF.show(truncate=false)
This is the code you need to find the 20 most frequent words in the file; show() displays the first 20 rows by default.
wordCountDF.orderBy(desc("count")).show(truncate=false)
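If you want the top 20 rows as a DataFrame of their own rather than only printed, one option is to add limit. This is a small sketch that continues from wordCountDF above; `top20DF` is just an illustrative name.

```scala
import org.apache.spark.sql.functions._

// Continues from wordCountDF in the post.
// limit(20) keeps only the first 20 rows after sorting by count, descending.
val top20DF = wordCountDF.orderBy(desc("count")).limit(20)

top20DF.show(20, truncate = false)
```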
The word count program using a Spark DataFrame is explained in a very convenient way, so every visitor can easily understand it.
Let's say that after explode you had data like this:
word - count
Module, 1
Module 2
Module: 3
Module- 1
So although the word here is only "Module", you are counting without stripping special characters. In that case this solution doesn't seem complete, no?
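One way to handle this is to normalize each word before grouping. The sketch below is not part of the original post: it continues from wordDF, lower-cases every word, and strips anything that is not an ASCII letter or digit. The names `cleanWordDF` and `cleanCountDF` and the choice of pattern are assumptions for illustration. Note that the pattern also drops non-ASCII letters and joins hyphenated words ("built-in" becomes "builtin"), so adjust it to your data.

```scala
import org.apache.spark.sql.functions._

// Continues from wordDF in the post (one row per word).
// lower + regexp_replace turn "Module,", "Module:" and "Module-" into "module".
val cleanWordDF = wordDF
  .select(regexp_replace(lower(wordDF("word")), "[^a-z0-9]", "").alias("word"))
  // Drop tokens that are empty after cleaning (pure punctuation,
  // or empty strings produced by splitting on consecutive spaces).
  .filter(length(col("word")) > 0)

val cleanCountDF = cleanWordDF.groupBy("word").count
cleanCountDF.orderBy(desc("count")).show(truncate = false)
```

With the sample data from the comment, all four variants would be counted together under the single word "module".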