Sunil's Notes: July 2016

WordCount program using Spark DataFrame

I wanted to figure out how to write Word Count Program using Spark DataFrame API, so i followed these steps. Import org.apache.spark.sql.functions._, it includes UDF's that i need to use


import org.apache.spark.sql.functions._

Create a data frame by reading README.md. When you read the file, spark will create a data frame with single column value, the content of the value column would be the line in the file


val df = sqlContext.read.text("README.md")
df.show(10,truncate=false)

Next split each of the line into words using split function. This will create a new DataFrame with words column, each words column would have array of words for that line


val wordsDF = df.select(split(df("value")," ").alias("words"))
wordsDF.show(10,truncate=false)

Next use explode transformation to convert the words array into a dataframe with word column. This is equivalent of using flatMap() method on RDD


val wordDF = wordsDF.select(explode(wordsDF("words")).alias("word"))
wordsDF.show(10,truncate=false)

Now you have data frame with each line containing single word in the file. So group the data frame based on word and count the occurrence of each word


val wordCountDF = wordDF.groupBy("word").count
wordCountDF.show(truncate=false)

This is the code you need if you want to figure out 20 top most words in the file


wordCountDF.orderBy(desc("count")).show(truncate=false)

How to use built in spark UDF's

In the i talked about how to create a custom UDF in scala for spark. But before you do that always check Spark UDF's that are available with Spark already. I have this sample Spark data frame with list of users I wanted to sort the list of users in descending order of age so i used following 2 lines, first is to import functions that are available with Spark already and then i used desc function to order age in descending order


import org.apache.spark.sql.functions._
display(userDF.orderBy(desc("age")))

Now if i wanted to sort the data frame records using age in ascending order


display(userDF.orderBy(asc("age")))

This is sample of how to use the sum() function


userDF.select(sum("age")).show

How to use built in spark UDF's


import org.apache.spark.sql.functions._
display(userDF.orderBy(desc("age")))

Now if i wanted to sort the data frame records using age in ascending order


display(userDF.orderBy(asc("age")))

This is sample of how to use the sum() function


userDF.select(sum("age")).show

How to define Scala UDF in Spark

Scala UDF Sample Notebook