When you read a file in Spark with SparkContext.textFile(), it actually uses TextInputFormat for reading the file. The advantage of this approach is that you get everything that TextInputFormat does. For example, by default TextInputFormat breaks the file into records on the \n character. But sometimes you might want to split the file using some other logic. For example, I wanted to parse a book into sentences instead of lines, so I looked into the TextInputFormat code and noticed that it takes a textinputformat.record.delimiter configuration property. I set it to '.' and TextInputFormat returned sentences instead of lines. This sample code shows how to do that.
package com.spnotes.spark

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by sunilpatil on 4/18/16.
 */
object FileReader {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Please specify <filepath>")
      System.exit(-1)
    }
    val directoryPath = args(0)
    println(s"Reading data from $directoryPath")

    val sparkConf = new SparkConf().setAppName("FileReader").setMaster("local[*]")
    val sparkContext = new SparkContext(sparkConf)

    // Set custom delimiter for text input format, so records end at '.' instead of '\n'
    sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter",".")

    // Each element of this RDD is now one sentence instead of one line
    val sentences = sparkContext.textFile(directoryPath)
    println("Number of sentences " + sentences.count())
    sentences.take(10).foreach(println)
  }
}
The important line is
sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter",".")
which sets the Hadoop configuration property that TextInputFormat reads when it splits the file into records.
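To get a feel for what the records look like with this setting, here is a minimal sketch in plain Scala, no Spark needed. It only approximates the behaviour with String.split and is not the Hadoop implementation; the DelimiterSketch object and its records helper are hypothetical names made up for this illustration. The point it shows is that the delimiter itself is dropped from each record, while line breaks inside a sentence stay in the record.

```scala
import java.util.regex.Pattern

object DelimiterSketch {
  // Hypothetical helper: approximates how a record delimiter cuts text into records.
  // Pattern.quote makes "." a literal dot instead of a regex wildcard.
  def records(text: String, delimiter: String): Seq[String] =
    text.split(Pattern.quote(delimiter)).toSeq

  def main(args: Array[String]): Unit = {
    val text = "It was the best of times.\nIt was the worst\nof times. It was the age\nof wisdom."

    val byLine = records(text, "\n")      // default behaviour: one record per line
    val bySentence = records(text, ".")   // custom delimiter: one record per sentence

    println(s"Lines: ${byLine.length}, sentences: ${bySentence.length}")
    // Brackets make the leading whitespace and embedded line breaks visible
    bySentence.foreach(s => println(s"[$s]"))
  }
}
```

Because the records keep their embedded line breaks and leading spaces, you may want to clean them up (for example with trim) before further processing. Also note that a plain '.' delimiter will split on abbreviations such as "Mr." as well.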
When I ran the FileReader program on 2city10.txt, I noticed that the file has 16104 lines of text but 6554 sentences.
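One thing to keep in mind is that sparkContext.hadoopConfiguration is shared by the whole SparkContext, so any later textFile() call in the same application will also split on '.'. A possible way to limit the setting to a single read is to pass a copy of the configuration to SparkContext.newAPIHadoopFile(). This is only a sketch under that assumption, not what the program above does; the ScopedDelimiterReader name is made up for the example.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical variant of FileReader that scopes the delimiter to one read
object ScopedDelimiterReader {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Please specify <filepath>")
      System.exit(-1)
    }
    val path = args(0)

    val sparkConf = new SparkConf().setAppName("ScopedDelimiterReader").setMaster("local[*]")
    val sparkContext = new SparkContext(sparkConf)

    // Copy the shared Hadoop configuration so the change stays local to this read
    val conf = new Configuration(sparkContext.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", ".")

    // Keys are byte offsets into the file, values are the records
    val sentences = sparkContext
      .newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map { case (_, text) => text.toString } // copy to String, Hadoop reuses Text objects

    println("Number of sentences " + sentences.count())
    sentences.take(10).foreach(println)

    sparkContext.stop()
  }
}
```

With this version, other textFile() calls on the same SparkContext still read the file line by line.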