I wanted to figure out how to get Spark to read a text file and split it into records on a custom delimiter instead of '\n'. These are my notes on how to do that.
Spark's input/output is built on MapReduce's InputFormat and OutputFormat. For example, when you call
SparkContext.textFile()
it actually uses
TextInputFormat
for reading the file. The advantage of this approach is that you get everything TextInputFormat does. For example, by default TextInputFormat breaks the file into records at each \n character. But sometimes you might want to split the file using some other logic. For example, I wanted to parse a book by sentences instead of by \n characters, so I looked into the TextInputFormat code and noticed that it reads the
textinputformat.record.delimiter
configuration property, which I could set to '.' so that TextInputFormat returns sentences instead of lines. This sample code shows how to do that:
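A minimal sketch of such a program in Scala, assuming Spark is on the classpath. The object name `SentenceReader`, the app name, and the local-mode fallback are placeholders chosen for illustration, not taken from the original listing:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object, for illustration only.
object SentenceReader {
  def main(args: Array[String]): Unit = {
    // Input path comes from the first argument; the fallback file name is an assumption.
    val inputPath = args.headOption.getOrElse("2city10.txt")

    val sparkConf = new SparkConf()
      .setAppName("SentenceReader")
      // Run locally unless a master was already supplied (e.g. by spark-submit).
      .setIfMissing("spark.master", "local[*]")
    val sparkContext = new SparkContext(sparkConf)

    try {
      // Tell TextInputFormat to end a record at '.' instead of '\n'.
      // This is set before textFile() is called.
      sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", ".")

      // Each element of this RDD is now the text between two '.' characters.
      val sentences = sparkContext.textFile(inputPath)
      println(s"Number of sentences: ${sentences.count()}")
    } finally sparkContext.stop()
  }
}
```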
The only change in this code is
sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter",".")
which sets a Hadoop configuration property.
When I used this code to parse
2city10.txt, I noticed that it has 16104 lines of text but 6554 sentences.
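Since hadoopConfiguration is shared by the whole SparkContext, getting both counts in one program is easier if the delimiter applies to a single read only. The sketch below is one way to do that, and it is an assumption rather than what the code above does: it passes a copy of the configuration to newAPIHadoopFile. The object name `LinesVsSentences` and the fallback path are placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object, for illustration only.
object LinesVsSentences {
  def main(args: Array[String]): Unit = {
    val inputPath = args.headOption.getOrElse("2city10.txt")
    val sparkConf = new SparkConf()
      .setAppName("LinesVsSentences")
      .setIfMissing("spark.master", "local[*]")
    val sparkContext = new SparkContext(sparkConf)

    try {
      // Default behaviour: one record per '\n'-terminated line.
      val lineCount = sparkContext.textFile(inputPath).count()

      // Copy the shared configuration so the delimiter affects this read only.
      val sentenceConf = new Configuration(sparkContext.hadoopConfiguration)
      sentenceConf.set("textinputformat.record.delimiter", ".")

      val sentences = sparkContext
        .newAPIHadoopFile(
          inputPath,
          classOf[TextInputFormat],
          classOf[LongWritable],
          classOf[Text],
          sentenceConf)
        // Hadoop reuses Text objects, so convert each value to a String right away.
        .map { case (_, text) => text.toString }

      println(s"lines: $lineCount, sentences: ${sentences.count()}")
    } finally sparkContext.stop()
  }
}
```

Two things to keep in mind when reading the counts: a record now ends at every '.', so abbreviations such as "Mr." also end a record, and because '\n' is no longer a delimiter, a sentence that spans several lines keeps its embedded newline characters.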