Monitoring HDFS directory for new files using Spark Streaming

I wanted to build this simple Spark Streaming application that monitors a particular directory in HDFS and whenever a new file shows up, i want to print its content to Console. I built this HDFSFileStream.scala. In this program after creating a SparkStreamContext. I am calling sparkStreamingContext.textFileStream(<directoryName>) on it. Once a new file appears in the directory the value of fileRDD.count() would return more than 0 and then i invoke processNewFile(). The processNewFile() method takes a RDD[String], iterates through the file content and prints it to console Next start the program by executing following code bin/spark-submit ~/HelloSparkStreaming-1.0-SNAPSHOT-jar-with-dependencies.jar /user/mapr/stream 3 Once the streaming started it starts monitoring /user/mapr/stream directory, for new content. I copied a file with few lines in it and i got the following output, which is content of the file


Basanth said...

Hello Sunil,

Thanks for the blog. I need to implement similar code for a component of mine. Had a couple of questions ? I mean

1. I was wondering what kind of load will your approach place on the hdfs namenode. Will you be constantly pinging the namenode to get the latest files on your monitored directories ? Is this going to cause any adverse performance issues on the cluster?

2. Is there a push model instead of this pull model so that your code can be notified of the existence of a new file in hdfs ; rather than querying hdfs ourselves ?

Thanks !

Pratik Shekhar said...

I really appreciate information shared above. It’s of great help. If someone want to learn Online (Virtual) instructor lead live training in Apache Spark TECHNOLOGY , kindly contact us
MaxMunus Offer World Class Virtual Instructor-led training on TECHNOLOGY. We have industry expert trainer. We provide Training Material and Software Support. MaxMunus has successfully conducted 100000+ pieces of training in India, USA, UK, Australia, Switzerland, Qatar, Saudi Arabia, Bangladesh, Bahrain and UAE etc.
For Demo Contact us.
Pratik Shekhar
Ph:(0) +91 9066268701

Jay M said...

Hi Sunil

I would like to refer HDFSFileStream.scala.
I have below requirement

I need to monitor few directories in our cluster and make report every month for below things;
Size of the directories previous month and current month
Monitor new folders which are getting added in current month.
create this report every month

Vikas Chaudhary said...

Battery Mantra is Authorized exide car battery dealer in Noida and Greater Noida. We are providing our service in Indirapuram, Delhi, Ashok Nagar.

Exide Battery Dealer in Noida
Battery Dealer in Noida
Authorized Battery Dealer in Noida
Car Battery Dealer in Noida
Car Battery Dealer
Exide Battery Dealer

Unknown said...

can i get the above code for pyspark?