- Push Model: Spark listens on a particular port for Avro events, and Flume connects to that port and publishes events
- Pull Model: You configure a special Spark Sink in Flume that buffers published data, and Spark pulls that data at a certain frequency
- First, download spark-streaming-flume-sink_2.10-1.6.0.jar and copy it to Flume's lib directory
- Next, create a Flume configuration that looks like this. As you can see, Flume listens for netcat events on port 44444, takes every event, and replicates it to both the logger sink and the Spark sink. The Spark sink listens on port 9999 for the Spark program to connect
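A minimal configuration sketch for the setup described above. The agent name (`a1`) and the source/channel/sink names are assumptions for illustration; the ports (44444 for netcat, 9999 for the Spark sink) match the description:

```properties
# Hypothetical agent "a1": one netcat source replicated to two channels
a1.sources = netcatSrc
a1.channels = memCh1 memCh2
a1.sinks = loggerSink sparkSink

# Netcat source listening on 44444, replicating events to both channels
a1.sources.netcatSrc.type = netcat
a1.sources.netcatSrc.bind = localhost
a1.sources.netcatSrc.port = 44444
a1.sources.netcatSrc.channels = memCh1 memCh2
a1.sources.netcatSrc.selector.type = replicating

a1.channels.memCh1.type = memory
a1.channels.memCh2.type = memory

# Logger sink, useful for eyeballing the events
a1.sinks.loggerSink.type = logger
a1.sinks.loggerSink.channel = memCh1

# Spark sink (class comes from the spark-streaming-flume-sink jar in flume/lib);
# it buffers events and waits on port 9999 for Spark to pull them
a1.sinks.sparkSink.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.sparkSink.hostname = localhost
a1.sinks.sparkSink.port = 9999
a1.sinks.sparkSink.channel = memCh2
```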
This is how your Spark driver will look. The Spark Flume receiver delivers events in Avro format, so you will have to call
event.getBody().array() to get the event body as bytes.
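A sketch of a pull-model driver, assuming the Spark sink above runs on localhost:9999; the app name and batch interval are illustrative choices, not from the original:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePullExample")
    val ssc = new StreamingContext(conf, Seconds(5))

    // createPollingStream connects to the SparkSink and pulls
    // buffered events at each batch interval (the pull model)
    val stream = FlumeUtils.createPollingStream(
      ssc, "localhost", 9999, StorageLevel.MEMORY_ONLY)

    // Each record wraps an Avro event; getBody() returns a
    // ByteBuffer, hence the array() call to get the raw bytes
    val lines = stream.map(e => new String(e.event.getBody.array()))
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Run it with spark-submit, making sure the spark-streaming-flume artifact matching your Spark version is on the classpath.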