Configuring Flume to write Avro events into HDFS

Recently I wanted to figure out how to configure Flume so that it listens for Avro events and, whenever an event arrives, dumps it into HDFS. To do that, I built this simple Flume configuration:

# hellohdfsavro.conf: A single-node Flume configuration

# Name the components on this agent
agent1.sources = avro
agent1.sinks = hdfs1
agent1.channels = memory1

# Describe/configure the source
agent1.sources.avro.type = avro
agent1.sources.avro.bind = localhost
agent1.sources.avro.port = 41414
agent1.sources.avro.selector.type = replicating
agent1.sources.avro.channels = memory1

# Describe the sink
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = /tmp/flume/events
# Roll the current file every 60 seconds
agent1.sinks.hdfs1.hdfs.rollInterval = 60
# rollSize is in bytes, not events; 0 disables size-based rolling
agent1.sinks.hdfs1.hdfs.rollSize = 0
# rollCount is the number of events written into a file before it is rolled; 0 disables it
agent1.sinks.hdfs1.hdfs.rollCount = 0
agent1.sinks.hdfs1.hdfs.batchSize = 100
# Note: serializer is set directly on the sink, without the hdfs. prefix
agent1.sinks.hdfs1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.channel = memory1

# Use a channel which buffers events in memory
agent1.channels.memory1.type = memory
agent1.channels.memory1.capacity = 1000
agent1.channels.memory1.transactionCapacity = 100
With this configuration the Avro source listens on the local machine at port 41414, and every event it receives is written to HDFS under the /tmp/flume/events directory. Once the file is saved on the local machine as hellohdfsavro.conf, I can start the Flume agent using the following command:

flume-ng agent --conf conf --conf-file hellohdfsavro.conf  --name agent1 -Dflume.root.logger=DEBUG,console
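To verify the pipeline end to end, you can push a test event into the Avro source using the avro-client mode of flume-ng and then list the sink directory. This is a minimal sketch; event.txt is a hypothetical local file whose lines become event bodies. One caveat: org.apache.flume.sink.hdfs.AvroEventSerializer expects each event to carry its Avro schema in a flume.avro.schema.literal or flume.avro.schema.url header, so for a quick plain-text smoke test you may want to temporarily comment out the serializer line and fall back to the default TEXT serializer.

flume-ng avro-client --conf conf --host localhost --port 41414 --filename event.txt

hdfs dfs -ls /tmp/flume/events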

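If you would rather send events from application code than from the command line, Flume's RPC client API talks to the same Avro source. Below is a minimal Java sketch under that assumption; the class name AvroEventSender and the sample event body are mine, and the agent above must already be running.

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class AvroEventSender {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to the Avro source declared in hellohdfsavro.conf
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Build an event whose body is an arbitrary byte payload
            Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
            // Deliver it: source -> memory channel -> HDFS sink
            client.append(event);
        } finally {
            client.close();
        }
    }
}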
2 comments:

Anonymous said...

Hi,

I am attempting to do a similar configuration from a Kafka source where the data is in Avro.
Using the Avro event serializer as you have, "org.apache.flume.sink.hdfs.AvroEventSerializer$Builder", my data is being printed to the file without the formatting, i.e. {"Name":"myName", "Surname":"mySurname"} is printed to HDFS as "myNamemySurname".

When I try to view the data in HUE I get the following message,
1. Warning: some binary data has been masked out with '&#xfffd'.
2. SEQ !org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable`9g(}�5�l�~�,�XS" P���� � myName mySurname
Is anyone aware of how I can solve this issue?

Regards,
Kirsten

Unknown said...

Hi, can you tell me how to start the Avro source? After creating the conf file, what exactly should I do? Is it enough to just run the conf file from the command prompt? Which data will get saved in HDFS? Kindly explain, it would be a great help to me. Thanks.