MapReduce program that reads input files from S3 and writes output to S3

In the WordCount (HelloWorld) MapReduce entry I talked about how to create a simple WordCount MapReduce program with Hadoop. I wanted to change it so that it reads its input files from an Amazon S3 bucket and writes its output back to an Amazon S3 bucket, so I built the S3MapReduce program, which you can download from here. I followed these steps:
  1. First, create two buckets in your Amazon S3 account, one for storing the input and the other for storing the output. The most important thing here is to make sure you create the buckets in the US Standard region; if you don't, additional steps might be required before Hadoop can access them. In my case the input bucket is named com.spnotes.hadoop.wordcount.books and the output bucket is named com.spnotes.hadoop.wordcount.output.
  2. Upload a few .txt files that you want to use as input into your input bucket. You can do this through the S3 console, or programmatically as in the sketch below.
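    This is a minimal sketch of creating the buckets and uploading an input file with the AWS SDK for Java (v1); the credentials, file name, and object key are placeholders you would replace with your own values.

    import java.io.File;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;

    public class S3Setup {
        public static void main(String[] args) {
            // Replace these with your own AWS credentials
            AmazonS3 s3 = new AmazonS3Client(
                    new BasicAWSCredentials("awsaccesskey", "awssecretaccesskey"));

            // US Standard is the default region, so no location constraint is needed
            s3.createBucket("com.spnotes.hadoop.wordcount.books");
            s3.createBucket("com.spnotes.hadoop.wordcount.output");

            // Upload a sample input file into the input bucket
            s3.putObject("com.spnotes.hadoop.wordcount.books",
                    "book1.txt", new File("book1.txt"));
        }
    }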
  3. The next step is to create the MapReduce program. In my case one Java class contains the code for the Mapper, the Reducer, and the driver. Most of the MapReduce code is the same as usual; the only difference for working with S3 is that you have to set a few S3-specific properties like this: you set the accessKey and secretAccessKey that you get from the AWS Security console, and you tell Hadoop to use s3n as the file system. A fuller sketch of the whole class follows the snippet.
    
    // Replace this value with your AWS access key
    job.getConfiguration().set("fs.s3n.awsAccessKeyId", "awsaccesskey");
    // Replace this value with your AWS secret access key
    job.getConfiguration().set("fs.s3n.awsSecretAccessKey", "awssecretaccesskey");
    // Use the s3n file system, pointed at the input bucket, as the default
    job.getConfiguration().set("fs.default.name", "s3n://com.spnotes.hadoop.wordcount.books");
    
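    For context, here is a minimal sketch of what the complete class might look like, assuming the classic WordCount mapper and reducer; the class layout and names are illustrative, not the exact downloadable source.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class S3MapReduce extends Configured implements Tool {

        // Mapper: emit (word, 1) for every word in the line
        public static class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sum the counts for each word
        public static class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        @Override
        public int run(String[] args) throws Exception {
            Job job = new Job(getConf(), "S3MapReduce");
            job.setJarByClass(S3MapReduce.class);

            // S3-specific configuration: credentials and the s3n file system
            job.getConfiguration().set("fs.s3n.awsAccessKeyId", "awsaccesskey");
            job.getConfiguration().set("fs.s3n.awsSecretAccessKey", "awssecretaccesskey");
            job.getConfiguration().set("fs.default.name", "s3n://com.spnotes.hadoop.wordcount.books");

            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input and output paths come from the command-line arguments (the s3n:// URLs)
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new S3MapReduce(), args));
        }
    }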
  4. The last step is to execute the program, which takes two arguments, the input and the output location. You can right-click on your S3MapReduce program in your IDE and run it with the following two parameters (a command-line alternative is sketched after them):
    
    s3n://com.spnotes.hadoop.wordcount.books s3n://com.spnotes.hadoop.wordcount.output/output3
    
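    If you would rather run it from the command line, the invocation would look something like this; the jar name is an assumption, so adjust it to match your build.

    # Jar name is hypothetical; the two arguments are the input and output locations
    hadoop jar S3MapReduce.jar S3MapReduce \
        s3n://com.spnotes.hadoop.wordcount.books \
        s3n://com.spnotes.hadoop.wordcount.output/output3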
  5. Once the MapReduce job has executed, you can check the output by going to the S3 console and looking at the contents of com.spnotes.hadoop.wordcount.output; the results end up in part-r-* files under the output folder you passed in. You can also list them programmatically, as in the sketch below.
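    This is a small sketch, again using the AWS SDK for Java (v1), that lists the keys in the output bucket; the credentials are placeholders.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class S3OutputCheck {
        public static void main(String[] args) {
            // Replace these with your own AWS credentials
            AmazonS3 s3 = new AmazonS3Client(
                    new BasicAWSCredentials("awsaccesskey", "awssecretaccesskey"));

            // Print every key in the output bucket (e.g. output3/part-r-00000)
            for (S3ObjectSummary summary :
                    s3.listObjects("com.spnotes.hadoop.wordcount.output").getObjectSummaries()) {
                System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
            }
        }
    }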
