Sunil's Notes: MapReduce program that reads input files from S3 and writes output to S3

In the WordCount (HelloWorld) MapReduce program entry I talked about how to create a simple WordCount MapReduce program with Hadoop. I wanted to change it so that it reads input files from an Amazon S3 bucket and writes its output back to an Amazon S3 bucket, so I built the S3MapReduce program, which you can download from here. I followed these steps:

First, create two buckets in your Amazon S3 account, one for storing input and the other for storing output. The most important thing here is to create your buckets in the US Standard region; if you don't, additional steps might be required for Hadoop to be able to access your buckets.
The name of the input bucket in my case is com.spnotes.hadoop.wordcount.books
The name of the output bucket is com.spnotes.hadoop.wordcount.output
Upload a few .txt files that you want to use as input to your input bucket.
The next step is to create the MapReduce program. In my case one Java class has the code for the Mapper, the Reducer and the driver. Most of the MapReduce code is the same as before; the only difference is that for working with S3 you have to add a few S3-specific properties. Basically you need to set your accessKey and secretAccessKey, which you can get from the AWS Security console, and paste them in here. You also have to tell Hadoop to use s3n as the file system.
```
//Replace this value
job.getConfiguration().set("fs.s3n.awsAccessKeyId", "awsaccesskey");
//Replace this value
job.getConfiguration().set("fs.s3n.awsSecretAccessKey","awssecretaccesskey");
//Use s3n as the default file system; the bucket name should match your input bucket
job.getConfiguration().set("fs.default.name","s3n://com.spnotes.hadoop.wordcount.books");
```
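If you would rather not paste your keys into the source file, one option is to collect the same three properties in a small helper and read the keys from environment variables. This is only a sketch and not part of the S3MapReduce download: the class name S3nSettings is made up for illustration, and it assumes you have exported your keys as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original S3MapReduce program.
final class S3nSettings {

    // Collects the three properties that the driver sets on job.getConfiguration().
    static Map<String, String> build(String accessKey, String secretKey, String bucket) {
        if (accessKey == null || accessKey.isEmpty()
                || secretKey == null || secretKey.isEmpty()) {
            throw new IllegalArgumentException("AWS access key and secret key must both be set");
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.s3n.awsAccessKeyId", accessKey);
        props.put("fs.s3n.awsSecretAccessKey", secretKey);
        props.put("fs.default.name", "s3n://" + bucket);
        return props;
    }

    // Assumption: the keys are exported under these environment variable names.
    static Map<String, String> fromEnvironment(String bucket) {
        return build(System.getenv("AWS_ACCESS_KEY_ID"),
                     System.getenv("AWS_SECRET_ACCESS_KEY"),
                     bucket);
    }

    // In the driver, each entry would then be copied over with
    // job.getConfiguration().set(entry.getKey(), entry.getValue());
}
```

The helper only builds a map, so it does not need Hadoop on the classpath; the driver stays responsible for applying the values to the job.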
The last step is to execute the program. It takes two inputs: the input path and the output path. You can just right-click on your S3MapReduce program and run it with the following two parameters
```
s3n://com.spnotes.hadoop.wordcount.books s3n://com.spnotes.hadoop.wordcount.output/output3
```
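It is easy to mistype one of these two parameters, so a quick check before submitting the job can save a failed run. The sketch below is my own illustration (the class name S3nArgs is hypothetical and this check is not in the S3MapReduce download); it only verifies that each argument is an s3n:// URI with a bucket name, using plain java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical argument check, not part of the original S3MapReduce program.
final class S3nArgs {

    // Returns the parsed URI, or throws if it is not of the form s3n://bucket[/path].
    static URI requireS3n(String arg) {
        URI uri;
        try {
            uri = new URI(arg);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + arg, e);
        }
        if (!"s3n".equals(uri.getScheme()) || uri.getAuthority() == null) {
            throw new IllegalArgumentException("Expected s3n://bucket[/path] but got: " + arg);
        }
        return uri;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: S3nArgs <s3n input path> <s3n output path>");
            System.exit(2);
        }
        URI input = requireS3n(args[0]);
        URI output = requireS3n(args[1]);
        // getAuthority() is the bucket name; getPath() is the folder inside it (may be empty).
        System.out.println("Input bucket:  " + input.getAuthority() + " path: " + input.getPath());
        System.out.println("Output bucket: " + output.getAuthority() + " path: " + output.getPath());
    }
}
```

The same requireS3n call could be placed at the top of the driver's main method, before the input and output paths are handed to the job.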
Once the MapReduce job has finished, you can check the output by going to the S3 console and looking at the contents of the com.spnotes.hadoop.wordcount.output bucket (under output3 if you used the parameters above).
