Using Apache Oozie for automating streaming map-reduce job

In the WordCount MapReduce program using Hadoop streaming and python i talked about how to create a Streaming map-reduce job using python. I wanted to figure out how to automate that program using Oozie workflow so i followed these steps
  1. First step was to create a folder called streaming on my local machine and copying of mapper.py, reducer.py into the streaming folder, i also create the place holder for job.properties and workflow.xml
  2. Next i did create a job.properties file like this Now this job.properties is quite similar to the job.properties for java mapreduce job, only difference is you must set oozie.use.system.libpath=true, by default the streaming related jars are not included in the classpath, so unless you set that value to true you will get following error
    
    2014-07-23 06:15:13,170 WARN org.apache.hadoop.mapred.Child: Error running child
    java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.Pi
    peMapRunner not found
     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1649)
     at org.apache.hadoop.mapred.JobConf.getMapRunnerClass(JobConf.java:1010)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:413)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
     at org.apache.hadoop.mapred.Child.main(Child.java:262)
    Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not f
    ound
     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1617)
     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1641)
     ... 8 more
    Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found
     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1523)
     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1615)
     ... 9 more
    2014-07-23 06:15:13,175 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
    
  3. Next step in the process is to create workflow.xml file like this, make sure to add <file>mapper.py#mapper.py</file> element in the workflow.xml, which takes care of putting the mapper.py and reducer.py in the sharedlib and creating symbolic link to these two files.
  4. Upload the streaming folder with all your changes on hdfs by executing following command
    
    hdfs dfs -put streaming streaming
    
  5. You can trigger the oozie workflow by executing following command
    
    oozie job -oozie http://localhost:11000/oozie -config streaming/job.properties -run
    

13 comments:

  1. The knowledge of technology you have been sharing thorough this post is very much helpful to develop new idea. here by i also want to share this.

    Devops Training in pune|Devops training in tambaram|Devops training in velachery|Devops training in annanagar
    DevOps online Training

    ReplyDelete
  2. This is beyond doubt a blog significant to follow. You’ve dig up a great deal to say about this topic, and so much awareness. I believe that you recognize how to construct people pay attention to what you have to pronounce, particularly with a concern that’s so vital. I am pleased to suggest this blog.
    python training in chennai
    python training in chennai
    python training in Bangalore

    ReplyDelete
  3. Excellant post!!!. The strategy you have posted on this technology helped me to get into the next level and had lot of information in it.

    java training in chennai | java training in bangalore

    java online training | java training in pune

    ReplyDelete
  4. Wonderful piece of work. Master stroke. I have become a fan of your words. Pls keep on writing.

    Guest posting sites
    Education

    ReplyDelete
  5. Wow! Such an amazing and helpful post this is. I really really love it. I hope that you continue to do your work like this in the future also.

    Best python classes in Pune
    Python Classes in Pune

    ReplyDelete
  6. Reach to the best Data Science Training institute in Chennai for skyrocketing your career, Infycle Technologies. It is the best Software Training & Placement institute in and around Chennai, that also gives the best placement training for personality tests, interview preparation, and mock interviews for leveling up the candidate's grades to a professional level.

    ReplyDelete
  7. Infycle Technologies, the top software training institute and placement center in Chennai offers the Digital Marketing course in Chennai for freshers, students, and tech professionals at the best offers. In addition to the Oracle training, other in-demand courses such as DevOps, Data Science, Python, Selenium, Big Data, Java, Power BI, Oracle will also be trained with 100% practical classes. After the completion of training, the trainees will be sent for placement interviews in the top MNC's. Call 7504633633 to get more info and a free demo.

    ReplyDelete
  8. This information is very useful and attractive. For those who need this information, it's very informative and understandable for those all. In Extern Labs, we have professional Web designers for intuitive website designs. Extern Labs is also a website design company.You can hire a dedicated Web designer to design an interactive website with innovative ideas. Thanks for this information. website design company

    ReplyDelete