Reading the content of a file into a String in Scala
One of the common requirements is to read the content of a file into a String. You might want to read the content of a config file at a particular path at runtime, but during testing you would want to read the content of a file on the classpath.
I built a simple class with the following 2 methods (sketched below)
- getFilePathContent(): This method takes the full path of a file and reads its content into a String
- getResourceContent(): This method takes the relative path of a file already available on the classpath and converts it into a String
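A minimal sketch of what such a class could look like (the method names come from the post; the class name and the use of scala.io.Source are my assumptions):

import scala.io.Source

// Sketch only: method names are from the post, implementation details are assumed
class FileContentReader {

  // Reads the content of a file at the given absolute path into a String
  def getFilePathContent(fullPath: String): String = {
    val source = Source.fromFile(fullPath)
    try source.mkString finally source.close()
  }

  // Reads the content of a file available on the classpath into a String
  def getResourceContent(relativePath: String): String = {
    val source = Source.fromInputStream(getClass.getResourceAsStream(relativePath))
    try source.mkString finally source.close()
  }
}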
Hello Apache Tika
Apache Tika is a nice framework that lets you extract the content of a file. For example, you can extract the content of a PDF, Word document, or Excel spreadsheet as a String. It also lets you extract metadata about the file, for example when it was created, the author, etc. I built this sample application to play with Tika.
You can try it out by giving it the full path of the file whose content you want to extract.
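The post's sample application is not reproduced here; a minimal sketch of what extraction with the Tika facade could look like (the object name, argument handling, and parser choices are my assumptions):

import java.io.{File, FileInputStream}
import org.apache.tika.Tika
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.sax.BodyContentHandler

object HelloTika {
  def main(args: Array[String]): Unit = {
    val path = args(0) // full path of the file to extract

    // Extract the text content of the file as a String
    val tika = new Tika()
    println(tika.parseToString(new File(path)))

    // Extract metadata (author, creation date, content type, ...)
    val metadata = new Metadata()
    val parser = new AutoDetectParser()
    val stream = new FileInputStream(path)
    try parser.parse(stream, new BodyContentHandler(), metadata) finally stream.close()
    metadata.names().foreach(name => println(s"$name = ${metadata.get(name)}"))
  }
}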
Flume to Spark Streaming - Pull model
In this post I will demonstrate how to stream data from Flume into Spark using Spark Streaming. When it comes to streaming data from Flume to Spark, you have 2 options.
- Push model: Spark listens on a particular port for Avro events, and Flume connects to that port and publishes events
- Pull model: You use a special Spark sink in Flume that keeps collecting published data, and Spark pulls that data at a certain frequency
- First download spark-streaming-flume-sink_2.10-1.6.0.jar and copy it to the flume/lib directory
- Next create a Flume configuration that looks like the sketch below. As you can see, Flume is listening for netcat events on port 44444 and is replicating every event to both the logger and the Spark sink. The Spark sink will listen on port 9999 for the Spark program to connect
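The original configuration file is not reproduced here; a sketch of what it could look like, with the agent and component names as placeholders (ports 44444 and 9999 come from the post):

# Netcat source listening on port 44444, replicated to two channels
agent1.sources = netcatSource
agent1.channels = loggerChannel sparkChannel
agent1.sinks = loggerSink sparkSink

agent1.sources.netcatSource.type = netcat
agent1.sources.netcatSource.bind = localhost
agent1.sources.netcatSource.port = 44444
agent1.sources.netcatSource.selector.type = replicating
agent1.sources.netcatSource.channels = loggerChannel sparkChannel

agent1.channels.loggerChannel.type = memory
agent1.channels.sparkChannel.type = memory

# Logger sink for debugging
agent1.sinks.loggerSink.type = logger
agent1.sinks.loggerSink.channel = loggerChannel

# Spark sink that buffers events until the Spark program pulls them on port 9999
agent1.sinks.sparkSink.type = org.apache.spark.streaming.flume.sink.SparkSink
agent1.sinks.sparkSink.hostname = localhost
agent1.sinks.sparkSink.port = 9999
agent1.sinks.sparkSink.channel = sparkChannel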
This is how your Spark driver will look. The Spark Flume listener receives events in Avro format, so you have to call event.getBody().array() to get the event body, as in the sketch below.
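A minimal sketch of such a driver using FlumeUtils.createPollingStream (the object name, host, batch interval, and storage level are my assumptions; port 9999 comes from the Flume configuration above):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullDriver {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FlumePullDriver")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Pull events from the Spark sink that Flume exposes on port 9999
    val flumeStream = FlumeUtils.createPollingStream(ssc, "localhost", 9999, StorageLevel.MEMORY_ONLY)

    // The body of each event is Avro-wrapped bytes; convert it to a String
    val lines = flumeStream.map(sparkFlumeEvent => new String(sparkFlumeEvent.event.getBody.array()))
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}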
Setting up local repository for maven
A couple of days ago I was working with my colleague on setting up a cluster in AWS for a Spark lab. One problem we ran into is that every time you start a Spark build, it downloads a bunch of dependencies (in our case around 200 MB, mostly because of the complexity of our dependencies). We figured that if every student had to download all of the dependencies, it would take a lot of time and cost money in network bandwidth.
The way we ended up solving this was to first run the Maven build for one user, say user1. Once that build worked, we copied the /user/user01/.m2/repository folder to the /opt/mavenrepo directory. Then every time another user ran the Maven build, they pointed to that existing directory on the machine and reused the dependencies that were already downloaded.
mvn package -Dmaven.repo.local=/opt/mavenrepo/repository
Monitoring HDFS directory for new files using Spark Streaming
I wanted to build a simple Spark Streaming application that monitors a particular directory in HDFS and, whenever a new file shows up, prints its content to the console.
I built HDFSFileStream.scala. In this program, after creating a StreamingContext, I call sparkStreamingContext.textFileStream(<directoryName>) on it. Once a new file appears in the directory, fileRDD.count() returns more than 0 and I invoke processNewFile(). The processNewFile() method takes an RDD[String], iterates through the file content, and prints it to the console.
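Based on the description above, a minimal sketch of what HDFSFileStream.scala could look like (the argument handling and the use of collect() for printing are my assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HDFSFileStream {
  def main(args: Array[String]): Unit = {
    val directoryName = args(0)        // HDFS directory to monitor
    val batchInterval = args(1).toInt  // batch interval in seconds

    val sparkConf = new SparkConf().setAppName("HDFSFileStream")
    val sparkStreamingContext = new StreamingContext(sparkConf, Seconds(batchInterval))

    // Monitor the directory; each new file shows up as lines in the stream
    val fileDStream = sparkStreamingContext.textFileStream(directoryName)

    fileDStream.foreachRDD { fileRDD =>
      // Only process batches that actually contain new file content
      if (fileRDD.count() > 0) processNewFile(fileRDD)
    }

    sparkStreamingContext.start()
    sparkStreamingContext.awaitTermination()
  }

  // Iterate through the file content and print it to the console
  def processNewFile(fileRDD: RDD[String]): Unit = {
    fileRDD.collect().foreach(println)
  }
}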
Next, start the program by executing the following command
bin/spark-submit ~/HelloSparkStreaming-1.0-SNAPSHOT-jar-with-dependencies.jar /user/mapr/stream 3
Once streaming starts, the program monitors the /user/mapr/stream directory for new content. I copied in a file with a few lines in it and got the following output, which is the content of the file
Problem with Scala version mismatch in a Spark application
I was developing a Spark program on my machine and it worked fine. But when I tried to deploy it to Spark running in my Hadoop sandbox, I started getting this error
java.lang.NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef;
at com.spnotes.enrich.CSVFieldEnricher.enrich(CSVFieldEnricher.scala:31)
at com.spnotes.PMDriver$$anonfun$1$$anonfun$apply$2.apply(PMDriver.scala:59)
at com.spnotes.PMDriver$$anonfun$1$$anonfun$apply$2.apply(PMDriver.scala:58)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at com.spnotes.PMDriver$$anonfun$1.apply(PMDriver.scala:58)
at com.spnotes.PMDriver$$anonfun$1.apply(PMDriver.scala:56)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1469)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
So it seems the problem is that you need to compile your code with the same version of Scala that your Spark build uses. In my case I was compiling my code with Scala 2.11, while Spark 1.3.1 uses Scala 2.10.4. So I changed the build file, rebuilt the code, and deployed it again. That fixed the issue.
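The post does not say which build tool was used; as an illustration, if the project were built with sbt, the change would be along these lines (a sketch only):

// build.sbt: compile with the same Scala version as the Spark 1.3.1 cluster
scalaVersion := "2.10.4"

// Spark itself is provided by the cluster at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"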
Spark error class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
I had a Spark program that worked both from the IDE and when I built a .jar file and deployed it in Spark. But it suddenly stopped working in the IDE; whenever I tried executing it there, I got the following error
16/01/05 14:34:50 INFO SparkEnv: Registering OutputCommitCoordinator
Exception in thread "main" java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
at java.lang.ClassLoader.checkCerts(ClassLoader.java:895)
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:665)
at java.lang.ClassLoader.defineClass(ClassLoader.java:758)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.spark-project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:136)
at org.spark-project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:129)
at org.spark-project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:98)
at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:101)
at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:92)
at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:78)
at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:62)
at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:62)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:62)
at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:50)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:61)
at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:151)
at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:106)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:300)
at com.mapr.QS.PMDriver$.main(PMDriver.scala:32)
at com.mapr.QS.PMDriver.main(PMDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Process finished with exit code 1
So the problem, it seems, was that the ordering of javax.servlet:servlet-api.jar on the classpath was wrong. I opened the project settings, moved that dependency jar to the end of the list, and it started working. This is the screenshot of the IntelliJ settings
This is a screenshot of how to achieve the same thing in Eclipse
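A build-level alternative to reordering dependencies in the IDE (not what the post does) is to exclude the conflicting servlet-api artifact that comes in transitively; a hypothetical sbt sketch, with the version number as a placeholder:

// build.sbt sketch: drop the signed javax.servlet jar pulled in by hadoop-client
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.5.1" excludeAll(
  ExclusionRule(organization = "javax.servlet")
)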
How to use Hadoop's InputFormat and OutputFormat in Spark
One of the things that I like about Spark is that it lets you use your MapReduce-based InputFormat and OutputFormat classes for reading and writing data.
To try this out I built the InputFormatOutputDriver class, which uses KeyValueTextInputFormat for reading a file, performs a word count on that input, and finally uses TextOutputFormat for storing the output.
Most of the code is similar to the WordCount program built using Apache Spark in Java, the difference being that this one is written in Scala, plus the following 2 lines.
When you want to use the Hadoop API for reading data you should use the sparkContext.newAPIHadoopFile() method. I am using the version of the method that takes 4 parameters: the first is the path of the input file, the second is the InputFormat class you want to use (I want to read the file as key-value pairs, so I am using KeyValueTextInputFormat), and the last two are the types of the key and the value, which are both Text in my example. Spark reads the file into a PairRDD[Text,Text]; since I am only interested in the content of the file, I iterate through the keys and convert them from Text to String.
val lines = sparkContext.newAPIHadoopFile(inputFile,classOf[KeyValueTextInputFormat],
classOf[Text],classOf[Text]).keys.map(lineText => lineText.toString)
Once I have an RDD[String] I can perform the word count on it. Once the results are ready, I call wordCountRDD.saveAsNewAPIHadoopFile() to store the data in Hadoop using TextOutputFormat.
wordCountRDD.saveAsNewAPIHadoopFile(outputFile,classOf[Text],classOf[IntWritable],
classOf[TextOutputFormat[Text,IntWritable]])
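Putting the two snippets above together, a minimal sketch of what the driver could look like (the newAPIHadoopFile and saveAsNewAPIHadoopFile calls come from the post; the word-count step in between and the argument handling are my assumptions):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

object InputFormatOutputDriver {
  def main(args: Array[String]): Unit = {
    val inputFile = args(0)
    val outputFile = args(1)
    val sparkContext = new SparkContext(new SparkConf().setAppName("InputFormatOutputDriver"))

    // Read the file with a MapReduce InputFormat; keep only the keys as plain Strings
    val lines = sparkContext.newAPIHadoopFile(inputFile, classOf[KeyValueTextInputFormat],
      classOf[Text], classOf[Text]).keys.map(lineText => lineText.toString)

    // Standard word count, converted back to Writable types for the OutputFormat
    val wordCountRDD = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map { case (word, count) => (new Text(word), new IntWritable(count)) }

    // Write the result with a MapReduce OutputFormat
    wordCountRDD.saveAsNewAPIHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
      classOf[TextOutputFormat[Text, IntWritable]])

    sparkContext.stop()
  }
}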
How to use ZooInspector
ZooKeeper comes with a ZooInspector GUI that you can use to inspect your zNode structure. You can use it with these steps
- First, go to the ZooInspector directory (I am assuming that you already have ZooKeeper on your machine; if not, download it from the ZooKeeper home page)
cd <ZOOKEEPER_HOME>/contrib/ZooInspector
- Start ZooInspector with the following command, which makes sure that the necessary jars are on the classpath
java -cp ../../lib/*:lib/*:zookeeper-3.4.7-ZooInspector.jar:../../zookeeper-3.4.7.jar org.apache.zookeeper.inspector.ZooInspector
- Once ZooInspector has started, enter the URL of the ZooKeeper server that you want to inspect
- ZooInspector will then show you the zNode hierarchy on that server