When you are developing Spark code, you have the option of writing it in Scala, Java, or Python. In some cases you might want to mix languages. I wanted to try that out, so I built a simple Spark program that hands control to Python for a transformation (all it does is prepend the word "Python " to every line). You can download the source code for the sample project from here.
The first thing I did was write a simple Python script that reads one line at a time from standard input, prepends "Python " to it, and writes the result back to standard output.
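A minimal echo.py along these lines would do the job (the exact script ships with the sample project, so treat this as a sketch):

import sys

# Read lines from standard input, prepend "Python ",
# and write each one back to standard output.
# sys.stdout.write keeps the line's original newline intact.
for line in sys.stdin:
    sys.stdout.write("Python " + line)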
Now this is what the driver looks like. Most of the Spark code is the same; the only difference is
lines.pipe("python echo.py")
which pipes every line in the RDD to the python echo.py process and collects its output. There is nothing Python-specific here; you could use any executable.
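The full driver is in the sample project; a stripped-down version would look something like this (the object name PipeExample is mine, not the project's):

import org.apache.spark.{SparkConf, SparkContext}

object PipeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ScalaPython"))

    // Read the input file passed on the command line (e.g. helloworld.txt)
    val lines = sc.textFile(args(0))

    // Hand every line of the RDD to the external process and collect
    // whatever it writes to standard output as a new RDD[String]
    val transformed = lines.pipe("python echo.py")

    transformed.collect().foreach(println)
    sc.stop()
  }
}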
When you run this code on a cluster, you should copy the Python file onto your machine, say into the Spark directory, and then execute
bin/spark-submit --files echo.py ScalaPython-1.0-SNAPSHOT-jar-with-dependencies.jar helloworld.txt
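The --files option makes Spark ship echo.py to the executors (on YARN, for example, it lands in each executor's working directory), which is why the relative path in lines.pipe("python echo.py") resolves on the worker nodes.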