Invoking Python from a Spark Scala project

When you are developing Spark code, you have the option of writing it in Scala, Java, or Python. In some cases you might want to mix languages. I wanted to try that out, so I built this simple Spark program that passes control to Python to perform a transformation (all it does is prepend the word "Python " to every line). You can download the source code for the sample project from here.

The first thing I did was develop a simple Python script that reads one line at a time from standard input, prepends "Python " to the line, and writes it back to standard output.

This is what the driver looks like. Most of the Spark code is the same as usual; the only difference is lines.pipe("python echo.py"), which passes every line in the RDD to the standard input of python echo.py and collects its output lines into a new RDD. There is nothing specific to Python here; you could use any executable instead.

When you run this code on a cluster, copy the Python file to your machine, say into the Spark directory, and then execute:

bin/spark-submit \
    --files echo.py \
    ScalaPython-1.0-SNAPSHOT-jar-with-dependencies.jar helloworld.txt
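
For reference, the echo.py script described above could look roughly like the following. This is a minimal sketch rather than the exact file from the sample project: the helper name prefix_line and the PREFIX constant are my own, and the "Python " prefix is assumed from the description.

```python
import sys

PREFIX = "Python "  # assumed prefix, taken from the description in the post


def prefix_line(line: str) -> str:
    """Return the line with PREFIX in front and any trailing newline removed."""
    return PREFIX + line.rstrip("\n")


def main() -> None:
    # RDD.pipe() writes each element to this process's stdin as one line
    # and turns every line printed to stdout into an element of the result.
    for line in sys.stdin:
        sys.stdout.write(prefix_line(line) + "\n")
    sys.stdout.flush()


if __name__ == "__main__":
    main()
```

You can try the script without Spark by piping a text file into it, for example: python echo.py < helloworld.txt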

3 comments:

Anonymous said...

Is there any example of doing the same with a DataFrame?

Anonymous said...

I like your post very much. It is nice useful for my research. I wish for you to share more info about this. Keep blogging

Anonymous said...

Thanks Sunil, Really Helpful!!