• Visit the Apache Spark download page and download the tgz distribution of Spark prebuilt with Hadoop
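  For example, Spark 2.3.0 prebuilt for Hadoop 2.7 can be fetched from the Apache release archive:
$ wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz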
  • Extract the archive and move it to the /opt directory
$ tar -xzf spark-2.3.0-bin-hadoop2.7.tgz
$ mv spark-2.3.0-bin-hadoop2.7 /opt/spark-2.3.0
  • Create a symlink, so the active version can be swapped without changing paths
$ ln -s /opt/spark-2.3.0 /opt/spark
  • Add Spark to your environment by appending the following lines to your ~/.bashrc
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
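  Reload the shell configuration so the new variables take effect (assuming the two export lines were appended to ~/.bashrc):
$ source ~/.bashrc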
  • At this point Spark is installed on your machine. Test it by launching the PySpark shell
$ pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.6.4 (default, Jan 16 2018 18:10:19)
SparkSession available as 'spark'.
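  As a quick sanity check, run a small job in the shell; the 'spark' session object named in the banner is already defined:
>>> spark.range(10).count()
10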
  • Connect Spark to your Python scripts by installing findspark, which points Python to the Spark installation
$ pip install findspark
  • Install pyspark to be able to use Spark from Python
$ pip install pyspark
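  pip installs its own copy of PySpark, so it is worth pinning it to the same version as the /opt/spark installation to avoid a version mismatch:
$ pip install pyspark==2.3.0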
  • Now you can use Spark from your Python scripts
import findspark

# findspark.init() must run before pyspark is imported, so that Python
# resolves pyspark from the Spark installation under /opt/spark
findspark.init(spark_home="/opt/spark")

import pyspark
from pyspark.sql import SQLContext, Row

conf = pyspark.SparkConf().setAppName('tf_fraud')
sc = pyspark.SparkContext(conf=conf)
sqlctx = SQLContext(sc)

# Oracle connection details -- replace these placeholder values with your own.
# Note: the Oracle JDBC driver jar (e.g. ojdbc8.jar) must be on Spark's
# classpath, e.g. via the --jars option or the spark.jars configuration.
username, password = "db_user", "db_password"
ip, port, SID = "db_host", 1521, "ORCL"
connection_url = "jdbc:oracle:thin:" + username + "/" + password + "@" + ip + ":" + str(port) + "/" + SID

df_pyspark = sqlctx.read.format('jdbc').options(url=connection_url, dbtable="employee", driver="oracle.jdbc.OracleDriver").load()
df_pyspark.printSchema()  # printSchema() prints the schema itself and returns None
df_pyspark.show()         # print the first rows on the driver instead of foreach(print)
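
  From here the DataFrame behaves like any other Spark DataFrame. As a minimal sketch (the name and salary columns are hypothetical, assuming they exist in the employee table), it can also be registered as a temporary view and queried with SQL:

# Register the loaded table as a temporary view and query it with Spark SQL
df_pyspark.createOrReplaceTempView("employee")
top_paid = sqlctx.sql("SELECT name, salary FROM employee ORDER BY salary DESC LIMIT 10")
top_paid.show()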