How to Deploy Spark Standalone in Oracle Cloud (OCI)
1 Introduction
The following walk-through guides you through the steps needed to set up your environment to run Spark and Hadoop in Oracle Cloud Infrastructure.
2 Prerequisites
You have deployed a VM shape 2.1 or larger with Oracle Linux 7.9 (OEL7) in Oracle Cloud Infrastructure (OCI).
- The Oracle Linux 7.9 installation ships with a JVM by default.
- You have root access, either directly or via sudo. By default in OCI, you are connected as the “opc” user, which has sudo privileges.
[opc@xxx ~]$ java -version
java version "1.8.0_281"
Java(TM) SE Runtime Environment (build 1.8.0_281-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.281-b09, mixed mode)
3 Java Installation
The install is quite simple: it consists of setting up Java, then installing the Spark and Hadoop components and libraries. Let's start with Java.
Download the latest JDK 1.8 release, since Hadoop 2.x runs on this Java version.
rpm -ivh /home/opc/jdk-8u271-linux-x64.rpm
Check the Java version:
java -version
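If the base image's default JVM is still the active one after installing the RPM, you can switch to the new JDK with the alternatives tool (a hedged example; which entry to pick depends on the JDK release you installed):
# Choose the newly installed JDK 1.8 from the list
sudo alternatives --config java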
4 Spark and Hadoop Setup
The next step is to install the Spark and Hadoop environment.
First, choose the Spark and Hadoop version you want to install, then download it:
Download Spark 2.4.5 for Hadoop 2.7
cd /home/opc
wget http://apache.uvigo.es/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
Download Spark 2.4.7 for Hadoop 2.7
wget http://apache.uvigo.es/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
Download Spark 3.1.1 for Hadoop 3.2
wget http://apache.uvigo.es/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
Install the Spark and Hadoop Version
Install the Spark and Hadoop version chosen in the “/opt” directory.
sudo -i
cd /opt
tar -zxvf /home/opc/spark-2.4.5-bin-hadoop2.7.tgz
#or
tar -zxvf /home/opc/spark-2.4.7-bin-hadoop2.7.tgz
#or
tar -zxvf /home/opc/spark-3.1.1-bin-hadoop3.2.tgz
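To confirm the archive extracted correctly, list the new directory. Optionally, a version-neutral symlink (my own convention, not part of the original steps) keeps later configuration independent of the exact version:
ls -ld /opt/spark-*
# Optional: version-neutral path (adjust to the version you extracted)
ln -s /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark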
5 Install PySpark in a Python 3 environment
/opt/Python-3.7.6/bin/pip3 install 'pyspark==2.4.7'
/opt/Python-3.7.6/bin/pip3 install findspark
Next, we shall create a virtual environment and activate it.
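A minimal sketch of that step, assuming the Python 3.7.6 build under “/opt/Python-3.7.6” used above (the venv path “/home/opc/pyspark-env” is only an example):
# Create the virtual environment with the Python 3.7.6 interpreter
/opt/Python-3.7.6/bin/python3 -m venv /home/opc/pyspark-env
# Activate it for the current shell
source /home/opc/pyspark-env/bin/activate
# Install the same packages inside the virtual environment
pip install 'pyspark==2.4.7' findspark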
Modify your environment to use this Spark and Hadoop version.
Add the following lines to the “.bashrc” file of the “opc” user:
# Add by %OP%
export PYTHONHOME=/opt/anaconda3
export PATH=$PYTHONHOME/bin:$PYTHONHOME/condabin:$PATH
# SPARK ENV
#export JAVA_HOME=$(/usr/libexec/java_home)
export SPARK_HOME=/opt/spark-2.4.5-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
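Reload the profile and check that the variables are picked up (a quick verification step, not in the original text):
source ~/.bashrc
echo $SPARK_HOME
spark-submit --version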
6 Test your Spark and Hadoop Environment
If you are running directly on the virtual machine and have a browser installed, starting “pyspark” should take you straight into the Jupyter environment. Otherwise, connect to “http://xxx.xxx.xxx.xxx:8001/”.
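Independently of Jupyter, a quick way to confirm that Spark itself runs is the SparkPi example bundled with every Spark distribution (a simple smoke test, not part of the original walk-through):
# Runs a small local job and prints an approximation of Pi
$SPARK_HOME/bin/run-example SparkPi 10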
Then upload the following notebooks: