Installing Apache Spark 2.2.1

I have installed older Apache Spark versions and now the time is right to install Spark 2.2.1.

Update 27.09.2019: Automated Apache Spark install with Docker, Terraform and Ansible? Check out this post.

Im using an AWS t2.micro instance with Ubuntu 16.04 on it. MobaXterm is my choice of interface to SSH to the instance.

System update

sudo apt-get update -y
sudo apt-get upgrade -y

Change instance name

Go into the hostname file and change the name to spark

sudo vi /etc/hostname

Change localhost with instance name in hosts file

After the 127.0.0.1 write the name of the instance

sudo vi /etc/hosts

Reboot the instance

sudo reboot

Install and set up Java

If not sooner you will need Java for running History server. Java 8 is installed in the following way

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y

Add JAVA_HOME to the environment file

sudo vi /etc/environment

And add the following line to the top of the file

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Doublecheck if this is the correct Java home.

Python

Spark 2.2.x supports Python 2.7+/3.4+. When running PySpark, Spark looks for Python in /usr/bin directory. Ubuntu 16.04 on AWS comes only with Python 3.5. When running PySpark without Python 2.7+, the following error message outputs

/usr/apache/spark-2.2.1-bin-hadoop2.7/bin/pyspark: line 45: python: command not found
env: ‘python’: No such file or directory

Options are two: install Python 2.7+ or create a link to Python3. The first alternative is acceptable only if Python packages that are in Python2 but not Python3 are going to be used. Otherwise, the latter alternative is the option. And this is done in the following way

sudo ln -s /usr/bin/python3 /usr/bin/python

Running PySpark now starts PySpark CLI with Python 3.5.2

And yes, running python or python3 will both execute the same action – start python 3.5.

Create user spark

sudo adduser spark

Define password, for the sake of testing, let’s go with spark

Prepare directory for Spark home

sudo mkdir /usr/apache

Step into the directory

cd /usr/apache

Download and unpack Apache Spark 2.2.1

sudo wget http://apache.uib.no/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
sudo tar -xvzf spark-2.2.1-bin-hadoop2.7.tgz

Clean up – delete the spark’s tgz file

sudo rm spark-2.2.1-bin-hadoop2.7.tgz

Change the owner of the spark directory

sudo chown spark:spark /usr/apache/spark-2.2.1-bin-hadoop2.7

Create Spark home

cd spark-2.2.1-bin-hadoop2.7
pwd

Output of the pwd command is the value for SPARK_HOME in the environment file. Open the file

sudo vi /etc/environment

And add the below SPARK_HOME line before the PATH line

export SPARK_HOME=/usr/apache/spark-2.2.1-bin-hadoop2.7

At the end of PATH add

:${SPARK_HOME}/bin

(You need the colon to separate from previous values)
Refresh the environment file

source /etc/environment

Create log and pid directories

sudo mkdir -p /var/log/spark/logs
sudo chown spark:spark -R /var/log/spark
sudo -u spark mkdir $SPARK_HOME/run
sudo -u spark chmod 777 /var/log/spark/logs

The last line allows every user to read, write to the directory. It is probably best to adjust access according to the needs.

If different users are going to run Spark applications and if those applications would want to be seen in History Server, user spark has to be added to the users’ group.
For example, user ubuntu is running Spark applications, that means user ubuntu writes log files to Spark History log directory (check below for the property spark.history.fs.logDirectory) and Spark is opening them through History Server. To be albe to do the latter, spark has to be a memeber of ubuntu group. This is done in the following way

sudo usermod -a -G ubuntu spark

Prepare spark-env.sh file

Open the file

sudo -u spark vi $SPARK_HOME/conf/spark-env.sh

Add the following values

SPARK_LOG_DIR=/var/log/spark
SPARK_PID_DIR=${SPARK_HOME}/run

Prepare spark-default.sh file

Open the file

sudo -u spark vi $SPARK_HOME/conf/spark-defaults.conf

Add the following values

spark.history.fs.logDirectory file:/var/log/spark/logs
spark.eventLog.enabled true
spark.eventLog.dir file:/var/log/spark/logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

Start Spark History

sudo -u spark $SPARK_HOME/sbin/start-history-server.sh

Instances IP address on port 18080 should open the Spark History Server. If not, check the /var/log/spark for errors and messages.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s