I have installed older Apache Spark versions and now the time is right to install Spark 2.2.1.
Update 27.09.2019: Automated Apache Spark install with Docker, Terraform and Ansible? Check out this post.
I'm using an AWS t2.micro instance with Ubuntu 16.04 on it. MobaXterm is my choice of interface to SSH to the instance.
System update
sudo apt-get update -y
sudo apt-get upgrade -y
Change instance name
Go into the hostname file and change the name to spark
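As a side note (not part of the original steps), on systemd-based Ubuntu 16.04 the same rename can also be done without editing the file:
sudo hostnamectl set-hostname spark   # assumes the new name is spark, as above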
sudo vi /etc/hostname
Add the instance name to the hosts file
After 127.0.0.1, write the name of the instance.
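Assuming the instance was renamed to spark, the relevant line in /etc/hosts might look like this (a sketch):
127.0.0.1 localhost spark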
sudo vi /etc/hosts
Reboot the instance
sudo reboot
Install and set up Java
If not sooner, you will need Java to run the History Server. Java 8 is installed in the following way
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y
Add JAVA_HOME to the environment file
sudo vi /etc/environment
And add the following line to the top of the file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Double-check that this is the correct Java home.
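One way to verify it (a quick sketch, assuming Java 8 was installed from the PPA above) is to resolve the java binary and strip the trailing part of the path:
readlink -f "$(which java)" | sed 's:/jre/bin/java::; s:/bin/java::'   # should print /usr/lib/jvm/java-8-openjdk-amd64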
Python
Spark 2.2.x supports Python 2.7+/3.4+. When running PySpark, Spark looks for Python in the /usr/bin directory. Ubuntu 16.04 on AWS comes only with Python 3.5. When PySpark cannot find a python executable, the following error message appears
/usr/apache/spark-2.2.1-bin-hadoop2.7/bin/pyspark: line 45: python: command not found
env: ‘python’: No such file or directory
There are two options: install Python 2.7+ or create a link to Python 3. The first alternative is acceptable only if you need Python packages that exist for Python 2 but not for Python 3. Otherwise, the latter alternative is the option, and this is done in the following way
sudo ln -s /usr/bin/python3 /usr/bin/python
Running PySpark now starts the PySpark CLI with Python 3.5.2
And yes, running either python or python3 will now do the same thing: start Python 3.5.
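A quick sanity check after creating the symlink (a sketch; the exact patch version may differ):
python --version    # should report Python 3.5.x
python3 --version   # same interpreter as above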
Create user spark
sudo adduser spark
Define a password; for the sake of testing, let's go with spark
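If you prefer to set that throwaway password non-interactively (an alternative sketch, not part of the original steps):
echo 'spark:spark' | sudo chpasswd   # sets password "spark" for user spark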
Prepare directory for Spark home
sudo mkdir /usr/apache
Step into the directory
cd /usr/apache
Download and unpack Apache Spark 2.2.1
sudo wget http://apache.uib.no/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
sudo tar -xvzf spark-2.2.1-bin-hadoop2.7.tgz
Clean up: delete the Spark tgz file
sudo rm spark-2.2.1-bin-hadoop2.7.tgz
Change the owner of the spark directory
sudo chown -R spark:spark /usr/apache/spark-2.2.1-bin-hadoop2.7
Create Spark home
cd spark-2.2.1-bin-hadoop2.7
pwd
The output of the pwd command is the value for SPARK_HOME in the environment file. Open the file
sudo vi /etc/environment
And add the below SPARK_HOME line before the PATH line
export SPARK_HOME=/usr/apache/spark-2.2.1-bin-hadoop2.7
At the end of PATH add
:${SPARK_HOME}/bin
(You need the colon to separate it from the previous values.)
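The resulting /etc/environment might then look roughly like this (a sketch; the original PATH value on your instance may differ):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/usr/apache/spark-2.2.1-bin-hadoop2.7
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:${SPARK_HOME}/bin"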
Refresh the environment file
source /etc/environment
Create log and pid directories
sudo mkdir -p /var/log/spark/logs
sudo chown spark:spark -R /var/log/spark
sudo -u spark mkdir $SPARK_HOME/run
sudo -u spark chmod 777 /var/log/spark/logs
The last line allows every user to read and write to the directory. It is probably best to adjust access according to your needs.
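One possible tighter setup (my own assumption, not from the original steps; the group name sparklogs is hypothetical) is to make the log directory group-writable for a dedicated group instead of world-writable:
sudo groupadd sparklogs                # hypothetical shared group for users writing event logs
sudo usermod -a -G sparklogs ubuntu    # add each application user to the group
sudo chgrp sparklogs /var/log/spark/logs
sudo chmod 2775 /var/log/spark/logs    # group-writable, setgid so new files inherit the group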
If different users are going to run Spark applications, and those applications should show up in the History Server, the user spark has to be added to those users' groups.
For example, user ubuntu is running Spark applications, which means user ubuntu writes log files to the Spark History log directory (check below for the property spark.history.fs.logDirectory) and Spark opens them through the History Server. To be able to do the latter, spark has to be a member of the ubuntu group. This is done in the following way
sudo usermod -a -G ubuntu spark
Prepare spark-env.sh file
Open the file
sudo -u spark vi $SPARK_HOME/conf/spark-env.sh
Add the following values
SPARK_LOG_DIR=/var/log/spark
SPARK_PID_DIR=${SPARK_HOME}/run
Prepare spark-defaults.conf file
Open the file
sudo -u spark vi $SPARK_HOME/conf/spark-defaults.conf
Add the following values
spark.history.fs.logDirectory   file:/var/log/spark/logs
spark.eventLog.enabled          true
spark.eventLog.dir              file:/var/log/spark/logs
spark.history.provider          org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port           18080
spark.blockManager.port         38000
spark.broadcast.port            38001
spark.driver.port               38002
spark.executor.port             38003
spark.fileserver.port           38004
spark.replClassServer.port      38005
Start Spark History
sudo -u spark $SPARK_HOME/sbin/start-history-server.sh
The instance's IP address on port 18080 should open the Spark History Server. If not, check /var/log/spark for errors and messages.
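To see an application appear in the History Server, a simple smoke test (a sketch, run as user ubuntu and assuming the examples jar shipped with the Spark 2.2.1 tarball) is to submit SparkPi and then refresh the UI on port 18080:
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.1.jar 10
Because spark.eventLog.enabled is set to true above, the run writes its event log to /var/log/spark/logs, which is where the History Server picks it up.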