Apache Spark 2.0.0 – Installation and configuration

I am running an HDP 2.4 multinode cluster with Ubuntu Trusty 14.04 on all nodes. The Spark in this post is installed on my client node. My cluster has HDFS and YARN, among other services, all installed from Ambari. Apache Spark 2.0 is the exception: it is installed manually, because Hortonworks does not offer Spark 2.0 on HDP 2.4.0.

The documentation on the latest Spark version can be found here.

My notes on Spark 2.0 can be found here (if anyone finds them useful).

My post on setting up Apache Spark 1.6.0.

Update 12 January 2018: A post on how to install Apache Spark 2.2.1: Apache Spark 2.2.1 on Ubuntu 16.04 – Hadoop-less instance.

Prerequisites

Java

Add the OpenJDK repository, update the package index and install Java

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y

Add JAVA_HOME in the system variables file

sudo vi /etc/environment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
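
The JAVA_HOME value above can be cross-checked against the real location of the java binary. The following is a minimal sketch, assuming the usual OpenJDK layout in which the binary sits in <JAVA_HOME>/bin/java or <JAVA_HOME>/jre/bin/java. The function name java_home_from_binary is made up for this illustration.

```shell
# Derive a JAVA_HOME candidate from the full path of a java binary.
# Assumption: the binary lives in <JAVA_HOME>/bin/java or <JAVA_HOME>/jre/bin/java.
java_home_from_binary() {
    java_bin="$1"
    java_bin="${java_bin%/bin/java}"   # strip the trailing /bin/java
    java_bin="${java_bin%/jre}"        # strip a trailing /jre, if present
    printf '%s\n' "$java_bin"
}
```

On a node with Java installed, it could be called as java_home_from_binary "$(readlink -f "$(command -v java)")" and the result compared with the value written to /etc/environment.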

Spark user

Create user spark and add it to group hdfs

sudo adduser spark
sudo usermod -a -G hdfs spark

HDFS home directory for user spark

Create spark’s user folder in HDFS

sudo -u hdfs hadoop fs -mkdir -p /user/spark
sudo -u hdfs hadoop fs -chown -R spark:hdfs /user/spark

Spark installation and configuration

Install Spark

Create the directory where the Spark directory is going to reside, then step into it

sudo mkdir /usr/apache
cd /usr/apache

Download Spark 2.0.0 from https://spark.apache.org/downloads.html

sudo wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz

Unpack the tar file

sudo tar -xvzf spark-2.0.0-bin-hadoop2.7.tgz

Remove the tar file after it has been unpacked

sudo rm spark-2.0.0-bin-hadoop2.7.tgz

Change the ownership of the folder and its elements

sudo chown -R spark:spark spark-2.0.0-bin-hadoop2.7

Update system variables

Step into the spark 2.0.0 directory and run pwd to get full path

cd spark-2.0.0-bin-hadoop2.7
pwd

Update the system environment file by adding SPARK_HOME and adding SPARK_HOME/bin to the PATH

sudo vi /etc/environment

export SPARK_HOME=/usr/apache/spark-2.0.0-bin-hadoop2.7

At the end of PATH add

${SPARK_HOME}/bin

Refresh the system environments

source /etc/environment
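
After refreshing the environment, it may be worth confirming that Spark's bin directory really ended up on the PATH. This is a small sketch under the assumption that the install path from this post is used when SPARK_HOME is not set. The helper name path_has_dir is hypothetical.

```shell
# Return success if directory $2 is one of the colon-separated entries of $1.
path_has_dir() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether Spark's bin directory is on the current PATH.
# SPARK_HOME falls back to the install path used in this post if it is unset.
spark_bin="${SPARK_HOME:-/usr/apache/spark-2.0.0-bin-hadoop2.7}/bin"
if path_has_dir "$PATH" "$spark_bin"; then
    echo "OK: $spark_bin is on PATH"
else
    echo "Missing: $spark_bin is not on PATH"
fi
```

Wrapping both sides in colons makes the match exact, so a longer directory that merely starts with the same path does not count as a hit.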

Log and pid directories

Create log and pid directories

sudo mkdir /var/log/spark
sudo chown spark:spark /var/log/spark
sudo -u spark mkdir $SPARK_HOME/run

Spark configuration files

Hive configuration

Create a Hive warehouse and give permissions to the users. If Hive service is set up, the path to the Hive warehouse could be /apps/hive/warehouse

sudo -u hive hadoop fs -mkdir /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod -R 777 /user/hive/warehouse

Find the hive-site.xml file (HDP versions usually have it in /usr/hdp/current/hive-client/conf) and copy it to $SPARK_HOME/conf.
In the copied file, one property name should be changed:
hive.metastore.warehouse.dir is replaced with spark.sql.warehouse.dir.
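
The rename can also be done without opening an editor. Here is a sketch, assuming GNU sed (the default on Ubuntu) and a property name that sits on a single line inside a <name> element. The function name rename_warehouse_property is only illustrative.

```shell
# Rename the warehouse property in a copy of hive-site.xml, in place.
# Only the <name> element is changed; the <value> element is left untouched.
# Assumption: GNU sed, which supports -i without a backup suffix.
rename_warehouse_property() {
    sed -i 's#<name>hive\.metastore\.warehouse\.dir</name>#<name>spark.sql.warehouse.dir</name>#' "$1"
}
```

It would be run against the copy, for example rename_warehouse_property "$SPARK_HOME/conf/hive-site.xml", by a user that is allowed to write to that file. The original file under the Hive client directory stays as it is.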

Spark environment file

Create a new file under $SPARK_HOME/conf

sudo -u spark vi conf/spark-env.sh

Add the following lines and adjust accordingly.

export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=${SPARK_HOME}/run
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_SUBMIT_OPTIONS="--jars ${SPARK_HOME}/lib/spark-csv_2.11-1.4.0.jar"

The last line serves as an example of how to add external libraries to Spark. This particular package is quite common and installing it is advised. The package can be downloaded from this site.

Spark default file

Fetch HDP version

hdp-select status hadoop-client | awk '{print $3;}'

Example output for HDP 2.4:

2.4.0.0-169

Example output for HDP 2.5:

2.5.0.0-1245
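
The version string is needed in several configuration files, so it can be handy to capture it in a variable once. The sketch below assumes that hdp-select prints one line in the form "component - version", which is what the awk call above implies. The sample line stands in for the real command output, and parse_hdp_version is a made-up name.

```shell
# Extract the version field from one line of `hdp-select status` output.
# Assumption: the line has the form "<component> - <version>".
parse_hdp_version() {
    awk '{print $3}'
}

# Sample line standing in for the real command output (an assumption).
sample="hadoop-client - 2.4.0.0-169"
hdp_version=$(printf '%s\n' "$sample" | parse_hdp_version)

# Print the JVM option that the Spark configuration files expect.
printf '%s%s\n' '-Dhdp.version=' "$hdp_version"
```

On a real HDP node the sample line would be replaced with hdp_version=$(hdp-select status hadoop-client | parse_hdp_version).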

Create spark-defaults.conf file in $SPARK_HOME/conf

sudo -u spark vi $SPARK_HOME/conf/spark-defaults.conf

Add the following and adjust accordingly (some properties belong to Spark History Server whose configuration is explained in the post in the link below)

spark.driver.extraJavaOptions -Dhdp.version=2.4.0.0-169
spark.yarn.am.extraJavaOptions -Dhdp.version=2.4.0.0-169
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080

spark.history.kerberos.keytab none
spark.history.kerberos.principal none

spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address spark-server:18080
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3

spark.jars.packages com.databricks:spark-csv_2.11:1.4.0

spark.io.compression.codec lzf

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

The ports are defined in this configuration file. If they are not, then Spark assigns random ports. More on ports and assigning them in Spark can be found here.

If the ports are not under control, you risk the

Yarn application has already ended! It might have been killed or unable to launch application master.

error. More on that is written here.
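
To see at a glance which ports a configuration pins down, the port properties can be pulled out of the file. This is a sketch only, assuming the whitespace-separated "property value" layout shown above. The function name list_spark_ports is hypothetical.

```shell
# Print "property port" pairs for every property ending in .port
# in a spark-defaults.conf style file. Comment lines are skipped.
list_spark_ports() {
    awk '!/^[ \t]*#/ && $1 ~ /\.port$/ {print $1, $2}' "$1"
}
```

Calling list_spark_ports "$SPARK_HOME/conf/spark-defaults.conf" gives a short list that can be compared with whatever ports are open between the client node and the cluster.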

JAVA OPTS

Create java-opts file in $SPARK_HOME/conf and add your HDP version

sudo -u spark vi $SPARK_HOME/conf/java-opts

Example:

-Dhdp.version=2.4.0.0-169

Fixing links in Ubuntu

Since the Hadoop distribution is Hortonworks and Spark is Apache’s, a workaround is needed: the default link is replaced with a new one.
First, the existing link is removed. Then the new link is created, pointing to $SPARK_HOME.

sudo rm /usr/hdp/current/spark-client
sudo ln -s /usr/apache/spark-2.0.0-bin-hadoop2.7 /usr/hdp/current/spark-client
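
The two steps can also be combined into one and verified right away. This is a sketch assuming GNU coreutils, where ln accepts -n to treat an existing link to a directory as a plain file. The function name relink is invented for this example.

```shell
# Point symlink $2 at directory $1, replacing an existing link in one step,
# then print where the link points now.
# -n stops ln from creating the new link inside the old link's target directory.
relink() {
    ln -sfn "$1" "$2" && readlink "$2"
}
```

For the paths in this post it would be called with the Spark directory as the first argument and /usr/hdp/current/spark-client as the second, from a shell with root privileges.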

Spark 2.0.0 is now almost ready.

Jersey problem

If you try to run a spark-submit command on YARN you can expect the following error message:

Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

Jar file jersey-bundle-*.jar is not present in the $SPARK_HOME/jars. Adding it fixes this problem:

sudo -u spark wget https://repo1.maven.org/maven2/com/sun/jersey/jersey-bundle/1.19.1/jersey-bundle-1.19.1.jar -P $SPARK_HOME/jars

January 2017 – Update on this issue:
If the Jersey 1 jar is added as described above, Jersey 1 will be used when starting Spark History Server and the applications in Spark History Server will not be shown. The following error message will be generated in the Spark History Server output file:

WARN servlet.ServletHandler: /api/v1/applications
java.lang.NullPointerException
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)

This problem occurs only when running Spark on YARN, since YARN 2.7.3 uses Jersey 1 and Spark 2.0 uses Jersey 2.

One workaround is not to add the Jersey 1 jar described above but disable the YARN Timeline Service in spark-defaults.conf

spark.hadoop.yarn.timeline-service.enabled false
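
Adding a property by hand is easy, but a small helper avoids ending up with the same key twice. The following is a rough sketch that assumes the whitespace-separated "property value" format of spark-defaults.conf. The function name set_spark_property is hypothetical.

```shell
# Set a property in a spark-defaults.conf style file:
# replace the line if the key is already present, append it otherwise.
set_spark_property() {
    conf_file="$1"; key="$2"; value="$3"
    tmp_file=$(mktemp)
    awk -v k="$key" -v v="$value" '
        $1 == k { print k, v; found = 1; next }   # overwrite an existing entry
        { print }                                  # keep every other line as is
        END { if (!found) print k, v }             # append when the key was missing
    ' "$conf_file" > "$tmp_file" &&
        cat "$tmp_file" > "$conf_file" &&          # keeps owner and permissions of the file
        rm -f "$tmp_file"
}
```

For the workaround above the call would be set_spark_property "$SPARK_HOME/conf/spark-defaults.conf" spark.hadoop.yarn.timeline-service.enabled false, run as a user that can write to the file. Running it a second time leaves the file unchanged.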

Spark History Server

How Spark History Server is configured and brought to life is explained here. Absolutely worth setting it up, not only because it is very useful and practical for monitoring Spark applications, but also because in Spark 2.0 the graphical interface is more user friendly and, well, more graphical.

Hive SerDe error when querying from spark-sql

If you plan to use spark-sql, it may be worth checking this post to avoid the jsonserde not found error message.

Notes about Spark 2.0

My Apache Spark 2.0 Notes

Manipulating files in S3 with Spark is mentioned here.

9 thoughts on “Apache Spark 2.0.0 – Installation and configuration”

Abir says:

09/09/2016 at 10:31 am

Hi Marko, i am installing spark 2.0 standalone single user, here i am having some problems while installing,
1. i am not getting HDP version since i am not using hortonworks distribution i am using hadoop 2.7.3.
2. Java-opts is it required, as i am not having java-opts template in spark conf directory.
3. “Fixing links in Ubuntu” is it necessary.
rest all other things are fine as i put them all in place.

Thank you


1. markobigdata says:
  
  10/09/2016 at 7:41 am
  
  Hi Abir.
  That HDP version is not needed unless you have the HDP distribution. I am running Spark 2.0 without specifying HDP values on a non-HDP cluster.
  Fixing links in Ubuntu is also HDP-specific. By default the links point to the HDP directories, and if you install Spark manually on HDP those links are broken and have to be repaired.
  
  
  1. abirjameel says:
    
    25/09/2016 at 10:43 pm
    
    Thanks Marko, your blogs and site proved to be a great help for me while installing and getting started with spark.
    
Anonymous says:

07/10/2016 at 1:53 pm

Thanks a lot Marko,it was very helpful


Anonymous says:

15/06/2017 at 7:28 pm

Hi Mark,
Thanks for the post , I am trying to install Spark 2.0 on a cluster with HDP 2.3.4 which already has Spark 1.5 running. What is happening is when i try to run spark in yarn mode it is resolving to Spark 1.5 jars complaining bad substitution. Makes sense but couldnt’t think of a solution. Any thoughts ?


1. markobigdata says:
  
  15/06/2017 at 11:08 pm
  
  A couple of questions to understand this better: Is there any reason you have Spark 1.5 installed? What prevents you from upgrading HDP to 2.6?
  
  
  1. Anonymous says:
    
    20/06/2017 at 4:45 pm
    
    Spark 1.5 was installed part of HDP 2.3.4 . We are a big corporate environment and to upgrade to another version is a 6 month long project. So I am trying to install spark 2.1 for the analytics group.
    
prasad says:

20/06/2017 at 8:36 pm

Spark 1.5 was installed along with HDP 2.3.4 , We can live without spark 1.5. We are a big corporation and any upgrades has to go through a lot of churning which takes a good 6 months. Spark 2.1 is a requirement from a small analytics team that is also critical. So trying to cohost them.


1. markobigdata says:
  
  21/06/2017 at 7:40 am
  
  In my experience (this was on HDP 2.4, I think), I installed Spark through Ambari and then installed Spark manually on a client node. I was having some issues as well, so I removed the Spark from Ambari and used only the manually installed version.
  I'm still installing Spark and Zeppelin manually, not through Ambari.
  (when I say manually, I mean Apache versions, not manual HDP install from CLI).
  Is it an option to remove the Spark 1.5 from the services in Ambari?
  Can you contact me through linkedin or marko_kole (at) yahoo.com and we could talk about this challenge?
  

markobigdata

Big Data documentation in a blog
