I am running an HDP 2.4 multinode cluster with Ubuntu Trusty 14.04 on all my nodes. The Spark in this post is installed on my client node. My cluster has HDFS and YARN, among other services, all installed from Ambari. This is not the case for Apache Spark 2.0, because Hortonworks does not offer Spark 2.0 on HDP 2.4.0.
The documentation on the latest Spark version can be found here.
My notes on Spark 2.0 can be found here (if anyone finds them useful).
My post on setting up Apache Spark 1.6.0.
Update 12 January 2018: A post on how to install Apache Spark 2.2.1 on Ubuntu 16.04 – a Hadoop-less instance.
Prerequisites
Java
Update and upgrade the system and install Java
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y
Add JAVA_HOME to the system variables file
sudo vi /etc/environment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
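To confirm the installation and check that the variable is picked up in the current shell, the file can be sourced and Java checked:
java -version
source /etc/environment
echo $JAVA_HOME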
Spark user
Create user spark and add it to the hdfs group
sudo adduser spark
sudo usermod -a -G hdfs spark
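The new user's group membership can be verified with:
id spark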
HDFS home directory for user spark
Create spark’s user folder in HDFS
sudo -u hdfs hadoop fs -mkdir -p /user/spark
sudo -u hdfs hadoop fs -chown -R spark:hdfs /user/spark
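The ownership of the new home directory can be checked with:
sudo -u hdfs hadoop fs -ls /user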
Spark installation and configuration
Install Spark
Create the directory where the Spark directory is going to reside and step into it
sudo mkdir /usr/apache
cd /usr/apache
Download Spark 2.0.0 from https://spark.apache.org/downloads.html
sudo wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
Unpack the tar file
sudo tar -xvzf spark-2.0.0-bin-hadoop2.7.tgz
Remove the tar file after it has been unpacked
sudo rm spark-2.0.0-bin-hadoop2.7.tgz
Change the ownership of the folder and its elements
sudo chown -R spark:spark spark-2.0.0-bin-hadoop2.7
Update system variables
Step into the spark-2.0.0 directory and run pwd to get the full path
cd spark-2.0.0-bin-hadoop2.7
pwd
Update the system environment file by adding SPARK_HOME and adding SPARK_HOME/bin to the PATH
sudo vi /etc/environment
export SPARK_HOME=/usr/apache/spark-2.0.0-bin-hadoop2.7
At the end of PATH add
${SPARK_HOME}/bin
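For reference, a minimal sketch of how the relevant lines in /etc/environment could look after these edits (the default PATH entries on your system may differ):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/usr/apache/spark-2.0.0-bin-hadoop2.7
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:${SPARK_HOME}/bin"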
Refresh the system environments
source /etc/environment
Log and pid directories
Create log and pid directories
sudo mkdir /var/log/spark
sudo chown spark:spark /var/log/spark
sudo -u spark mkdir $SPARK_HOME/run
Spark configuration files
Hive configuration
Create a Hive warehouse and give permissions to the users. If the Hive service is set up, the path to the Hive warehouse could be /apps/hive/warehouse
sudo -u hive hadoop fs -mkdir /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod -R 777 /user/hive/warehouse
Find the hive-site.xml file (HDP versions usually have it in /usr/hdp/current/hive-client/conf) and copy it to $SPARK_HOME/conf.
The following property name in the file should be altered:
hive.metastore.warehouse.dir is replaced with spark.sql.warehouse.dir.
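A sketch of the copy step (assuming the default HDP location mentioned above):
sudo cp /usr/hdp/current/hive-client/conf/hive-site.xml $SPARK_HOME/conf/
sudo chown spark:spark $SPARK_HOME/conf/hive-site.xml
In the copied file, the renamed property could then look like this (the warehouse path follows the example above and may differ on your cluster):
<property>
  <name>spark.sql.warehouse.dir</name>
  <value>/apps/hive/warehouse</value>
</property>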
Spark environment file
Create a new file under $SPARK_HOME/conf
sudo -u spark vi $SPARK_HOME/conf/spark-env.sh
Add the following lines and adjust accordingly.
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=${SPARK_HOME}/run
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_SUBMIT_OPTIONS="--jars ${SPARK_HOME}/lib/spark-csv_2.11-1.4.0.jar"
The last line serves as an example of how to add external libraries to Spark. This particular package is quite common and it is advisable to install it. The package can be downloaded from this site; a possible download command is sketched below.
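A possible way to fetch that jar into a lib folder under $SPARK_HOME (the Maven Central URL is an assumption; verify the artifact path before running):
sudo -u spark mkdir -p $SPARK_HOME/lib
sudo -u spark wget http://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar -P $SPARK_HOME/lib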
Spark default file
Fetch HDP version
hdp-select status hadoop-client | awk '{print $3;}'
Example output for HDP 2.4:
2.4.0.0-169
Example output for HDP 2.5:
2.5.0.0-1245
Create spark-defaults.conf file in $SPARK_HOME/conf
sudo -u spark vi $SPARK_HOME/conf/spark-defaults.conf
Add the following and adjust accordingly (some properties belong to Spark History Server whose configuration is explained in the post in the link below)
spark.driver.extraJavaOptions -Dhdp.version=2.4.0.0-169
spark.yarn.am.extraJavaOptions -Dhdp.version=2.4.0.0-169
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address spark-server:18080
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3
spark.jars.packages com.databricks:spark-csv_2.11:1.4.0
spark.io.compression.codec lzf
spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005
The ports are defined in this configuration file. If they are not, then Spark assigns random ports. More on ports and assigning them in Spark can be found here.
If the ports are not under control, you risk the
Yarn application has already ended! It might have been killed or unable to launch application master.
error. More on that is written here.
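A quick sketch of how to check that the chosen port range is not already taken on the node (the pattern matches the example ports 38000-38005 used above):
netstat -tln | grep ':3800'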
JAVA OPTS
Create java-opts file in $SPARK_HOME/conf and add your HDP version
sudo -u spark vi $SPARK_HOME/conf/java-opts
Example:
-Dhdp.version=2.4.0.0-169
Fixing links in Ubuntu
Since the Hadoop distribution is Hortonworks' and Spark comes from Apache, a workaround is needed. Remove the default link and create a new one.
First, the existing link is removed. Then the new link is created, pointing to $SPARK_HOME.
sudo rm /usr/hdp/current/spark-client
sudo ln -s /usr/apache/spark-2.0.0-bin-hadoop2.7 /usr/hdp/current/spark-client
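To confirm that the link now points to the Apache Spark installation:
ls -l /usr/hdp/current/spark-client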
Spark 2.0.0 is now almost ready.
Jersey problem
If you try to run a spark-submit command on YARN you can expect the following error message:
Exception in thread “main” java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
The jar file jersey-bundle-*.jar is not present in $SPARK_HOME/jars. Adding it fixes this problem:
sudo -u spark wget http://repo1.maven.org/maven2/com/sun/jersey/jersey-bundle/1.19.1/jersey-bundle-1.19.1.jar -P $SPARK_HOME/jars
January 2017 – Update on this issue:
If the Jersey 1 jar is added as described above, Jersey 1 will be used when starting the Spark History Server and the applications in the Spark History Server will not be shown. The following error message will be generated in the Spark History Server output file:
WARN servlet.ServletHandler: /api/v1/applications
java.lang.NullPointerException
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
This problem occurs only when one tries to run Spark on YARN, since YARN 2.7.3 uses Jersey 1 and Spark 2.0 uses Jersey 2.
One workaround is not to add the Jersey 1 jar described above but to disable the YARN Timeline Service in spark-defaults.conf:
spark.hadoop.yarn.timeline-service.enabled false
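With the configuration in place, a quick way to smoke-test the installation on YARN is the SparkPi example that ships with the distribution (a sketch; the examples jar name assumes the stock spark-2.0.0-bin-hadoop2.7 package):
sudo -u spark $SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.0.jar 10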
Spark History Server
How Spark History Server is configured and brought to life is explained here. Absolutely worth setting it up, not only because it is very useful and practical for monitoring Spark applications, but also because in Spark 2.0 the graphical interface is more user friendly and, well, more graphical.
Hive SerDe error when querying from spark-sql
If you plan to use spark-sql, it may be worth checking this post to avoid the jsonserde not found error message.
Notes about Spark 2.0
Manipulating files in S3 with Spark is mentioned here.
Hi Marko, I am installing Spark 2.0 standalone as a single user, and I am having some problems while installing:
1. I am not getting the HDP version, since I am not using the Hortonworks distribution; I am using Hadoop 2.7.3.
2. Is java-opts required? I do not have a java-opts template in the Spark conf directory.
3. Is “Fixing links in Ubuntu” necessary?
All the other things are fine, as I have put them all in place.
Thank you
Hi Abir.
The HDP version is not needed unless you have an HDP distribution. I am running Spark 2.0 without specifying HDP values on a non-HDP cluster.
Fixing links in Ubuntu is also HDP-specific. By default the links point to the HDP directories, and if you install Spark manually on HDP, those links are broken and have to be repaired.
Thanks Marko, your blogs and site proved to be a great help for me while installing and getting started with spark.
Thanks a lot Marko, it was very helpful.
Hi Mark,
Thanks for the post. I am trying to install Spark 2.0 on a cluster with HDP 2.3.4 which already has Spark 1.5 running. What is happening is that when I try to run Spark in YARN mode, it resolves to the Spark 1.5 jars, complaining about bad substitution. Makes sense, but I couldn't think of a solution. Any thoughts?
A couple of questions to understand this better: Is there any reason you have Spark 1.5 installed? What prevents you from upgrading HDP to 2.6?
Spark 1.5 was installed as part of HDP 2.3.4. We are a big corporate environment and upgrading to another version is a six-month-long project. So I am trying to install Spark 2.1 for the analytics group.
Spark 1.5 was installed along with HDP 2.3.4, but we can live without Spark 1.5. We are a big corporation and any upgrade has to go through a lot of churning, which takes a good six months. Spark 2.1 is a requirement from a small analytics team that is also critical, so I am trying to co-host them.
In my experience (this was on 2.4, I think), I installed Spark through Ambari and then installed Spark manually on a client node. I was having some issues as well, so I removed the Spark from Ambari and used only the manually installed version.
I'm still installing Spark & Zeppelin manually, not through Ambari.
(when I say manually, I mean Apache versions, not manual HDP install from CLI).
Is it an option to remove the Spark 1.5 from the services in Ambari?
Can you contact me through linkedin or marko_kole (at) yahoo.com and we could talk about this challenge?