I am running an HDP 2.3.4 multinode cluster with Ubuntu Trusty 14.04 on all my nodes. The Spark in this post is installed on my client node. My cluster has HDFS and YARN, among other services, all installed from Ambari. Spark 1.6 is the exception, because Hortonworks does not offer Spark 1.6 on HDP 2.3.4.
The documentation on Spark version 1.6 is here.
My post on setting up Apache Spark 2.0.0.
Prerequisites
Java
Update and upgrade the system and install Java
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y
Add JAVA_HOME in the system variables file
sudo vi /etc/environment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Spark user
Create user spark and add it to group hadoop
sudo adduser spark
sudo usermod -a -G hadoop spark
HDFS home directory for user spark
Create spark’s user folder in HDFS
sudo -u hdfs hadoop fs -mkdir -p /user/spark
sudo -u hdfs hadoop fs -chown -R spark:hdfs /user/spark
Spark installation and configuration
Install Spark
Create the directory where the Spark directory is going to reside. Hortonworks installs its services under /usr/hdp. I am following their lead and installing all Apache services under /usr/apache. Create the directory and step into it.
sudo mkdir /usr/apache
cd /usr/apache
Download Spark 1.6.0 from https://spark.apache.org/downloads.html. My cluster runs Hadoop 2.7.1; the package pre-built for Hadoop 2.6 does the trick.
sudo wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
Unpack the tar file
sudo tar -xvzf spark-1.6.0-bin-hadoop2.6.tgz
Remove the tar file after it has been unpacked
sudo rm spark-1.6.0-bin-hadoop2.6.tgz
Change the ownership of the folder and its elements
sudo chown -R spark:spark spark-1.6.0-bin-hadoop2.6
Update system variables
Step into the Spark 1.6.0 directory and run pwd to get the full path
cd spark-1.6.0-bin-hadoop2.6
pwd
Update the system environment file by adding SPARK_HOME and appending $SPARK_HOME/bin to the PATH
sudo vi /etc/environment
export SPARK_HOME=/usr/apache/spark-1.6.0-bin-hadoop2.6
At the end of PATH add
${SPARK_HOME}/bin
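After both edits, the relevant lines of /etc/environment should look roughly like this (the JAVA_HOME and SPARK_HOME paths are the defaults used above; your existing PATH entries may differ):

```shell
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/usr/apache/spark-1.6.0-bin-hadoop2.6
# Existing PATH entries kept, ${SPARK_HOME}/bin appended at the end
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:${SPARK_HOME}/bin"
```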
Refresh the system environments
source /etc/environment
Change the owner of $SPARK_HOME to spark
sudo chown -R spark:spark $SPARK_HOME
Log and pid directories
Create log and pid directories
sudo mkdir /var/log/spark
sudo chown spark:spark /var/log/spark
sudo -u spark mkdir $SPARK_HOME/run
Spark configuration files
Hive configuration
sudo -u spark vi $SPARK_HOME/conf/hive-site.xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Make sure that <value> points to the Hive Metastore URI in your cluster -->
    <value>thrift://hive-server:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
Spark environment file
Create a new file under $SPARK_HOME/conf
sudo vi conf/spark-env.sh
Add the following lines and adjust accordingly.
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=${SPARK_HOME}/run
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_SUBMIT_OPTIONS="--jars ${SPARK_HOME}/lib/spark-csv_2.10-1.4.0.jar"
The last line serves as an example of how to add external libraries to Spark. This particular package is quite common and it is advisable to install it. The package can be downloaded from this site.
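With the package on the classpath, reading a CSV file from spark-shell looks roughly like this (a sketch to run inside spark-shell, where sqlContext is predefined in Spark 1.6; the HDFS path is hypothetical):

```scala
// Inside spark-shell, with spark-csv available on the classpath.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line contains column names
  .option("inferSchema", "true")   // infer column types from the data
  .load("hdfs:///user/spark/example.csv")  // hypothetical path

df.printSchema()
```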
Spark default file
Fetch HDP version
hdp-select status hadoop-client | awk '{print $3;}'
Example output
2.3.4.0-3485
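The version string can also be captured into a variable and substituted into the configuration below. A minimal sketch (hdp-select only exists on HDP nodes, so its output is simulated with echo here; on the cluster use the real command shown in the comment):

```shell
# On an HDP node use:
#   HDP_VERSION=$(hdp-select status hadoop-client | awk '{print $3;}')
# Here the hdp-select output is simulated with echo:
HDP_VERSION=$(echo "hadoop-client - 2.3.4.0-3485" | awk '{print $3;}')
echo "-Dhdp.version=${HDP_VERSION}"
```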
Create spark-defaults.conf file in $SPARK_HOME/conf
sudo -u spark vi conf/spark-defaults.conf
Add the following and adjust accordingly (some properties belong to the Spark History Server, whose configuration is explained in the post linked below)
spark.driver.extraJavaOptions -Dhdp.version=2.3.4.0-3485
spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.4.0-3485
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address spark-server:18080
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0
spark.io.compression.codec lzf
spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005
The ports are defined explicitly in this configuration file; if they are not, Spark assigns random ports. More on ports and how to assign them in Spark can be found here.
JAVA OPTS
Create java-opts file in $SPARK_HOME/conf and add your HDP version
sudo -u spark vi conf/java-opts
-Dhdp.version=2.3.4.0-3485
Fixing links in Ubuntu
Since the Hadoop distribution is from Hortonworks while Spark is Apache's, a workaround is needed. Remove the default link and create new ones
sudo rm /usr/hdp/current/spark-client
sudo ln -s /usr/apache/spark-1.6.0-bin-hadoop2.6 /usr/hdp/current/spark-client
sudo ln -s /usr/hdp/current/spark-client/bin/sparkR /usr/bin/sparkR
Spark 1.6.0 is now ready.
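As a quick smoke test, the bundled SparkPi example can be submitted to YARN (a sketch to run on the cluster as user spark; the examples jar name should match the pre-built package downloaded above, but verify it under $SPARK_HOME/lib):

```shell
# Submit the bundled SparkPi example to YARN in client mode
sudo -u spark spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  $SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar 10
```

If the installation is correct, the job finishes and prints an approximation of Pi to the console.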
How Spark History Server is configured and brought to life is explained here.