Configuring Apache Spark History Server

Prior to configuring and running Spark History Server, Spark should be installed.

How to install Apache Spark 1.6.0 is described here.

How to install Apache Spark 2.0 is described here.

Spark History Server

Check that $SPARK_HOME/conf/spark-defaults.conf has History Server properties set

spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080

spark.history.kerberos.keytab none
spark.history.kerberos.principal none

Create spark-history directory in HDFS

sudo -u hdfs hadoop fs -mkdir /spark-history

Change the owner of the directory

sudo -u hdfs hadoop fs -chown spark:hdfs /spark-history

Change permission (be more restrictive if necessary)

sudo -u hdfs hadoop fs -chmod 777 /spark-history
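A slightly more restrictive alternative (my own suggestion, not from the documentation I followed) is to mimic /tmp: keep the directory world-writable so every user can write their event logs, but set the sticky bit so nobody can delete anyone else's:

sudo -u hdfs hadoop fs -chmod 1777 /spark-history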

Add user spark to group hdfs on the instance where Spark History Server is going to run

sudo usermod -a -G hdfs spark

To view Spark jobs from other users
If you open the History Server but cannot see the Spark jobs you are expecting, check the Spark .out file in the Spark log directory. If the error message “Permission denied” is present, the Spark History Server is trying to read the job's log file but has no permission to do so.
The spark user should be added to the group of the Spark job's owner.
For example, user marko belongs to the group employee. If marko starts a Spark job, the log file for this job will have owner and group marko:employee. In order for the spark user to be able to read the log file, it should be added to the employee group. This is done in the following way

sudo usermod -a -G employee spark

Checking spark’s groups

groups spark

should return group employee among spark’s groups.
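To see which user and group own each job's event log, and therefore which group the spark user still needs to join, list the event log directory; checking the History Server's .out file for the error itself also helps (a sketch, using the default paths from this post):

hadoop fs -ls /spark-history
grep "Permission denied" /var/log/spark/*.out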

Start Spark History Server

sudo -u spark $SPARK_HOME/sbin/start-history-server.sh

Output:

starting org.apache.spark.deploy.history.HistoryServer, logging to /var/log/spark/spark-spark-org.apache.spark.deploy.history.HistoryServer-1-t-client01.out

The Spark History Server web UI can be accessed at spark-server:18080. The following screen should load.

[Screenshot: Spark History Server web UI on port 18080]
A fresh Spark History Server installation has no applications to show (no application logs in hdfs:///spark-history yet).

Spark History Server offers a great monitoring interface for Spark applications!
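Besides the web pages, the History Server exposes a REST API on the same port. A quick command-line check that it is serving application data (assuming the host name spark-server used above) is:

curl -s http://spark-server:18080/api/v1/applications
# a fresh installation returns an empty JSON list: []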

WARN ServletHandler: /api/v1/applications

If you happen to start the Spark History Server but see neither completed nor incomplete applications in the web UI, check the log files. If you get something like the following

WARN ServletHandler: /api/v1/applications
java.lang.NullPointerException
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
        at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
        at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
        at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
        at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
        at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.spark_project.jetty.server.Server.handle(Server.java:499)
        at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
        at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
        at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
        at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:745)

Take the jersey-bundle-*.jar file out of the $SPARK_HOME/jars directory. Hortonworks doesn't need it, and you don't need it 🙂
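A minimal sketch of that fix; jars.removed is just a scratch directory name I chose, and the History Server has to be restarted afterwards:

sudo -u spark mkdir -p $SPARK_HOME/jars.removed
sudo -u spark mv $SPARK_HOME/jars/jersey-bundle-*.jar $SPARK_HOME/jars.removed/
sudo -u spark $SPARK_HOME/sbin/stop-history-server.sh
sudo -u spark $SPARK_HOME/sbin/start-history-server.sh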

Building Apache Zeppelin 0.6.0 on Spark 1.5.2 & 1.6.0 in a cluster mode

I have Ubuntu 14.04 Trusty and a multinode Hadoop cluster. Hadoop distribution is Hortonworks 2.3.4. Spark is installed through Ambari Web UI and running version is 1.5.2 (upgraded to 1.6.0).

I am going to explain how I built and set up Apache Zeppelin 0.6.0 on Spark 1.5.2 and 1.6.0.

Prerequisites

Non root account

The Apache Zeppelin creators recommend not using the root account. For this service, I have created a new user, zeppelin.

Java 7

Zeppelin uses Java 7. My system has Java 8, so I have installed Java 7 just for Zeppelin. The installation, done as user zeppelin, is in the following directory:

/home/zeppelin/prerequisities/jdk1.7.0_79

JAVA_HOME is added to the user’s .bashrc.

export JAVA_HOME=/home/zeppelin/prerequisities/jdk1.7.0_79
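For reference, this is roughly what the relevant part of /home/zeppelin/.bashrc looks like; the PATH line is my addition (not strictly required by Zeppelin) so that java on the command line also resolves to Java 7 for this user:

export JAVA_HOME=/home/zeppelin/prerequisities/jdk1.7.0_79
export PATH=$JAVA_HOME/bin:$PATH
# quick check - should report java version "1.7.0_79"
java -version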

Zeppelin log directory

Create zeppelin log directory.

sudo mkdir /var/log/zeppelin

Change ownership.

sudo chown zeppelin:zeppelin /var/log/zeppelin

If this is not done, Zeppelin’s log files are written to a logs folder inside the current working directory.

Clone and Build

Log in as user zeppelin and go to the user’s home directory.

/home/zeppelin

Clone the source code from GitHub.

git clone https://github.com/apache/incubator-zeppelin.git incubator-zeppelin

Zeppelin has a home now.

/home/zeppelin/incubator-zeppelin

Go into Zeppelin home and build Zeppelin

mvn clean package -Pspark-1.5 -Dspark.version=1.5.2 -Dhadoop.version=2.7.1 -Phadoop-2.6 -Pyarn -DskipTests

Build order.

[Screenshot: Maven build order at the start of the Zeppelin build]

7:31 minutes later, Zeppelin is successfully built.

[Screenshot: Zeppelin build success report]

Note!

If you try something like the following two examples:

mvn clean package -Pspark-1.5 -Dspark.version=1.5.0 -Dhadoop.version=2.7.1 -Phadoop-2.7 -Pyarn -DskipTests
mvn clean package -Pspark-1.5 -Dspark.version=1.5.2 -Dhadoop.version=2.7.1 -Phadoop-2.7 -Pyarn -DskipTests

The build will succeed, but this warning will appear at the bottom of the build report:

[WARNING] The requested profile "hadoop-2.7" could not be activated because it does not exist.

The Hadoop profile in the Maven command must be hadoop-2.6, even though the actual Hadoop version is 2.7.x.

hive-site.xml

Copy hive-site.xml from the Hive conf folder (this path applies to the Hortonworks distribution; users of other distributions should check where the file is located).

sudo cp /etc/hive/conf/hive-site.xml $ZEPPELIN_HOME/conf

Change ownership of the file.

sudo chown zeppelin:zeppelin $ZEPPELIN_HOME/conf/hive-site.xml

zeppelin-env.sh

Go to the Zeppelin home and create zeppelin-env.sh from the template in the conf directory.

cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh

Open it and add the following variables:

export JAVA_HOME=/home/zeppelin/prerequisities/jdk1.7.0_79
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.4.0-3485"
export ZEPPELIN_LOG_DIR=/var/log/zeppelin

The variable in the third line depends on the Hortonworks build. Find your hdp version by executing

hdp-select status hadoop-client

If your Hortonworks version is 2.3.4, the output is:

hadoop-client - 2.3.4.0-3485

Zeppelin daemon

Start Zeppelin from Zeppelin home

./bin/zeppelin-daemon.sh start

Status after starting the daemon:

[Screenshot: output of zeppelin-daemon.sh start]

One can check if the service is up:

./bin/zeppelin-daemon.sh status

Status:

[Screenshot: output of zeppelin-daemon.sh status]

Zeppelin can be restarted in the following way:

./bin/zeppelin-daemon.sh restart

Status:

[Screenshot: output of zeppelin-daemon.sh restart]

Stopping Zeppelin:

./bin/zeppelin-daemon.sh stop

Status:

[Screenshot: output of zeppelin-daemon.sh stop]

Configuring interpreters in Zeppelin

Apache Zeppelin comes with many default interpreters. It is also possible to create your own interpreters. How to configure default Spark and Hive interpreters is covered in this post.

Installing R on a Hadoop cluster to run sparkR

The environment

Ubuntu 14.04 Trusty is the operating system. The Hortonworks Hadoop distribution is used to install the multinode cluster.

The installation process described below has been successful with the following Spark versions:

  • Spark 1.4.1
  • Spark 1.5.2
  • Spark 1.6.0

Prerequisites

This post assumes Spark is already installed on the system. How this can be done is explained in one of my posts here.

Setting up sparkR

If the command sparkR ($SPARK_HOME/bin/sparkR) is run right after the Spark installation is complete, the following error is returned:

env: R: No such file or directory

R packages have to be installed on all nodes in order for sparkR to work properly. The Spark 1.6 Technical Preview on the Hortonworks website links to instructions on setting up R on Linux (Installing R on Linux). In my experience, the process was somewhat different.

Again, this process should be done on all nodes.

  1. The Ubuntu archives on CRAN are signed with the key of “Michael Rutter marutter@gmail.com” with key ID E084DAB9 (link)
    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
  2. Now add the key to apt.
    gpg -a --export E084DAB9 | sudo apt-key add -
  3. Fetch Linux codename
    LVERSION=`lsb_release -c | cut -c 11-`

    In this case, it is trusty.

  4. From CRAN – Mirrors or CRAN Mirrors US, find a suitable CRAN mirror and use it in the next step. In this case, a CRAN mirror from Austria is used – cran.wu.ac.at.
  5. Append the repository to the system’s sources.list file
    echo deb https://cran.wu.ac.at/bin/linux/ubuntu $LVERSION/ | sudo tee -a /etc/apt/sources.list
    

    The line added to the sources.list should look something like this:

    deb https://cran.wu.ac.at/bin/linux/ubuntu trusty/

  6. Update the system
    sudo apt-get update
  7. Install r-base package
    sudo apt-get install r-base -y
  8. Install r-base-dev package
    sudo apt-get install r-base-dev -y
  9. In order to avoid the following warning when the Spark service is started:

    WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.

    Add the following export to the system environment file:

    export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native
  10. Test the installation by starting R
    R

    Something like this should show up and R should start.

    R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"

  11. Exit R.
    q()

R is now installed on the first node. Steps 1 to 7 should be repeated on all nodes in the cluster.
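To save some typing on the remaining nodes, the steps above can be collected into a small script and run on each node. This is only a sketch that repeats the apt setup and the r-base/r-base-dev installation with the Austrian mirror; adjust the mirror to your own choice, and drop r-base-dev if you do not need it on every node:

#!/bin/bash
# install R for sparkR on one cluster node
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
LVERSION=`lsb_release -c | cut -c 11-`
echo deb https://cran.wu.ac.at/bin/linux/ubuntu $LVERSION/ | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get install r-base r-base-dev -y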

Testing sparkR

Once R is installed on all nodes, we can test sparkR from the node where the Spark client is installed:

cd $SPARK_HOME
./bin/sparkR

A welcome message from Spark invites the user into the sparkR environment.

[Screenshot: sparkR welcome screen]

Output when starting SparkR

R version 3.2.3 (2015-12-10) — “Wooden Christmas-Tree”
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type ‘license()’ or ‘licence()’ for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type ‘contributors()’ for more information and
‘citation()’ on how to cite R or R packages in publications.

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or
‘help.start()’ for an HTML browser interface to help.
Type ‘q()’ to quit R.

Launching java with spark-submit command /usr/hdp/2.3.4.0-3485/spark/bin/spark-submit “sparkr-shell” /tmp/RtmpN1gWPL/backend_port18f039dadb86
16/02/22 22:12:40 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:41 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:41 INFO SparkContext: Running Spark version 1.5.2
16/02/22 22:12:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
16/02/22 22:12:41 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:41 INFO SecurityManager: Changing view acls to: ubuntu
16/02/22 22:12:41 INFO SecurityManager: Changing modify acls to: ubuntu
16/02/22 22:12:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
16/02/22 22:12:42 INFO Slf4jLogger: Slf4jLogger started
16/02/22 22:12:42 INFO Remoting: Starting remoting
16/02/22 22:12:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.x.xxx.xxx:59112]
16/02/22 22:12:42 INFO Utils: Successfully started service ‘sparkDriver’ on port 59112.
16/02/22 22:12:42 INFO SparkEnv: Registering MapOutputTracker
16/02/22 22:12:42 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:42 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:42 INFO SparkEnv: Registering BlockManagerMaster
16/02/22 22:12:42 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-aeca25e3-dc7e-4750-95c6-9a21c6bc60fd
16/02/22 22:12:42 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/02/22 22:12:42 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:42 INFO HttpFileServer: HTTP File server directory is /tmp/spark-009d52a3-de7e-4fa1-bea3-97f3ef34ebbe/httpd-c9182234-dec7-438d-8636-b77930ed5f62
16/02/22 22:12:42 INFO HttpServer: Starting HTTP Server
16/02/22 22:12:42 INFO Server: jetty-8.y.z-SNAPSHOT
16/02/22 22:12:42 INFO AbstractConnector: Started SocketConnector@0.0.0.0:51091
16/02/22 22:12:42 INFO Utils: Successfully started service ‘HTTP file server’ on port 51091.
16/02/22 22:12:42 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/22 22:12:43 INFO Server: jetty-8.y.z-SNAPSHOT
16/02/22 22:12:43 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/02/22 22:12:43 INFO Utils: Successfully started service ‘SparkUI’ on port 4040.
16/02/22 22:12:43 INFO SparkUI: Started SparkUI at http://10.x.x.108:4040
16/02/22 22:12:43 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:43 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:43 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:43 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/02/22 22:12:43 INFO Executor: Starting executor ID driver on host localhost
16/02/22 22:12:43 INFO Utils: Successfully started service ‘org.apache.spark.network.netty.NettyBlockTransferService’ on port 41193.
16/02/22 22:12:43 INFO NettyBlockTransferService: Server created on 41193
16/02/22 22:12:43 INFO BlockManagerMaster: Trying to register BlockManager
16/02/22 22:12:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:41193 with 530.0 MB RAM, BlockManagerId(driver, localhost, 41193)
16/02/22 22:12:43 INFO BlockManagerMaster: Registered BlockManager
16/02/22 22:12:43 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/
Spark context is available as sc, SQL context is available as sqlContext
>


If it happens that the prompt is not displayed and you see something like this instead:

16/02/22 22:52:12 INFO ShutdownHookManager: Shutdown hook called
16/02/22 22:52:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-009d52a3-de7e-4fa1-bea3-97f3ef34ebbe

Just press Enter.
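Once the prompt is there, a short sanity check can be typed at the > prompt (a sketch; faithful is a dataset that ships with base R, and createDataFrame/head are part of the SparkR 1.5/1.6 API):

# create a distributed DataFrame from a local R data.frame and show the first rows
df <- createDataFrame(sqlContext, faithful)
head(df)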


Possible errors

No public key

W: GPG error: https://cran.wu.ac.at trusty/ Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 51716619E084DAB9

Step 1 was not performed.

Hash Sum mismatch

When updating Ubuntu the following error message shows up:

Hash Sum mismatch

Solution:

sudo rm -rf /var/lib/apt/lists/*
sudo apt-get clean
sudo apt-get update

Resources

1 – Spark 1.6 Technical Preview

2 – Installing R on Linux

3 – CRAN – Mirrors

4 – CRAN Secure APT

Installing Spark on a Hortonworks cluster using Ambari

The environment

Ubuntu Trusty 14.04 is the operating system. Ambari is used to install the cluster. MySQL is used for storing Ambari’s metadata.
Spark is installed on a client node.


Note!

My experience with administering Spark from Ambari has led me to install Spark manually, not from Ambari and not from Hortonworks packages. I install Apache Spark manually on a client node – described here.

Some reasons for that are:

  • A new Spark version is available every quarter – Hortonworks does not keep up
  • The possibility of running different Spark versions on the same client
  • Better control over configuration files
  • Custom definition of Spark parameters for running multiple Spark contexts on the same client node (more in this post).


Installation process in Ambari

The Hortonworks distribution (version 2.3.4) is installed using Ambari.
Services installed first: HDFS, MapReduce, YARN, Ambari Metrics, ZooKeeper – I prefer to install these first in order to check that the bare minimum is up and running.

In the next step, Hive, Tez and Pig are installed.

After these services are successfully installed, Spark is added.

Spark versions

Spark is now installed. Hortonworks distribution 2.3.4 offers Spark 1.4.1 in the Choose Services menu:

[Screenshot: Spark 1.4.1 offered in Ambari's Choose Services menu]

Running the command spark-shell on the Spark server reveals that 1.5.2 was actually installed:

[Screenshots: spark-shell startup banner and sc.version output showing version 1.5.2]

Spark’s HOME

Spark’s home directory ($SPARK_HOME) is /usr/hdp/current/spark-client. It is smart to export $SPARK_HOME, since it is referred to by services that build on top of Spark.
Spark’s conf directory ($SPARK_CONF_DIR) is /usr/hdp/current/spark-client/conf.

The current folder contains nothing but symlinks to the installed Hortonworks version. This means that /usr/hdp/current/spark-client is linked to /usr/hdp/2.3.4.0-3485/spark/.
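A minimal sketch of that export (add it to the profile of whichever user runs Spark jobs) and of checking the link for yourself:

export SPARK_HOME=/usr/hdp/current/spark-client
export SPARK_CONF_DIR=$SPARK_HOME/conf
# -d lists the link itself rather than its contents
ls -ld /usr/hdp/current/spark-client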

Comments on the installation

The Spark installation from Ambari has, among other things, created a Linux user spark and a directory on HDFS – /user/spark.

The Spark commands that were installed are the following:
spark-class, spark-shell, spark-sql and spark-submit – these can be called from anywhere, since they are linked in /usr/bin.
Other Spark commands, not linked in /usr/bin but executable from $SPARK_HOME/bin, are beeline, pyspark and sparkR.
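This can be verified quickly on the client node (output will of course vary with the installed version):

ls -l /usr/bin/spark-*    # the commands linked in /usr/bin
ls $SPARK_HOME/bin        # also contains beeline, pyspark and sparkR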

Connection to Hive

In $SPARK_CONF_DIR, a hive-site.xml file can be found. The file has the following content:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hive-server:9083</value>
  </property>
</configuration>

With this property, Spark connects to Hive. Here are two lines from the output when the command spark-shell is executed:

16/02/22 13:52:37 INFO metastore: Trying to connect to metastore with URI thrift://hive-server:9083
16/02/22 13:52:37 INFO metastore: Connected to metastore.
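A quick way to confirm the Hive connection from the command line is to let spark-sql list the Hive databases (a sketch; run as the spark user, and it should show at least the default database):

sudo -u spark spark-sql -e "SHOW DATABASES;"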

Logs

Spark’s log files are by default in /var/log/spark. This can be changed in Ambari: Spark -> Configs -> Advanced spark-env, property spark_log_dir.

Running Spark commands

Examples of how to execute the Spark commands (taken from the Hortonworks Spark 1.6 Technical Preview).
These should be run as the spark user from $SPARK_HOME.

spark-shell

spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

spark-submit

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

sparkR

Running sparkR ($SPARK_HOME/bin/sparkR) returns the following:

env: R: No such file or directory

R is not installed yet. How to install the R environment is described here.

Resources

Spark 1.6 Technical Preview