My Work cluster in detail

The cluster was built on OpenStack private cloud owned by a Swiss organization Switch.

The Hadoop distributor was Hortonworks, except for Spark and Zeppelin, who were Apache’s.

Potential users

Since the project owner was an organization supporting educational entities in Switzerland, the potential users were researchers, scientists, students…

I had the luxury of having almost unlimited resources on the infrastructure so I have built 5 Hadoop clusters – 4 were Hortonworks Hadoop clusters, one was Apache Hadoop cluster. Out of the 4, one was the Work environment which was exposed to the end users. And this is the cluster that is described in detail in this post.
Keep in mind that I was working on my own on this development – which meant administering and upgrading 5 clusters and doing data science at the same time. In order to make it work, I had to use the YARN inside me and distribute the limited resources effectively.

Initial resources

Keeping in mind the point of distributed systems is scalability, I have defined the initial cluster with the following capabilities.
6 instances with corresponding details:

  • Ambari Server
  • NameNode
  • DataNode (3)
  • Client
Instance RAM VCPU Default disk size Volume No. Volume Size Security group
Ambari 8GB 8 VCPU 20GB None None sg-ambari
NameNode 32GB 8 VCPU 20GB 1 200GB sg-namenode
DataNode (3) 32GB 8 VCPU 20GB 3 200GB sg-datanode
Client 16GB 16 VCPU 20GB 1 500GB sg-client

Note: There were three DataNodes in the initial cluster.

Characteristics of the cluster

The initial cluster had 1.7 TB HDFS, replication factor was 3, block size was the default 128MB. Rack awareness was not set in the initial cluster and the queue was the default.
On the YARN side, I have made some changes and had 84GB RAM (3 x 32GM = 96GB RAM. 4GB per DataNode was left for services on the instance -> 96GB – 12GB = 84GB) as maximum amount of RAM resources for the cluster – the default values by Apache (Hortonworks?) are quite more conservative.

In the cluster building process the versions were Ambari 2.1 and HDP 2.3. When Ambari 2.2 and HDP 2.4 were available, the cluster was upgraded.


Ambari had a server for itself, the database for collecting statistics was MySql. The idea was always to migrate the Ambari Server if needed. The migration to new Ambari server is easy so I could afford to start small for this service.
The Ambari Views was enabled for the users who wanted to upload the files to the HDFS manually. Hive was also available through this service and I on my one of my test environments, I have even embedded Zeppelin in Ambari Views. Though, on the Work cluster, Zeppelin was offered only as an independent service on the Client.
All the ports for Ambari to properly work were in the sg-ambari security group.


The initial plan with the NameNode was to run all the services on it except Spark and Zeppelin. When the resources would expand beyond the instance’s capabilities, some services would be moved to a new instance, or unused services would be stopped (experience showed Hive had little popularity among the academia). Using Ambari, migrating services is an easy process, I could afford to have all services running on one NameNode. Only cluster administrator had access to this instance. With other words, client tools were not installed on this instance.
All the ports for the NameNode to properly work were in the sg-namenode security group.


I started with 3 DataNodes, which offered 1.7TB of storage on the HDFS. The DataNodes were also used as Workers for Spark and Supervisors for Storm. The users had no access to the DataNodes directly – no client was installed here. This would change according to the needs so that some jobs could access data directly locally.
All the ports for the DataNodes to properly work were in the sg-datanode security group.


The client was users’ window to the cluster. Spark 2.0 (before Summer 2016 it was Spark 1.6) was offered as the computational engine – only one. One reason was also easier administration and optimizatoin from my side.
The users could use the command line interface (CLI), RStudio or Zeppelin. Ambari Views as well, but that was running on Ambari instance. More advanced users went with the CLI, users who wanted to learn Spark were using Zeppelin.
Client for Storm was also installed on this instance. Due to more complex programming (in Java), all the Topologies were handled by me, the users were defining requirements and using the data stored by the Storm.
All the ports for the Client to properly work were in the sg-datanode security group.

See below for page 2.

Streaming with Storm – simple example with HDFS bolt

This post describes a simple Storm topology – random words are written to HDFS. The topology is uploaded on the cluster from the client node. Nimbus is on the cluster’s NameNode. I have 4 DataNodes and on each of them a Supervisor is installed. More on how I installed and configured Storm can be found here.

Services used

I am using Hortonworks 2.4, Hadoop is version 2.7.1, Storm is version 0.10.0. All services were installed through Ambari.

Preparing development environment

Create a new maven project. How to install maven is explained here.

mvn archetype:generate -DgroupId=org.package -DartifactId=storm-project -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

When the project is created, step into the directory (in this case it is storm-project) where the pom.xml file is also located.

In the org.package (./src/main/java/org.package), create folder spout. The can be deleted.

There are 3 files important for this topology: pom.xml, the spout file and the topology file.

Prepare pom.xml

The pom file for this case includes Storm dependencies, with scope provided. Storm jars are not packed together with the topology! It is important to match the versions.


Add build node with the plugin

                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">


Add clojure in the dependencies node. Be sure to check for newer version



Make sure the version matches Storm installation

    <!-- keep storm out of the jar-with-dependencies -->


Hadoop client XML node. Make sure the version matches your Hadoop installation. org.slf4j is omitted otherwise messages about multiple version of the package are appearing



Hadoop hdfs XML node. Make sure the version matches your Hadoop installation. org.slf4j is again omitted




Now that the pom.xml is in order, you can package the project to see if pom.xml is valid

mvn package

Build success should appear. If not, the pom.xml is invalid and should be taken care of.

Click on the next page for Spout.

Installing and configuring Storm in Ambari

About Storm

Storm is a free and open source distributed real-time computation system.
Storm cluster follows master-slave model and Zookeeper is used for coordination. All data is stored in ZooKeeper.
The basic unit of data processed by Storm is tuple. Tuple consists of predefined list of fields.

Storm cluster on Hadoop

The following graphic explains the architecture one ends up with after following this post. In black text, the Hadoop nodes are shown, in blue text, Storm nodes are shown.

6 nodes in the cluster. One is dedicated NameNode, one is Client and four are DataNodes.
6 nodes in the cluster. One is dedicated to Nimbus, DRPC Server and Storm UI Server, one is Storm Client and four are Supervisors.



Make sure you open the following ports:
Node where Nimbus (master) is (are) installed: 2181, 6627.
Nodes where Supervisors (slaves) will be installed: 6700, 6701 (and so on, depending on the number of workers per supervisor).
Default Storm UI Server port is 8744, open the port on the node where this service is installed.

Adding Service in Ambari

Add Service Service

Select the Storm service:

Click Next.

Assign Masters


Nimbus is the master, responsible for distributing code across worker nodes, assigning tasks, monitoring tasks for any failures and restarting them when required. Nimbus and slaves communicate through ZooKeeper.

Click Next.

Assign Slaves and Clients

Check Supervisors on all datanodes you wish to use as supervisors.
Supervisor nodes are worker nodes.

Click Next.

Customize Services

Define ports on supervisors. One port per worker. By defining the ports one basically defines how many workers per supervisor will run.


Leave the default ports for now.


If everything is ok, Click Deploy.

Install, Start, Test

When the installation is complete, click Next.

Restart Required

Restart HDFS, MapReduce2, YARN and Hive. Ambari reminds you about that. The Storm Web UI should now be available on the server where Storm UI Server is installed and on port 8874.

Adding Nimbus

Adding Nimbus is quite straightforward.

In Ambari, click on service Storm.

On the right side, there is a menu Service Actions. Click on it and select Add Nimbus.

Choose the host to add Nimbus component. In this case, I am adding a Nimbus to mz client node in the cluster.


Click OK on the confirmation box


The Nimbus is now installed. On two instances – client and NameNode.

Restart of the Storm service is needed to make the second Nimbus part of services. The newly added Nimbus has status “Not a Leader”, while the primary Nimbus has status “Leader”.


Storm client? Yes, with a small workaround

Since I am not implementing High Availability for Storm, there is no need for two Nimbuses. The reason I added one Nimbus to the client is to get Storm client on it.

So if I remove the Nimbus from the client node, the Storm packages remain and potential Storm users can access the Storm service from the client – just like any other services in the cluster.

I can remove the Nimbus from the client just like any other service in Ambari – I stop the service and delete it.

The storm.yaml on the Client will be used when uploading the topologies and at the moment, the property nimbus.seeds has 2 properties – client FQDN and NameNode FQDN – each for one Nimbus location. The upload will still work, but if the non-existing Nimbus server is checked first, it will return an error and look for the next Nimbus server on the list.

Overview over Storm in Ambari

The summary in Ambari reveals the following picture:


One Nimbus (master), 4 Supervisors (slaves) and 8 slots (4 Supervisors x 2 ports, one for each worker on each Supervisor).

Learning about Storm

I have taken the Udacity course Real-Time Analytics with Apache Storm by Twitter. Great course! Very well explained and besides learning about Storm, I also became familiar with in-memory database Redis.

My topology

I have a test topology running which takes in tweets and “bolts” them in the following storages:

  • pushes raw JSON files directly to HDFS
  • creates tuples (user-tweet), does data cleansing and pushes them in Redis
  • pushes information about user, tweet, date to MySql

I keep upgrading and improving my Topology.

Further work

  1. Working with Trident
  2. Checking how Spark Streaming can compete with Storm
  3. Testing Apache Samza to find out why LinkedIn was not happy with Storm and decided to develop Samza

Now we can start playing with Storm! Here is an example of Storm topology that takes random words and pushes them into HDFS.

Yarn application has already ended! It might have been killed or unable to launch application master.

If you are struggling with the error message in title of the post check if you are controlling ports that Spark needs. I have experienced that if the ports Spark is using can not be reached, YARN is going to terminate with the error message in the title. So it is best to control Spark ports and open them so that the YARN application would go through. More on Spark and networking here.

Spark chooses random ports and unless you have ALL ports open, you might run into the “endless”

INFO Client: Application report for application_1470560331181_0013 (state: ACCEPTED)

which eventually fails

INFO Client: Application report for application_1470560331181_0013 (state: FAILED)

and the error message returned would be

ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

Adding something like this in spark-defaults.conf

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

could solve this issue.

My notes on installing Spark 2.0 are here.

And how to install Spark 1.6 is described here.

Configuring Apache Spark History Server

Prior to configuring and running Spark History Server, Spark should be installed.

How to install Apache Spark 1.6.0 is described here.

How to install Apache spark 2.0 is described here.

Spark History server

Check that $SPARK_HOME/conf/spark-defaults.conf has History Server properties set

spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080

spark.history.kerberos.keytab none
spark.history.kerberos.principal none

Create spark-history directory in HDFS

sudo -u hdfs hadoop fs -mkdir /spark-history

Change the owner of the directory

sudo -u hdfs hadoop fs -chown spark:hdfs /spark-history

Change permission (be more restrictive if necessary)

sudo -u hdfs hadoop fs -chmod 777 /spark-history

Add user spark to group hdfs on the instance where Spark History Server is going to run

sudo usermod -a -G hdfs spark

To view Spark jobs from other users
When you open the History Server and you are not able to see Spark jobs you are expecting to see, check the Spark out file in the Spark log directory. If error message “Permission denied” is present, Spark History Server is trying to read the job log file, but has no permission to do so.
Spark user should be added to the group of the spark job owner.
For example, user marko belongs to a group employee. If marko starts a Spark job, the log file for this job will have user and group marko:employee. In order for spark to be able to read the log file, spark user should e added to the employee group. This is done in the following way

sudo usermod -a -G employee spark

Checking spark’s groups

groups spark

should return group employee among spark’s groups.

Start Spark History server

sudo -u spark $SPARK_HOME/sbin/


starting org.apache.spark.deploy.history.HistoryServer, logging to /var/log/spark/spark-spark-org.apache.spark.deploy.history.HistoryServer-1-t-client01.out

Accessing Spark History server from the web UI can be done by accessing spark-server:18080. The following screen should load.

A fresh Spark History Server installation has no applications to show (no applications in hdfs:/spark-history).

Spark History Server offers a great monitoring interface for Spark applications!

WARN ServletHandler: /api/v1/applications

If you happen to start Spark History Server but get neither completed nor incompleted applications on the Web UI, check the log files. If you get something like the following

WARN ServletHandler: /api/v1/applications
        at org.glassfish.jersey.servlet.ServletContainer.service(
        at org.glassfish.jersey.servlet.ServletContainer.service(
        at org.glassfish.jersey.servlet.ServletContainer.service(
        at org.spark_project.jetty.servlet.ServletHolder.handle(
        at org.spark_project.jetty.servlet.ServletHandler.doHandle(
        at org.spark_project.jetty.server.handler.ContextHandler.doHandle(
        at org.spark_project.jetty.servlet.ServletHandler.doScope(
        at org.spark_project.jetty.server.handler.ContextHandler.doScope(
        at org.spark_project.jetty.server.handler.ScopedHandler.handle(
        at org.spark_project.jetty.servlets.gzip.GzipHandler.handle(
        at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(
        at org.spark_project.jetty.server.handler.HandlerWrapper.handle(
        at org.spark_project.jetty.server.Server.handle(
        at org.spark_project.jetty.server.HttpChannel.handle(
        at org.spark_project.jetty.server.HttpConnection.onFillable(
        at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(
        at org.spark_project.jetty.util.thread.QueuedThreadPool$

Take the jersey-bundle-*.jar file out of the $SPARK_HOME/jars directory. Hortonworks dont need it, you dont need it 🙂

Configuring Ranger Plugins in Ambari

In previous post, I described how to install Ranger in Ambari on HDP.


Ranger allows (through configuration) both Ranger policies and HDFS permissions to be checked for a user request. When a user request is received in NameNode, Ranger plugin will check for policies set through Ranger admin. If there are no policies, Ranger plugin will check for permission set in HDFS.

It is recommended to have restrictive permission at HDFS level and create permission in Ranger security admin.

Configuring HDFS Plugin happens in two places – HDFS service and Ranger service.

HDFS service

Select HDFS service from the Services menu.

Open Advanced ranger-hdfs-plugin-properties ad check the Enable Ranger for HDFS checkbox.

Change the following property by replacing NAMENODE_HOSTNAME with the RANGER_HOST.


If you are using an older HDP version, check Audit to DB.

audit to db

Change HDFS umask from 022 to 077.

umask 077

Save the properties and restart the service.

The following message appears, click OK to restart HDFS.

dependent configurations

Ranger service

In Ranger, under tab Config

Switch on HDFS Ranger Plugin



Change the audit source type from default solr to db.

audit source type

Save and restart Ranger service.


comming soon…

Adding and configuring service Ranger in Ambari

Ranger is a framework to enable, monitor and manage data security in Hadoop cluster. The service comes from Hortonworks and is a part of Apache family now.

This post describes how Ranger 0.5.0 is installed and configured  with audit data stored in a database. Default setting is Solr, my cluster does not have Solr, but it has a MySql database.

My Hadoop distribution is Hortonworks and versions mentioned in this post are 2.3.4 and 2.5.


Database preparation

Install MySql

(If not installed yet)

sudo apt-get install mysql-server -y

Set up Ranger database

Note for HDP 2.3.4!
Ranger database has to be created manually otherwise the installation will not go through. If you are using HDP 2.5, this is done through Ambari Add Service Wizard. Move on to “Adding Service in Ambari”.

create database ranger;
CREATE USER 'ranger'@'localhost' IDENTIFIED BY 'ranger';
GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'localhost';
CREATE USER 'ranger'@'%' IDENTIFIED BY 'ranger';
GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'%';

If the MySql database is on another server than Ranger, check from RANGER_SERVER if you can log in to the database

mysql -u ranger -pranger -h MYSQL_SERVER

Adding Service in Ambari

Start Add Service Wizard and choose service Ranger

Add service

Some requirements have to be fulfilled.

Ranger Requirements

Check if MySql Java Connector is present on Ambari Server

ls /usr/share/java/mysql-connector-java.jar

Run the following on Ambari Server if the file is present

sudo ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar


Using python  /usr/bin/python
Setup ambari-server
Copying /usr/share/java/mysql-connector-java.jar to /var/lib/ambari-server/resources
If you are updating existing jdbc driver jar for mysql with mysql-connector-java.jar. Please remove the old driver jar, from all hosts. Restarting services that need the driver, will automatically copy the new jar to the hosts.
JDBC driver was successfully initialized.
Ambari Server 'setup' completed successfully.

Assign masters for both Ranger services. In this case, the services are installed on the NameNode.

Assign masters

Choose DB flavor, tye in ranger DB host and ranger password (same as in the script from the previous chapter)

Wizard - Ranger Admin

Type password for root user and test the connection.

Wizard root password

If the MySql database is on another server, user has to be created and grants for root from Ranger server have to be granted.


In the Audit tab:
– switch off Audit to Solr
– switch on Audit to HDFS
– switch on Audit to DB and type in password for Ranger Audit user. (HDP 2.3.4)

HDP 2.5: Audit to DB is not an option anymore.

Wizard - audit storage

Ranger is now installed and can be accessed on the RANGER_SERVER:6080.

Note: the Ranger WEB UI not showing up?
Make sure port 6080 is open.

If the URL is an internal IP address read on:
External URL has to be corrected to ranger host. Authentication in this example is UNIX.

Wizard - ranger url only 2-3-4

Continue to the next step.

Review of the installation follows, if everything is ok, start with the Install, Start and Test.

Upgrading Hortonworks Data Platform from 2.3.4 to 2.4.0

This post describes how to do an Express Upgrade of Hortonworks Data Platform (HDP) with Ambari.

Ugrading HDP begins with upgrading Ambari, Ambari Metrics and, not mandatory but recommended, adding Grafana.

When this is in place and all services are up and running, Upgrading HDP to 2.4 can begin.


File backup

Creating a backup of all the important files and databases is the first step. The following steps are done on the NameNode.

Create backup directory

mkdir /home/ubuntu/HDP-2.3.4-backup

Run HDFS filesystem check and save the ouptut to a file in the backup directory

sudo -u hdfs hdfs fsck / -files -blocks -locations > /home/ubuntu/HDP-2.3.4-backup/dfs-old-fsck-1.log

Gather basic filesystem information and statistics in a report

sudo -u hdfs hdfs dfsadmin -report > /home/ubuntu/HDP-2.3.4-backup/dfs-old-report-1.log

List the whole HDFS directory and save the ouptut to a file

sudo -u hdfs hdfs dfs -ls -R > /home/ubuntu/HDP-2.3.4-backup/dfs-old-lsr-1.log

Enter Safemode, mandatory for next steps

sudo -u hdfs hdfs dfsadmin -safemode enter

Save current namespace and reset edits log

sudo -u hdfs hdfs dfsadmin -saveNamespace

Make a copy of the VERSION file (here is HDP’s default directoy, file VERSION should reside in ${}/current)

sudo cp /hadoop/hdfs/namenode/current/VERSION /home/ubuntu/HDP-2.3.4-backup/

Leave Safemode

sudo -u hdfs hdfs dfsadmin -safemode leave

Finalize upgrade of HDFS
According to the Apache Hadoop documentation:

“Datanodes delete their previous version working directories, followed by Namenode doing the same. This completes the upgrade process.”

sudo -u hdfs hdfs dfsadmin -finalizeUpgrade

Database backup

My cluster has MySql database that is used by Hive and Ranger. That means I have 3 databases to back up: hive, ranger and ranger_audit (since I am storing audit data in a database).


DAT=`date +%Y%m%d_%H%M%S`
mysqldump -u root -proot hive > /home/ubuntu/HDP-2.3.4-backup/hive_$DAT.sql

This is done beforehand so that you can check the checkbox and move on in the process of upgrade

Hive upgrade warning


This is done beforehand so that you can check the checkbox and move on in the process of upgrade

Ranger Admin warning


DAT=`date +%Y%m%d_%H%M%S`
mysqldump -u root -proot ranger > /home/ubuntu/HDP-2.3.4-backup/ranger_$DAT.sql


DAT=`date +%Y%m%d_%H%M%S`
mysqldump -u root -proot ranger_audit > /home/ubuntu/HDP-2.3.4-backup/ranger_audit_$DAT.sql


Content of backup folder

├── dfs-old-fsck-1.log
├── dfs-old-lsr-1.log
├── dfs-old-report-1.log
├── hive_20160804_074811.sql
├── ranger_20160804_074907.sql
├── ranger_audit_20160804_074914.sql

Click below on Page 2 to continue with the process.

Apache Spark 2.0.0 – Installation and configuration

I am running a HDP 2.4 multinode cluster with Ubuntu Trusty 14.04 on all my nodes. The Spark in this post is installed on my client node. My cluster has HDFS and YARN, among other services. All were installed from Ambari. This is not the case for Apache Spark 2.0, because Hortonworks does not offer Spark 2.0 on HDP 2.4.0

The documentation on the latest Spark version can be found here.

My notes on Spark 2.0 can be found here (if anyone finds them useful).

My post on setting up Apache Spark 1.6.0.

Update 12.January 2018: A post on how to install Apache Spark 2.2.1 Apache Spark 2.2.1 on Ubuntu 16.04 – Hadoop-less instance.



Update and upgrade the system and install Java

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y

Add JAVA_HOME in the system variables file

sudo vi /etc/environment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Spark user

Create user spark and add it to group hadoop

sudo adduser spark
sudo usermod -a -G hdfs spark

HDFS home directory for user spark

Create spark’s user folder in HDFS

sudo -u hdfs hadoop fs -mkdir -p /user/spark
sudo -u hdfs hadoop fs -chown -R spark:hdfs /user/spark

Spark installation and configuration

Install Spark

Create directory where spark directory is going to reside. Step into the directory

sudo mkdir /usr/apache
cd /usr/apache

Download Spark 2.0.0 from

sudo wget

Unpack the tar file

sudo tar -xvzf spark-2.0.0-bin-hadoop2.7.tgz

Remove the tar file after it has been unpacked

sudo rm spark-2.0.0-bin-hadoop2.7.tgz

Change the ownership of the folder and its elements

sudo chown -R spark:spark spark-2.0.0-bin-hadoop2.7

Update system variables

Step into the spark 2.0.0 directory and run pwd to get full path

cd spark-2.0.0-bin-hadoop2.7

Update the system environment file by adding SPARK_HOME and adding SPARK_HOME/bin to the PATH

sudo vi /etc/environment

export SPARK_HOME=/usr/apache/spark-2.0.0-bin-hadoop2.7

At the end of PATH add


Refresh the system environments

source /etc/environment

Log and pid directories

Create log and pid directories

sudo mkdir /var/log/spark
sudo chown spark:spark /var/log/spark
sudo -u spark mkdir $SPARK_HOME/run

Spark configuration files

Hive configuration

Create a Hive warehouse and give permissions to the users. If Hive service is set up, the path to the Hive warehouse could be /apps/hive/warehouse

sudo -u hive hadoop fs -mkdir /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod -R 777 /user/hive/warehouse

Find the hive-site.xml file – HDP versions usually have it in the following folder: /usr/hdp/current/hive-client/conf and copy it to $SPARK_HOME/conf.
The following property name in the file should be altered:
hive.metastore.warehouse.dir is replaced with spark.sql.warehouse.dir.

Spark environment file

Create a new file in under $SPARK_HOME/conf

sudo -u spark vi conf/

Add the following lines and adjust aaccordingly.

export SPARK_LOG_DIR=/var/log/spark
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_SUBMIT_OPTIONS="--jars ${SPARK_HOME}/lib/spark-csv_2.11-1.4.0.jar"

The last line serves as an example how to add external libreries to Spark. This particular package is quite common and is advised to install it. The package can be downloaded from this site.

Spark default file

Fetch HDP version

hdp-select status hadoop-client | awk '{print $3;}'

Example output for HDP 2.4:

Example output for HDP 2.5:

Create spark-defaults.conf file in $SPARK_HOME/conf

sudo -u spark vi $SPARK_HOME/conf/spark-defaults.conf

Add the following and adjust accordingly (some properties belong to Spark History Server whose configuration is explained in the post in the link below)

spark.driver.extraJavaOptions -Dhdp.version= -Dhdp.version=
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080

spark.history.kerberos.keytab none
spark.history.kerberos.principal none

spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address spark-server:18080
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3

spark.jars.packages com.databricks:spark-csv_2.11:1.4.0 lzf

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

The ports are defined in this configuration file. If they are not, then Spark assigns random ports. More on ports and assigning them in Spark can be found here.

If the ports are not under control, you risk the

Yarn application has already ended! It might have been killed or unable to launch application master.

error. More on that is written here.


Create java-opts file in $SPARK_HOME/conf and add your HDP version

sudo -u spark vi $SPARK_HOME/conf/java-opts



Fixing links in Ubuntu

Since the Hadoop distribution is Hortonworks and Spark is Apache’s, some workaround is in place. Remove the default link and create new ones.
First, the existing link is removed. Then the new link is created, pointing to the $SPARK_HOME/bin.

sudo rm /usr/hdp/current/spark-client
sudo ln -s /usr/apache/spark-2.0.0-bin-hadoop2.7 /usr/hdp/current/spark-client

Spark 2.0.0 is now almost ready.

Jersey problem

If you try to run a spark-submit command on YARN you can expect the following error message:

Exception in thread “main” java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

Jar file jersey-bundle-*.jar is not present in the $SPARK_HOME/jars. Adding it fixes this problem:

sudo -u spark wget -P $SPARK_HOME/jars

January 2017 – Update on this issue:
If the following is done, Jersey 1 will be used when starting Spark History Server and the applications in Spark History Server will not be shown. The folowing error message will be generated in the Spark History Server output file:

WARN servlet.ServletHandler: /api/v1/applications
        at org.glassfish.jersey.servlet.ServletContainer.service(

This problem occurs only when one tries to run Spark on YARN, since YARN 2.7.3 uses Jersey 1 and Spark 2.0 uses Jersey 2

One workaround is not to add the Jersey 1 jar described above but disable the YARN Timeline Service in spark-defaults.conf

spark.hadoop.yarn.timeline-service.enabled false

Spark History Server

How Spark History Server is configured and brought to life is explained here. Absolutely worth setting it up, not only because it is very useful and practical for monitoring Spark applications, but also because in Spark 2.0 the graphical interface is more user friendly and, well, more graphical.

Hive SerDe error when querying from spark-sql

If you plan to use spark-sql, it is maybe worth checking this post to avoid the jsonserde not found error message.

Notes about Spark 2.0

My Apache Spark 2.0 Notes

Manipulating files in S3 with Spark is mentioned here.

Preparing a client instance to be added to the cluster

In this post I describe how to prepare a Client node with with attached volume.

The Client and the volume are created through the WebUI on the OpenStack private cloud. The volume is attached to the Client in the WebUI, as well.
The volume is attached to device like this:

/user – /dev/vdb

After the new “soon-to-be” Client instance has been created and volume attached there is some work to be done in the command line interface:

Connect to the new client
Since the client is going to be exposed, there is a public IP address available for it.
Another option is to connect from another instance in the cluster (make sure to update /etc/hosts on all instances in the cluster).

ssh -i .ssh/key client-instance

Update, upgrade and reboot the system.

sudo apt-get update -y && sudo apt-get upgrade -y && sudo reboot

Create the directory where the data for the volume will be stored.

sudo mkdir -p /user

Format file system for the volume to be attached.

sudo mkfs.ext4 /dev/vdb

Mount the volume to the respective directory.

sudo mount /dev/vdb /user

Label the volume for easier future work.

sudo e2label /dev/vdb "user"

Open and update /etc/fstab.
This is smart to do to keep the volumes mounted to the directories after the Client is restarted.

LABEL=user /user ext4 defaults,nobootwait 0 0

Check if volumes are mounted to correct directories.

df -h

Something like this should appear:

/dev/vdb        197G        60M         187G    1%      /user

For future reference, you can check the size of all monuted folders under directory /user.

sudo du -hs /user

Something similar to this should be in the output.

20K /user

Next steps

Adding client to the cluster

Now the Client with added volume is ready to be added to the cluster. This is, to some extent, described in Adding new DataNode to the cluster using Ambari. Although the post describes how to add a DataNode, adding Client is the same except that in step Assign Slaves and Clients, the Client checkbox should be checked.

Configuring home directory

The users working on the client should have their home directory under /user. How this is achieved is described in Configuring home directory on client nodes.