Adding service to HDP using REST API

One way of adding new service to the HDP is by using a graphical interface Ambari. In this post, it is explained how the same is done using Ambari’s REST API. The service added in this post is Pig. Pig is not a “classical” service, rather a package, but from REST API’s point of view, it is a service.

The documentation on how to do this is dated April 21 2014. I have followed it and eventually made it work. The documentation can be found here.

The cluster

I am using AWS EC2 services, operating system is Centos7.

I have an Ambari server, version 2.6.2 and an HDP cluster version 2.6.5. This should work on other versions as well.

My cluster has one NameNode, on which Ambari is installed as well, and one DataNode. The services installed are the bare minimum – HDFS, YARN, MapReduce2, Zookeeper and Hive.

The goal

Install Pig client on the NameNode and on the DataNode.

Adding Pig to the cluster

Variables

export AMBARI_SERVER=PUBLIC_IP
export MASTER_DNS=NAMENODE_PRIVATE_DNS
export SLAVE_DNS=DATANODE_PRIVATE_DNS
export CLUSTER_NAME=mincluster2

Create service on the Cluster

curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST -d '{"ServiceInfo":{"service_name":"PIG"}}' 'http://'$AMBARI_SERVER':8080/api/v1/clusters/'$CLUSTER_NAME'/services'

Pig service is added to the list of services in Ambari.

Pig added - Configs
Pig added - Summary

Check for service on the cluster

curl -k -u admin:admin -H "X-Requested-By:ambari" -i -X GET 'http://'$AMBARI_SERVER':8080/api/v1/clusters/'$CLUSTER_NAME'/services/PIG'

curl - service info
The service is registered on the cluster.

Add components to the service

curl -k -u admin:admin -H "X-Requested-By:ambari" -i -X POST -d '{"RequestInfo":{"context":"Install PIG"}, "Body":{"HostRoles":{"state":"INSTALLED"}}}' 'http://'$AMBARI_SERVER':8080/api/v1/clusters/'$CLUSTER_NAME'/services/PIG/components/PIG'

Running the curl command from previous step to check if the component is added returns the following:

curl - service info after component.png
The component has been added according to the “components” element in the JSON output. The state of the service is still “UNKNOWN”.

Creating configuration is on the next page.

Bash script for creating new user in Hadoop and Ambari Views

Here is a bash script I used a couple of years ago for creating Hadoop users from CLI (or batch). It might be useful for someone.

The script does the following:

creates a Linux user
generates keys
creates home directory in HDFS
adds user to a group
allocates HDFS space quota
gives access in Ambari Views

#!/bin/bash

NEW_USER="$1"
DEPT_NAME="$2"
NAMENODE="t-namenode1"
AMBARI="t-ambari"

#
echo "Creating user "$NEW_USER

#Creating user with no password with user's folder
sudo adduser --disabled-password --gecos "" $NEW_USER

#Create Linux user on the namenode
ssh -i /home/ubuntu/.ssh/key $NAMENODE 'sudo adduser --disabled-password --gecos "" $NEW_USER && sudo chown $NEW_USER:$NEW_USER /home/$NEW_USER'

#Prepare .ssh folder
cd /user/$NEW_USER
sudo mkdir .ssh
sudo chown $NEW_USER:$NEW_USER .ssh/
sudo chmod 700 .ssh

#Create private and public key
sudo -u $NEW_USER  ssh-keygen -t rsa -f $NEW_USER-key

#Copy public key to the authorized_keys
sudo -u $NEW_USER cp $NEW_USER-key.pub .ssh/authorized_keys
sudo -u $NEW_USER chmod 600 .ssh/authorized_keys

#######HDFS
echo "Create system folder for user"
sudo -u hdfs hadoop fs -mkdir /user/$NEW_USER
echo "Change owner of the system folder"
sudo -u hdfs hadoop fs -chown $NEW_USER:hdfs /user/$NEW_USER

#Defining HDFS space quota
echo "Allocate 100g of space on HDFS for the user"
sudo -su hdfs hdfs dfsadmin -setSpaceQuota 100g /department/$DEPT_NAME/users/$NEW_USER

#Access to Ambari Views
curl -iv -u admin:admin -H "X-Requested-By: ambari" -X POST -d  '{"Users/user_name": "$USER_NAME", "Users/password":  "$USER_NAME", "Users/active": true, "Users/admin": false }' http://$AMBARI:8080/api/v1/users

#Add user to a group in Ambari Views
curl -iv -u admin:admin -H "X-Requested-By: ambari" -X POST -d '[{"MemberInfo/user_name":"$NEW_USER", "MemberInfo/group_name":"$DEPT_NAME"}]' http://$AMBARI:8080/api/v1/groups/$DEPT_NAME/members

echo "User's folder on the client:"
ls -l /user/$NEW_USER

echo "User's system folder on HDFS:"
sudo -u $HDFS hadoop fs -ls /user/$NEW_USER

Adding service Druid to HDP 2.6 stack

Druid is a “fast column-oriented distributed data store”, according to the description in Ambari. It is a new service, added in HDP 2.6. The service is Technical Preview and the version offered is 0.9.2. Druid’s website is druid.io.

!!! Hortonworks Data Platform 2.6 is needed in order to install and use Druid !!!

Hortonworks has a very intriguing three-part series on ultra fast analytics with Hive and Druid. The first blog post can be found here.

This blog post describes how Druid is added to the HDP 2.6 stack with Ambari. The documentation I used is here. According to my experience, it does not hold water. I had to make some adjustment in order to start all Druid services.

Requirements

Zookeeper: Druid requires installation of Zookeeper. This service is already installed on my cluster.
Deep storage: deep storage layer for Druid in HDP can either be HDFS or S3. Parameter “druid.storage.type” is used to define this. Installation default is HDFS.
Metadata storage: for holding information about Druid segments and tasks. MySql is my metadata storage of choice.
Batch execution engine: resource manager is YARN, execution engine is MapReduce2. Druid hadoop index tasks use MapReduce jobs for distributed ingestion of data.

All these requirements are taken care of in Ambari, most of them with a sufficient default value.

Services within Druid

Broker – interface between users and Druid’s historical and realtime nodes.
Overlord – maintain a task queue that consists of user-submitted tasks.
Coordinator – serve to assign segments to historical nodes, handle data replication, and to ensure that segments are distributed evenly across the historical nodes.
Druid Router – serve as a mechanism to route queries to multiple broker nodes.
Druid Superset – if you know Superset, you know Druid Superset – data visualization tool.

Pre-work in metadata storage

As mentioned, my metadata storage is MySql. There are some objects that have to be created manually for the Druid installation to go through.

Create druid database

CREATE DATABASE druid DEFAULT CHARACTER SET utf8;
CREATE USER 'druid'@'%' IDENTIFIED BY 'druid';
GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'%';
FLUSH PRIVILEGES;

Create superset database

The superset objects in the database have to be created even though the documentation does not mention this. The installation will not go through unless it can connect to superset database to create tables in superset schema.

CREATE DATABASE superset DEFAULT CHARACTER SET utf8;
CREATE USER 'superset'@'%' IDENTIFIED BY 'druid';
GRANT ALL PRIVILEGES ON superset.* TO 'superset'@'%';
FLUSH PRIVILEGES;

Adding service

In Ambari, click on Add Service and check Druid service.

add service druid

In the next step, you are asked to define which Druid service is going to be installed on which node in the cluster. Remember, you can always move/add services.

assign masters to nodes

The Broker is on the Client node, since that service is the gateway to external world.

In the next step – Assigning Slaves and Clients – the following two needs to be defined where they will be installed:

Druid Historical: Loads data segments.
Druid MiddleManager: Runs Druid indexing tasks.

Generally you should select Druid Historical and Druid MiddleManager for multiple nodes. Both services are on namenode to begin with.

Next step are settings. There are some passwords and MySql server that needs to be defined. Secret key is also something one needs to define. A random string of characters would do the trick.

Be sure to create the objects in the MySql before you proceed with the installation.

installation settings

!!! Superset Database port should be 3306, just like Metadata storage port.

The advanced tab (picture above) is mostly for the superset parameters – entering name, email and password is needed to proceed with the installation. This is later on used in the visualization tool Superset.

Once you click OK, you are asked to doublecheck and change some recommended values. The following ones are related to Druid installation and should be checked to accept the recommended values.

dependency configuration.jpg

In the Review step, check if everything is as it should be and click Deploy.

After the installation completes all Druid services should be up and running. If there is the need to restart any services, do so.

Tweaking MapReduce2

There is one detail not mentioned in Hortonworks documentation when Druid is installed. There are two parameters in MapReduce2 that have to be tweaked in order for Druid to successfully load data. Explanation is at the bottom.

The parameters are:

mapreduce.map.java.opts
mapreduce.reduce.java.opts

The following should be added at the end of the existing values:

-Duser.timezone=UTC -Dfile.encoding=UTF-8

How it looks in Ambari:

map java heap size parameter reduce java heap size parameter

The service MapReduce2 should now be restarted.

Explanation

Various error messages occur in the Druid Console log files when the Druid job start to load the data. The error messages vary depending on the data, but generally, they do not provide any useful information.
From my experience, one error had a problem with the first line in a valid csv file, while in another example, the error was that no data can be indexed (code below).

Caused by: java.lang.RuntimeException: No buckets?? seems there is no data to index.
	at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:176) ~[druid-indexing-hadoop-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	at io.druid.indexer.JobHelper.runJobs(JobHelper.java:349) ~[druid-indexing-hadoop-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_111]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_111]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_111]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_111]
	at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	... 7 more

Upgrading Ambari 2.4 to 2.5

This post describes how an upgrade from Ambari 2.4.1.0 to 2.5 has been done. The reason for that is to be able to further upgrade HDP to 2.6. Upgrade of HDP from 2.5 to 2.6 is described here.

Ambari Server is installed on Ubuntu 14.04. The same OS is used across the whole HDP cluster.

The following services are upgraded using this blog post:

Ambari Server
Ambari Agent
Ambari Infra
Ambari Metrics
Ambari Collector
Grafana

Backup

It is important to do a database backup of the Ambari database. Metadata for my Ambari is stored in MySql database.

Create a directory for backup

mkdir /home/ubuntu/ambari24-backup

Backup the database (enter password when prompted)

mysqldump -u ambari -p ambari_db > /home/ubuntu/ambari24-backup/ambari.mysql

Make a safe copy of the Ambari Server configuration file

sudo cp /etc/ambari-server/conf/ambari.properties ambari24-backup/

Prepare for installation of Ambari Agent and Server

Stop Ambari Metrics from the Ambari Web UI

Stop Ambari Server on Ambari Server instance

sudo ambari-server stop

Stop all Ambari Agents on all instances in the cluster where it is running

sudo ambari-agent stop

On all instances running Ambari Server or Ambari Agent do the following

sudo mv /etc/apt/sources.list.d/ambari.list db-backups/
sudo wget -nv http://public-repo-1.hortonworks.com/ambari/ubuntu14/2.x/updates/2.5.0.3/ambari.list -O /etc/apt/sources.list.d/ambari.list

Upgrade Ambari Server

sudo apt-get clean all
sudo apt-get update -y
sudo apt-cache show ambari-server | grep Version

The last command should output something like this

Version: 2.5.0.3-7
Version: 2.4.1.0-22

This means version 2.5 is available, Ambari Server can be installed

Install Ambari Server

sudo apt-get install ambari-server

Some lines from the output

The following packages will be upgraded:
  ambari-server

Unpacking ambari-server (2.5.0.3-7) over (2.4.1.0-22) ...

Setting up ambari-server (2.5.0.3-7) ...

Confirm that there is only one ambari server jar file

ll /usr/lib/ambari-server/ambari-server*jar

Output

-rw-r--r-- 1 root root 5806966 Apr  2 23:33 /usr/lib/ambari-server/ambari-server-2.5.0.3.7.jar

Install Ambari Agent

On each host running Ambari agent

sudo apt-get update
sudo apt-get install ambari-agent

Check if the Ambari agent install was a success

dpkg -l ambari-agent

Output from one node

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                         Version        Architecture    Description
+++-============================-==============-===============-=================
ii  ambari-agent                 2.5.0.3-7      amd64           Ambari Agent

Upgrade Ambari DB schema

On Ambari Server instance, run the following command

sudo ambari-server upgrade

The following question shows up. The backup has been done at the beginning. Type y and press Enter.

Ambari Server configured for MySQL. Confirm you have made a backup of the Ambari Server database [y/n] (y)?

Output

INFO: Upgrading database schema
INFO: Return code from schema upgrade command, retcode = 0
INFO: Schema upgrade completed
Adjusting ambari-server permissions and ownership...
Ambari Server 'upgrade' completed successfully.

Start the services

Start Ambari Server

sudo ambari-server start

Start Ambari Agent on all instances where it is installed

sudo ambari-agent start

Post-installation tasks

Hive and Oozie (which I have installed in HDP) are using MySql, so I have to put the jar file in place

sudo ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar

Installing Ambari Infra for enabling Ranger Audit Access

About the key services mentioned in this post:
Apache Solr – an open-source enterprise search platform. Ranger is using it to store audit logs.
Ambari Infra – core shared service used by Ambari managed components. The database is Solr.

Using a database for Audit Access in Ranger is not an option anymore with HDP 2.5. What is being offered now is Solr and HDFS. It is recommended that Ranger audits are written to Solr and HDFS.
Solr takes care of the search queries from th Ranger Web interface, while HDFS is for more persistent storing of audits.

This was done on an HDP 2.5 cluster on AWS.

Installing Ambari Infra

Even though the HDP’s documentation says Solr should be installed before Ranger, I installed Ranger service first because of my previous Ranger experience when I used MySql for audit logs.

So installing Ambari Infra is really a clicking job. The only thing to check is where the service is going to be installed. I installed it on NameNode. Remember, it is easy to move services from on node to another.

Configuring Ranger with Solr

Click on Ranger and click on Configs -> Ranger Audit. From there Turn on Audit to Solr and SolrCloud.

You should now have enabled both Solr and HDFS for collecting audit logs.

If you now log in to Ranger, you should see audit logs.

If you plan to build an application in Solr, do not use the solr that is intended for Ambari Infra but install Solr.

Very useful documentation on this topic is available here.

Adding Solr via Ambari

Solr is an open source search engine service. The service is a part of the Hortonworks Data Platform and prior to installing it via Ambari, the service should be added (Zeppelin Notebook went through the same in HDP 2.4 – it had to be added manually before installing). Here is the link to the post about how to add Solr to the list of services.

Once the Solr service is available on the Add Services list, it can be installed. It is a simple process – click next a few times, deploy and that is it. Well, not really.

The following error message will appear

error while installing.JPG

This documentation at the bottom states the following:

In the case of the Java preinstall check failing, the easiest remediation is to login to each machine that Solr will be installed on, temporarily set the JAVA_HOME environmental variable, then then use yum/zypper/apt-get to install the package. For example on CentOS:

export JAVA_HOME=/usr/jdk64/jdk1.8.0_77
yum install lucidworks-hdpsearch

The Datanodes did not find any Java so I installed Java

sudo add-apt-repository ppa:openjdk-r/ppa && sudo apt-get update && sudo apt-get install openjdk-8-jdk -y

Added JAVA_HOME to /env/environment

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

And I installed the packages on every DataNode.

cd /etc/apt/sources.list.d
sudo wget http://public-repo-1.hortonworks.com/HDP-SOLR-2.5-100/repos/ubuntu14/hdp-solr.list
sudo apt-get update -y
sudo apt-get install lucidworks-hdpsearch

Do not forget to do the same on client nodes and Namenode(s).

When the install is through and if all went well, the following output is given

Result on all nodes:

====
Package lucidworks-hdpsearch was installed
====

The NameNode challenge

The previous steps worked on all nodes except on NameNode! I figured out th

echo $JAVA_HOME

NameNode show the following

/usr/jdk64/jdk1.8.0_77

DataNodes and Client gives me this

/usr/lib/jvm/java-8-openjdk-amd64

So I installed Java on NameNode as described above (I did not remove any versions) and ran the following commands (same as above)

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
cd /etc/apt/sources.list.d
sudo wget http://public-repo-1.hortonworks.com/HDP-SOLR-2.5-100/repos/ubuntu14/hdp-solr.list
sudo apt-get update -y
sudo apt-get install lucidworks-hdpsearch

Output was

====
Package lucidworks-hdpsearch was installed
====

I went to Ambari, I deleted the Solr with Install Fail status, installed it again and it was a success!

I have no explanation why (or if) the issue was in the Java version.

Adding Solr service to list of services in Ambari

The goal in this post is to add Solr service to the Ambari Add Services list.

Operating system is Ubuntu 14.04 on AWS. I am using Ambari 2.4.1.0, Hortonworks Data Platform version is 2.5.0.0.

Ssh to Ambari server and step into /tmp directory

cd /tmp

Download the Solr package

wget http://public-repo-1.hortonworks.com/HDP-SOLR/hdp-solr-ambari-mp/solr-service-mpack-5.5.2.2.5.tar.gz

Install the package

sudo ambari-server install-mpack --mpack=/tmp/solr-service-mpack-5.5.2.2.5.tar.gz

Output

Using python  /usr/bin/python
Installing management pack
Ambari Server 'install-mpack' completed successfully.

Create definition for the HDP Search repository

sudo vi /var/lib/ambari-server/resources/stacks/HDP/2.5/repos/repoinfo.xml

Find your os and add to it repo for HDP-SOLR (below is example for Ubuntu 14)

  <os family="ubuntu14">
    <repo>
      <baseurl>http://public-repo-1.hortonworks.com/HDP/ubuntu14/2.x/updates/2.5.0.0</baseurl>
      <repoid>HDP-2.5</repoid>
      <reponame>HDP</reponame>
    </repo>
    <repo>
      <baseurl>http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/ubuntu12</baseurl>
      <repoid>HDP-UTILS-1.1.0.21</repoid>
      <reponame>HDP-UTILS</reponame>
    </repo>
    <repo>
      <baseurl>http://public-repo-1.hortonworks.com/HDP-SOLR-2.5-100/repos/ubuntu14</baseurl>
      <repoid>HDP-SOLR-2.5-100</repoid>
      <reponame>HDP-SOLR</reponame>
    </repo>
  </os>

List of other systems.

 sudo ambari-server restart

Solr can now be installed via Ambari. This is explained in this post.

My Work cluster in detail

The cluster was built on OpenStack private cloud owned by a Swiss organization Switch.

The Hadoop distributor was Hortonworks, except for Spark and Zeppelin, who were Apache’s.

Potential users

Since the project owner was an organization supporting educational entities in Switzerland, the potential users were researchers, scientists, students…

I had the luxury of having almost unlimited resources on the infrastructure so I have built 5 Hadoop clusters – 4 were Hortonworks Hadoop clusters, one was Apache Hadoop cluster. Out of the 4, one was the Work environment which was exposed to the end users. And this is the cluster that is described in detail in this post.
Keep in mind that I was working on my own on this development – which meant administering and upgrading 5 clusters and doing data science at the same time. In order to make it work, I had to use the YARN inside me and distribute the limited resources effectively.

Initial resources

Keeping in mind the point of distributed systems is scalability, I have defined the initial cluster with the following capabilities.
6 instances with corresponding details:

Ambari Server
NameNode
DataNode (3)
Client

Instance	RAM	VCPU	Default disk size	Volume No.	Volume Size	Security group
Ambari	8GB	8 VCPU	20GB	None	None	sg-ambari
NameNode	32GB	8 VCPU	20GB	1	200GB	sg-namenode
DataNode (3)	32GB	8 VCPU	20GB	3	200GB	sg-datanode
Client	16GB	16 VCPU	20GB	1	500GB	sg-client

Note: There were three DataNodes in the initial cluster.

Characteristics of the cluster

The initial cluster had 1.7 TB HDFS, replication factor was 3, block size was the default 128MB. Rack awareness was not set in the initial cluster and the queue was the default.
On the YARN side, I have made some changes and had 84GB RAM (3 x 32GM = 96GB RAM. 4GB per DataNode was left for services on the instance -> 96GB – 12GB = 84GB) as maximum amount of RAM resources for the cluster – the default values by Apache (Hortonworks?) are quite more conservative.

In the cluster building process the versions were Ambari 2.1 and HDP 2.3. When Ambari 2.2 and HDP 2.4 were available, the cluster was upgraded.

Ambari

Ambari had a server for itself, the database for collecting statistics was MySql. The idea was always to migrate the Ambari Server if needed. The migration to new Ambari server is easy so I could afford to start small for this service.
The Ambari Views was enabled for the users who wanted to upload the files to the HDFS manually. Hive was also available through this service and I on my one of my test environments, I have even embedded Zeppelin in Ambari Views. Though, on the Work cluster, Zeppelin was offered only as an independent service on the Client.
All the ports for Ambari to properly work were in the sg-ambari security group.

NameNode

The initial plan with the NameNode was to run all the services on it except Spark and Zeppelin. When the resources would expand beyond the instance’s capabilities, some services would be moved to a new instance, or unused services would be stopped (experience showed Hive had little popularity among the academia). Using Ambari, migrating services is an easy process, I could afford to have all services running on one NameNode. Only cluster administrator had access to this instance. With other words, client tools were not installed on this instance.
All the ports for the NameNode to properly work were in the sg-namenode security group.

DataNode

I started with 3 DataNodes, which offered 1.7TB of storage on the HDFS. The DataNodes were also used as Workers for Spark and Supervisors for Storm. The users had no access to the DataNodes directly – no client was installed here. This would change according to the needs so that some jobs could access data directly locally.
All the ports for the DataNodes to properly work were in the sg-datanode security group.

Client

The client was users’ window to the cluster. Spark 2.0 (before Summer 2016 it was Spark 1.6) was offered as the computational engine – only one. One reason was also easier administration and optimizatoin from my side.
The users could use the command line interface (CLI), RStudio or Zeppelin. Ambari Views as well, but that was running on Ambari instance. More advanced users went with the CLI, users who wanted to learn Spark were using Zeppelin.
Client for Storm was also installed on this instance. Due to more complex programming (in Java), all the Topologies were handled by me, the users were defining requirements and using the data stored by the Storm.
All the ports for the Client to properly work were in the sg-datanode security group.

See below for page 2.

About Storm’s Nimbus

This post describes Nimbus and shows how its use with single Nimbus in Storm cluster, as well as Nimbus H/A.

I have a Hadoop cluster installed using Ambari. The distribution is Hortonworks. Storm installation with Ambari is described here

A basic example of Storm topology – writing to HDFS can be seen here. Might be smart to submit one topology first in orderto easier understand the terms like Bolt, Supervisor, Nimbus…

About Nimbus

Nimbus is the master node in Storm cluster, it is the NameNode to your Hadoop.

Responsibilities:

distributing code to Supervisors
assigning tasks
monitoring tasks
restarting tasks when needed

Thrift

Thrift is a member of Apache family. It is a software framework (binary protocol) used for scalable cross language communication. Nimbus is a thrift service, and wide use of thrift in Storm allows users to define and submit topologies from any language.
Nimbus thrift API exposes all the information needed to monitor he Storm cluster.

ZooKeeper’s role

Nimbus stores all of its data in ZooKeeper. It is fail-fast (like Supervisor), so if Nimbus dies, the restart has no effect on the running tasks on the Supervisors.

Nimbus and Supervisors communicate through Zookeeper. This means that all data is stored in Zookeeper.

Submitting Topology in Storm Cluster

From the Storm client, the topology is submitted first to the Nimbus and Nimbus distributes it further to the Supervisors.

Single Nimbus in Storm Cluster

The only Nimbus in the Storm cluster is installed on the Hadoop NameNode.
If the Nimbus is not running, Storm UI (on port 8744) returns the following error message

java.lang.RuntimeException: Could not find leader nimbus from seed hosts ["nimbus-server1"]. Did you specify a valid list of nimbus hosts for config nimbus.seeds

Start Nimbus service from Ambari.

The Storm UI, under Nimbus Summary shows one host. It’s default port is 6627 and the status is “Leader”.
I am running one simple test Topology RandomWordsHdfsTopology and the log on the Supervisor executing the Bolt is showing me lines in the following manner:

2016-10-08 11:13:37.885 b.s.d.executor [INFO] Execute done TUPLE source: random-words-spout:5, stream: default, id: {}, [Spark] TASK: 4 DELTA:
2016-10-08 11:13:37.986 b.s.d.executor [INFO] Processing received message FOR 4 TUPLE: source: random-words-spout:5, stream: default, id: {}, [Hadoop]
2016-10-08 11:13:37.986 b.s.d.executor [INFO] BOLT ack TASK: 4 TIME:  TUPLE: source: random-words-spout:5, stream: default, id: {}, [Hadoop]
2016-10-08 11:13:37.986 b.s.d.executor [INFO] Execute done TUPLE source: random-words-spout:5, stream: default, id: {}, [Hadoop] TASK: 4 DELTA:
2016-10-08 11:13:38.087 b.s.d.executor [INFO] Processing received message FOR 4 TUPLE: source: random-words-spout:5, stream: default, id: {}, [Kafka]
2016-10-08 11:13:38.088 b.s.d.executor [INFO] BOLT ack TASK: 4 TIME:  TUPLE: source: random-words-spout:5, stream: default, id: {}, [Kafka]
2016-10-08 11:13:38.088 b.s.d.executor [INFO] Execute done TUPLE source: random-words-spout:5, stream: default, id: {}, [Kafka] TASK: 4 DELTA:
2016-10-08 11:13:38.188 b.s.d.executor [INFO] Processing received message FOR 4 TUPLE: source: random-words-spout:5, stream: default, id: {}, [Storm]
2016-10-08 11:13:38.189 b.s.d.executor [INFO] BOLT ack TASK: 4 TIME: 0 TUPLE: source: random-words-spout:5, stream: default, id: {}, [Storm]

And the random words are being written to a file in HDFS.

If the Nimbus shuts down, Zookeeper and Supervisor continue running the Topology. In this case, the log file on the Supervisor keeps logging random words and the file in HDFS continues to be appended. The Storm UI shows the error message posted above and running

storm list

from the Storm client machine returns the same error message.

Starting the Nimbus again and looking at the $STORM_LOGS/nimbus.log on nimbus-server1 teaches us how Nimbus reacts upon restart.
Some lines taken from the log file:

b.s.zookeeper [INFO] nimbus-server1 gained leadership, checking if it has all the topology code locally.
b.s.zookeeper [INFO] active-topology-ids [RandomWordsHdfsTopology-1-1475917797] local-topology-ids [RandomWordsHdfsTopology-1-1475917797] diff-topology []
b.s.zookeeper [INFO] Accepting leadership, all active topology found localy.
b.s.d.nimbus [INFO] Starting Nimbus server...
...
b.s.zookeeper [INFO] Accepting leadership, all active topology found localy.

With other words, the active Topology did not suffer from Nimbus downtime. With Nimbus down, nonew Topologies can be submitted and existing ones cannot be manipulated.

Multiple Nimbus in Storm Cluster

Adding another Nimbus for Nimbus High Availability is simple in Ambari.
The second Nimbus is added on the Client node of the cluster. After it is added and the Storm service restarted, the Storm UI, under Nimbus Summary shows two Nimbus hosts one being Leader and one having status “Not a Leader”.

The client-server2, which has “Not a Leader” Nimbus reveals the following lines in the nimbus.log file:

...
b.s.d.nimbus [INFO] not a leader, skipping cleanup-corrupt-topologies
b.s.d.nimbus [INFO] Starting Nimbus server...
b.s.d.nimbus [INFO] not a leader, skipping assignments
b.s.d.nimbus [INFO] not a leader, skipping cleanup
b.s.d.nimbus [INFO] not a leader skipping , credential renweal.
...
b.s.d.nimbus [INFO] missing topology RandomWordsHdfsTopology-1-1475917797 has state on zookeeper but doesn't have a local dir on this host.
...
b.s.d.nimbus [INFO] trying to download missing topology code from NimbusInfo{host='nimbus-server1', port=6627, isLeader=false}

The “Not a Leader” Nimbus is now updated with the Storm CLuster and its topologies. Now the leader Nimbus is stopped:

b.s.zookeeper [INFO] client-server2 gained leadership, checking if it has all the topology code locally.
b.s.zookeeper [INFO] active-topology-ids [RandomWordsHdfsTopology-1-1475917797] local-topology-ids [RandomWordsHdfsTopology-1-1475917797] diff-topology []
b.s.zookeeper [INFO] Accepting leadership, all active topology found localy.

The Nimbus on client-server2 takes over as the Leader and Nimbus on the nimbus-server1 has status “Offline”.

When multiple Nimbus services are up and running, the “Leader” status is being switched between them. Roughly, this goes on every couple of minutes.

Nimbus has a vital role in the Storm Cluster and it is naive to think as long as Topology is running, I do not need Nimbus.

Installing and configuring Storm in Ambari

About Storm

Storm is a free and open source distributed real-time computation system.
Storm cluster follows master-slave model and Zookeeper is used for coordination. All data is stored in ZooKeeper.
The basic unit of data processed by Storm is tuple. Tuple consists of predefined list of fields.

Storm cluster on Hadoop

The following graphic explains the architecture one ends up with after following this post. In black text, the Hadoop nodes are shown, in blue text, Storm nodes are shown.

6 nodes in the cluster. One is dedicated NameNode, one is Client and four are DataNodes.
6 nodes in the cluster. One is dedicated to Nimbus, DRPC Server and Storm UI Server, one is Storm Client and four are Supervisors.

storm-architecture

Ports

Make sure you open the following ports:
Node where Nimbus (master) is (are) installed: 2181, 6627.
Nodes where Supervisors (slaves) will be installed: 6700, 6701 (and so on, depending on the number of workers per supervisor).
Default Storm UI Server port is 8744, open the port on the node where this service is installed.

Adding Service in Ambari

Add Service Service

Select the Storm service:
01-pic-storm-service

Click Next.

Assign Masters

02-assign-masters

Nimbus is the master, responsible for distributing code across worker nodes, assigning tasks, monitoring tasks for any failures and restarting them when required. Nimbus and slaves communicate through ZooKeeper.

Click Next.

Assign Slaves and Clients

Check Supervisors on all datanodes you wish to use as supervisors.
Supervisor nodes are worker nodes.

Click Next.

Customize Services

Define ports on supervisors. One port per worker. By defining the ports one basically defines how many workers per supervisor will run.

03-supervisor-slots-port

Leave the default ports for now.

Review

If everything is ok, Click Deploy.

Install, Start, Test

When the installation is complete, click Next.

Restart Required

Restart HDFS, MapReduce2, YARN and Hive. Ambari reminds you about that. The Storm Web UI should now be available on the server where Storm UI Server is installed and on port 8874.

Adding Nimbus

Adding Nimbus is quite straightforward.

In Ambari, click on service Storm.

On the right side, there is a menu Service Actions. Click on it and select Add Nimbus.

Choose the host to add Nimbus component. In this case, I am adding a Nimbus to mz client node in the cluster.

adding-nimbus-select-node

Click OK on the confirmation box

adding-nimbus-confirmation

The Nimbus is now installed. On two instances – client and NameNode.

Restart of the Storm service is needed to make the second Nimbus part of services. The newly added Nimbus has status “Not a Leader”, while the primary Nimbus has status “Leader”.

web-ui-nimbus-summary

Storm client? Yes, with a small workaround

Since I am not implementing High Availability for Storm, there is no need for two Nimbuses. The reason I added one Nimbus to the client is to get Storm client on it.

So if I remove the Nimbus from the client node, the Storm packages remain and potential Storm users can access the Storm service from the client – just like any other services in the cluster.

I can remove the Nimbus from the client just like any other service in Ambari – I stop the service and delete it.

The storm.yaml on the Client will be used when uploading the topologies and at the moment, the property nimbus.seeds has 2 properties – client FQDN and NameNode FQDN – each for one Nimbus location. The upload will still work, but if the non-existing Nimbus server is checked first, it will return an error and look for the next Nimbus server on the list.

Overview over Storm in Ambari

The summary in Ambari reveals the following picture:

summary

One Nimbus (master), 4 Supervisors (slaves) and 8 slots (4 Supervisors x 2 ports, one for each worker on each Supervisor).

Learning about Storm

I have taken the Udacity course Real-Time Analytics with Apache Storm by Twitter. Great course! Very well explained and besides learning about Storm, I also became familiar with in-memory database Redis.

My topology

I have a test topology running which takes in tweets and “bolts” them in the following storages:

pushes raw JSON files directly to HDFS
creates tuples (user-tweet), does data cleansing and pushes them in Redis
pushes information about user, tweet, date to MySql

I keep upgrading and improving my Topology.

Further work

Working with Trident
Checking how Spark Streaming can compete with Storm
Testing Apache Samza to find out why LinkedIn was not happy with Storm and decided to develop Samza

Now we can start playing with Storm! Here is an example of Storm topology that takes random words and pushes them into HDFS.