Apache Hadoop 3 as a Service on AWS

An Apache Hadoop 3.1 cluster built from the CLI. The link to the GitHub repository is below.

The general idea is to have a solution that builds an Apache Hadoop 3 cluster from the command line. This can be useful for learning or testing purposes, or for spinning up a Hadoop cluster for a certain job and then terminating it, hence minimizing costs.

Motivation

A couple of years ago I listened to a Spark Summit conference where one company introduced the following architectural solution: the data sat in S3 and, when there was a need for analysis, a Hadoop cluster was created, the data was pushed to HDFS and the analyses were done. After the results were collected, the Hadoop cluster was terminated.

About

The code has no exception handling and it uses AWS’s t2.micro instances to prove the point. There is a lot of potential in building a friendly user interface to parametrize the solution; for now there is only one input parameter – the number of datanodes. When using AWS’s free-tier instances, make sure you do not have more than 20 of them running.

There are four files:

  • HaaS.sh
  • script_namenode.sh
  • script_datanode.sh
  • terminate_cluster.sh

The HaaS.sh file launches the instances for the namenode and the datanode(s) (the namenode instance is dedicated to namenode-related services – no datanode services are installed there). It is advised to start at least one datanode. An example of how to launch a cluster with 5 datanodes:

. HaaS.sh 5

When the EC2 instance for the namenode is ready, script_namenode.sh is executed on that instance. When the EC2 instance(s) for the datanode(s) are ready, script_datanode.sh is executed on them.

Prerequisites

I have defined one instance as the “Initial” instance. This is where the scripts are located; this instance creates and terminates the cluster but is not itself part of it. I am using Ubuntu 16.04 for all my instances. Make sure you have the awscli package installed and aws configured on this initial instance.
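On Ubuntu 16.04 that can be done roughly like this (assuming the awscli package from the Ubuntu repositories is recent enough for your needs):

sudo apt-get install awscli -y
# interactive: enter access key, secret key, default region and output format
aws configure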

Prerequisites on AWS

  • key pair
  • security group (a CLI sketch of the rules follows this list)
    • open all traffic for all instances in the same subnet and security group
    • open port 9870 for Namenode Web Interface
    • open port 8088 for Resource Manager (YARN)
    • open port 19888 for MapReduce JobHistory server
  • subnet
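The security group rules above might be created from the CLI roughly like this; sg-xxxxxxxx and the CIDR range are placeholders, and the exact rules should match your own setup:

# open the web UI ports to a chosen address range (placeholder CIDR)
for PORT in 9870 8088 19888; do
  aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port $PORT --cidr 203.0.113.0/24
done

# allow all traffic between instances that share this security group
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol -1 --source-group sg-xxxxxxxx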

Times

Launching a Hadoop cluster with 10 datanodes took less than 10 minutes. When testing, I also got it down to 8 minutes. I am using the sleep command in the HaaS.sh script in order to wait for the instances to start running or for Hadoop to download and unpack. There is room for optimization here as well.
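One possible optimization, instead of fixed sleep times, is to use the AWS CLI waiters to block only as long as needed, roughly like this (assuming instances.list holds all launched instance ids):

# returns as soon as every listed instance reports the running state
aws ec2 wait instance-running --instance-ids $(cat instances.list)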

Order of execution

The HaaS.sh script does the following actions (a rough sketch of the first few steps follows the list):

  • launch namenode instance and read output text into a variable
  • parse the variable to collect instance id and private ip
  • create instances.list and add namenode instance id to it
  • append private ip and instance name to /etc/hosts
  • enable passwordless ssh to namenode
  • launch datanode(s)
  • update local /etc/hosts
  • create workers file
  • enable passwordless ssh to datanode(s)
  • start services on datanode(s)
  • copy /etc/hosts from initial instance to all Hadoop instances
  • copy workers file to namenode’s $HADOOP_HOME/etc/hadoop
  • start services on datanode(s)
  • remove temporary files
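A minimal sketch of the first few steps, assuming plain AWS CLI calls; the AMI, key, security group and subnet ids are placeholders and the real script may parse the output differently:

# launch the namenode instance and read instance id and private ip into variables
read -r NAMENODE_ID NAMENODE_IP < <(aws ec2 run-instances \
  --image-id ami-xxxxxxxx --instance-type t2.micro --count 1 \
  --key-name my-key --security-group-ids sg-xxxxxxxx --subnet-id subnet-xxxxxxxx \
  --query 'Instances[0].[InstanceId,PrivateIpAddress]' --output text)

# remember the instance id and register the private ip under the name namenode
echo "$NAMENODE_ID" > instances.list
echo "$NAMENODE_IP namenode" | sudo tee -a /etc/hosts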

 

Link to the scripts can be found here.


Passwordless ssh between two AWS instances

Hadoop clusters require passwordless ssh between nodes for proper communication.

This is all done on the instance you wish to connect FROM!

The recipe for how I made passwordless ssh work between two instances is the following:

  • create ec2 instances – they should be in the same subnet and have the same security group
  • Open ports between them – make sure the instances can communicate with each other. Use the default security group, which has one rule relevant for this case:
    • Type: All Traffic
    • Source: Custom – id of the security group
  • Log in to the instance you want to connect from to the other instance
  • Run:
    ssh-keygen -t rsa -N "" -f /home/ubuntu/.ssh/id_rsa
    

    to generate a new rsa key.

  • Copy your private AWS key as ~/.ssh/my.key (or whatever name you want to use)
  • Make sure you change the permission to 600
chmod 600 .ssh/my.key
  • Copy the public key to the instance you wish to connect to without a password
cat ~/.ssh/id_rsa.pub | ssh -i ~/.ssh/my.key ubuntu@10.0.0.X "cat >> ~/.ssh/authorized_keys"

If you test the passwordless ssh to the other machine, it should work.

ssh 10.0.0.X
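To verify that the key is really being used, a non-interactive check is handy – it fails instead of silently falling back to a password prompt:

ssh -o BatchMode=yes ubuntu@10.0.0.X hostname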

Bash script for creating new user in Hadoop and Ambari Views

Here is a bash script I used a couple of years ago for creating Hadoop users from CLI (or batch). It might be useful for someone.

The script does the following:

  • creates a Linux user
  • generates keys
  • creates home directory in HDFS
  • adds user to a group
  • allocates HDFS space quota
  • gives access in Ambari Views
#!/bin/bash

NEW_USER="$1"
DEPT_NAME="$2"
NAMENODE="t-namenode1"
AMBARI="t-ambari"

#
echo "Creating user "$NEW_USER

#Creating user with no password with user's folder
sudo adduser --disabled-password --gecos "" $NEW_USER

#Create Linux user on the namenode
ssh -i /home/ubuntu/.ssh/key $NAMENODE "sudo adduser --disabled-password --gecos '' $NEW_USER && sudo chown $NEW_USER:$NEW_USER /home/$NEW_USER"

#Prepare .ssh folder
cd /home/$NEW_USER
sudo mkdir .ssh
sudo chown $NEW_USER:$NEW_USER .ssh/
sudo chmod 700 .ssh

#Create private and public key
sudo -u $NEW_USER ssh-keygen -t rsa -N "" -f $NEW_USER-key

#Copy public key to the authorized_keys
sudo -u $NEW_USER cp $NEW_USER-key.pub .ssh/authorized_keys
sudo -u $NEW_USER chmod 600 .ssh/authorized_keys

#######HDFS
echo "Create system folder for user"
sudo -u hdfs hadoop fs -mkdir /user/$NEW_USER
echo "Change owner of the system folder"
sudo -u hdfs hadoop fs -chown $NEW_USER:hdfs /user/$NEW_USER

#Defining HDFS space quota
echo "Allocate 100g of space on HDFS for the user"
sudo -u hdfs hdfs dfsadmin -setSpaceQuota 100g /department/$DEPT_NAME/users/$NEW_USER

#Access to Ambari Views
curl -iv -u admin:admin -H "X-Requested-By: ambari" -X POST -d "{\"Users/user_name\": \"$NEW_USER\", \"Users/password\": \"$NEW_USER\", \"Users/active\": true, \"Users/admin\": false}" http://$AMBARI:8080/api/v1/users

#Add user to a group in Ambari Views
curl -iv -u admin:admin -H "X-Requested-By: ambari" -X POST -d "[{\"MemberInfo/user_name\":\"$NEW_USER\", \"MemberInfo/group_name\":\"$DEPT_NAME\"}]" http://$AMBARI:8080/api/v1/groups/$DEPT_NAME/members

echo "User's folder on the client:"
ls -l /user/$NEW_USER

echo "User's system folder on HDFS:"
sudo -u hdfs hadoop fs -ls /user/$NEW_USER
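A hypothetical usage example, assuming the script is saved as create_hadoop_user.sh and made executable (the user and department names are made up):

# first argument: new user, second argument: department/group
./create_hadoop_user.sh jdoe analytics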

 

Running Eclipse Scala IDE and Java 9 on Windows 10

For working with Scala on Windows 10 I use Scala IDE, built on the Eclipse SDK. The build id is 4.7.1.

I have installed Java 9 and when I wanted to run Scala IDE I got an error message saying I should check the log file C:\marko\workspace\.metadata\.log.
The error message was

!ENTRY org.eclipse.osgi 4 0 2018-01-27 18:28:41.327
!MESSAGE Application error
!STACK 1
org.eclipse.e4.core.di.InjectionException: java.lang.NoClassDefFoundError: javax/annotation/PostConstruct
	at org.eclipse.e4.core.internal.di.InjectorImpl.internalMake(InjectorImpl.java:410)
	at org.eclipse.e4.core.internal.di.InjectorImpl.make(InjectorImpl.java:318)
	at org.eclipse.e4.core.contexts.ContextInjectionFactory.make(ContextInjectionFactory.java:162)
	at org.eclipse.e4.ui.internal.workbench.swt.E4Application.createDefaultHeadlessContext(E4Application.java:491)
	at org.eclipse.e4.ui.internal.workbench.swt.E4Application.createDefaultContext(E4Application.java:505)
	at org.eclipse.e4.ui.internal.workbench.swt.E4Application.createE4Workbench(E4Application.java:204)
	at org.eclipse.ui.internal.Workbench.lambda$3(Workbench.java:614)
	at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:336)
	at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:594)
	at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:148)
	at org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:151)
	at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
	at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:134)
	at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:104)
	at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:388)
	at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:243)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:653)
	at org.eclipse.equinox.launcher.Main.basicRun(Main.java:590)
	at org.eclipse.equinox.launcher.Main.run(Main.java:1499)
	at org.eclipse.equinox.launcher.Main.main(Main.java:1472)
Caused by: java.lang.NoClassDefFoundError: javax/annotation/PostConstruct
	at org.eclipse.e4.core.internal.di.InjectorImpl.inject(InjectorImpl.java:124)
	at org.eclipse.e4.core.internal.di.InjectorImpl.internalMake(InjectorImpl.java:399)
	... 23 more
Caused by: java.lang.ClassNotFoundException: javax.annotation.PostConstruct cannot be found by org.eclipse.e4.core.di_1.6.100.v20170421-1418
	at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:433)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:395)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:387)
	at org.eclipse.osgi.internal.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:150)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	... 25 more

After some googling I found out I had to make some changes to the eclipse.ini file. The following snippet shows my eclipse.ini file with the changes. Scala IDE now starts the way it should.

--launcher.appendVmargs
-vm
C:\Program Files\Java\jdk-9.0.1\bin\javaw.exe
-startup
plugins/org.eclipse.equinox.launcher_1.4.0.v20161219-1356.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.500.v20170531-1133
-vmargs
-Xmx2G
-Xms200m
-XX:MaxPermSize=384m
-Dosgi.requiredJavaVersion=1.8
--add-modules=ALL-SYSTEM

The -vm entry pointing to the Java 9 javaw.exe and the --add-modules=ALL-SYSTEM argument are the lines that were added so that Scala IDE could work on Java 9.

Installing Apache Spark 2.2.1

I have installed older Apache Spark versions and now the time is right to install Spark 2.2.1.

I’m using an AWS t2.micro instance with Ubuntu 16.04 on it. MobaXterm is my choice of interface to SSH into the instance.

System update

sudo apt-get update -y
sudo apt-get upgrade -y

Change instance name

Go into the hostname file and change the name to spark

sudo vi /etc/hostname

Add the instance name to the hosts file

After 127.0.0.1, write the name of the instance

sudo vi /etc/hosts
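With the hostname set to spark, the first line of /etc/hosts might then look like this (keeping localhost and appending the new name):

127.0.0.1 localhost spark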

Reboot the instance

sudo reboot

Install and set up Java

You will need Java at the latest for running the History Server. Java 8 is installed in the following way

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y

Add JAVA_HOME to the environment file

sudo vi /etc/environment

And add the following line to the top of the file

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Double-check that this is the correct Java home.
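One way to double-check is to resolve the java binary behind the update-alternatives symlinks:

# prints something like /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
readlink -f "$(which java)"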

Python

Spark 2.2.x supports Python 2.7+/3.4+. When running PySpark, Spark looks for Python in the /usr/bin directory. Ubuntu 16.04 on AWS comes with Python 3.5 only. When running PySpark without Python 2.7+, the following error message is output

/usr/apache/spark-2.2.1-bin-hadoop2.7/bin/pyspark: line 45: python: command not found
env: ‘python’: No such file or directory

There are two options: install Python 2.7+ or create a link to Python 3. The first alternative makes sense only if Python packages that exist for Python 2 but not for Python 3 are going to be used. Otherwise, the latter alternative is the way to go, and it is done in the following way

sudo ln -s /usr/bin/python3 /usr/bin/python

Running pyspark now starts the PySpark CLI with Python 3.5.2.

And yes, running python or python3 now executes the same action – it starts Python 3.5.

Create user spark

sudo adduser spark

Define a password; for the sake of testing, let’s go with spark

Prepare directory for Spark home

sudo mkdir /usr/apache

Step into the directory

cd /usr/apache

Download and unpack Apache Spark 2.2.1

sudo wget http://apache.uib.no/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
sudo tar -xvzf spark-2.2.1-bin-hadoop2.7.tgz

Clean up – delete the Spark tgz file

sudo rm spark-2.2.1-bin-hadoop2.7.tgz

Change the owner of the spark directory

sudo chown spark:spark /usr/apache/spark-2.2.1-bin-hadoop2.7

Create Spark home

cd spark-2.2.1-bin-hadoop2.7
pwd

The output of the pwd command is the value for SPARK_HOME in the environment file. Open the file

sudo vi /etc/environment

And add the below SPARK_HOME line before the PATH line

export SPARK_HOME=/usr/apache/spark-2.2.1-bin-hadoop2.7

At the end of PATH add

:${SPARK_HOME}/bin

(You need the colon to separate it from the previous values)
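Assuming the default Ubuntu 16.04 PATH, the relevant part of /etc/environment might then look like this:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/usr/apache/spark-2.2.1-bin-hadoop2.7
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:${SPARK_HOME}/bin"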
Refresh the environment file

source /etc/environment

Create log and pid directories

sudo mkdir -p /var/log/spark/logs
sudo chown spark:spark -R /var/log/spark
sudo -u spark mkdir $SPARK_HOME/run
sudo -u spark chmod 777 /var/log/spark/logs

The last line allows every user to read from and write to the directory. It is probably best to adjust the access according to your needs.

If different users are going to run Spark applications and those applications should be visible in the History Server, the user spark has to be added to those users’ groups.
For example, if the user ubuntu runs Spark applications, ubuntu writes log files to the Spark History log directory (check the property spark.history.fs.logDirectory below) and Spark opens them through the History Server. To be able to do the latter, spark has to be a member of the ubuntu group. This is done in the following way

sudo usermod -a -G ubuntu spark

Prepare spark-env.sh file

Open the file

sudo -u spark vi $SPARK_HOME/conf/spark-env.sh

Add the following values

SPARK_LOG_DIR=/var/log/spark
SPARK_PID_DIR=${SPARK_HOME}/run

Prepare the spark-defaults.conf file

Open the file

sudo -u spark vi $SPARK_HOME/conf/spark-defaults.conf

Add the following values

spark.history.fs.logDirectory file:/var/log/spark/logs
spark.eventLog.enabled true
spark.eventLog.dir file:/var/log/spark/logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

Start Spark History

sudo -u spark $SPARK_HOME/sbin/start-history-server.sh

The instance’s IP address on port 18080 should open the Spark History Server. If not, check /var/log/spark for errors and messages.
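A quick check from the instance itself whether the History Server is listening:

# should print 200 if the History Server answers on port 18080
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:18080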

Notes on TensorFlow – Introduction

Introduction

TensorFlow was developed by the Google Brain team and it was open sourced in November 2015.

TensorFlow is an open source software library for numerical computation. It is well suited for large-scale Machine Learning.

Basic principle – two steps:
– you define a graph of computations to perform
– TensorFlow takes the graph and runs it using optimized C++ code

It is possible to split the graph and run it parallel across multiple CPUs or GPUs.
TensorFlow supports distributed computing.

TensorFlow’s highlights:
– runs on Windows, Linux, macOS, iOS and Android
– provides simple Python API – TF.Learn, compatible with Scikit-Learn
– provides the simple TF-slim API for building, training and evaluating neural networks
– automatic differentiation – optimization nodes search for the parameters that minimize the cost function
– TensorBoard for graph visualization

First, the computation graph is created; at this point not even the variables are initialized.

To evaluate the graph, a TensorFlow session needs to be open. The session initializes the variables and evaluates the graph.

A TensorFlow program (typically) has two parts:
– construction phase – builds a computation graph representing the ML model and the computations to train it
– execution phase – runs the graph

When evaluating a node, TensorFlow determines which nodes it depends on and evaluates those first. In the example below the result for y is 22; for y to be evaluated, Tensor b has to be evaluated first.

import tensorflow as tf

#define a graph
a = tf.constant(1)
b = a + 10
y = b * 2
z = b * 3

#start a session
with tf.Session() as sess:
    #evaluate y
    print(y.eval())
    #evaluate z
    print(z.eval())

If a new evaluation in the same session uses Tensor b, the previous evaluation of b is not reused. In other words, b is evaluated twice – once when y is evaluated and once when z is evaluated.

All node values are dropped between graph runs.

To evaluate efficiently, make TensorFlow evaluate both Tensors in just one graph:

with tf.Session() as sess:
    y_val, z_val = sess.run([y, z])
    print(y_val)
    print(z_val)

Operations

TensorFlow operations (ops) take any number of inputs and return any number of outputs. The examples above take two inputs and produce one output.
Constants and variables (source ops) take no input.

Inputs and outputs are multidimensional arrays – tensors. Tensors have a type and a shape and are represented by NumPy ndarrays.

The code below defines two lists with different dimensions and one integer variable. Three TensorFlow constant nodes are created from them. Two more nodes are created: one multiplies a matrix by a scalar, the other multiplies two matrices. Both tensors are run in one graph and the outputs are printed.

list_1_3 = [[1.5, 2.7, 3.9]]
list_2_3 = [[10., 11., 12.], [13., 14., 15.]]
s = 2

#create TensorFlow constant node - matrix in shape (1,3)
tf_matrix_1_3 = tf.constant(list_1_3, dtype=tf.float32, name="tf_matrix_1_3")
#create TensorFlow constant node - matrix in shape (2,3)
tf_matrix_2_3 = tf.constant(list_2_3, dtype=tf.float32, name="tf_matrix_2_3")
#create TensorFlow constant node - scalar
scalar = tf.constant(s, dtype=tf.float32, name="scalar")

#multiply the matrix by scalar
multiply_matrix_scala = tf_matrix_1_3 * scalar

#matrix multiplication, transpose second matrix to follow matrix multiplication rules
multiply_matrices_tf = tf.matmul(tf_matrix_1_3, tf_matrix_2_3, transpose_b=True)

with tf.Session() as sess:
    res1_out, res2_out = sess.run([multiply_matrix_scala, multiply_matrices_tf])
    #print out two NumPy arrays as results of multiplication
    print(res1_out, "\n", res2_out)

Output:

[[ 3. 5.4000001 7.80000019]]
[[ 91.5 115.80000305]]

The main benefit of this code compared to doing the same with NumPy is that TensorFlow will automatically run it on a GPU, if a GPU is present and TensorFlow with GPU support is installed.

Placeholders

Placeholder nodes do not perform any computation; they just output the data at runtime. They are used to feed the training data to TensorFlow.

data = [[2, 3, 4], [5, 6, 7]]

#create placeholder with type float32 and unspecified number of rows with 3 columns
placeholder = tf.placeholder(tf.float32, shape=(None, 3))
square = tf.square(placeholder)

with tf.Session() as sess:
    res = sess.run(square, feed_dict={placeholder: data})
    print(res)

Output:

[[ 4. 9. 16.]
[ 25. 36. 49.]]

Adding service Druid to HDP 2.6 stack

Druid is a “fast column-oriented distributed data store”, according to the description in Ambari. It is a new service, added in HDP 2.6. The service is Technical Preview and the version offered is 0.9.2. Druid’s website is druid.io.

!!! Hortonworks Data Platform 2.6 is needed in order to install and use Druid !!!

Hortonworks has a very intriguing three-part series on ultra fast analytics with Hive and Druid. The first blog post can be found here.

This blog post describes how Druid is added to the HDP 2.6 stack with Ambari. The documentation I used is here. In my experience, it does not hold water; I had to make some adjustments in order to start all Druid services.

Requirements

  • Zookeeper: Druid requires installation of Zookeeper. This service is already installed on my cluster.
  • Deep storage: deep storage layer for Druid in HDP can either be HDFS or S3. Parameter “druid.storage.type” is used to define this. Installation default is HDFS.
  • Metadata storage: for holding information about Druid segments and tasks. MySql is my metadata storage of choice.
  • Batch execution engine: resource manager is YARN, execution engine is MapReduce2. Druid hadoop index tasks use MapReduce jobs for distributed ingestion of data.

All these requirements are taken care of in Ambari, most of them with a sufficient default value.

Services within Druid

  • Broker – the interface between users and Druid’s historical and realtime nodes.
  • Overlord – maintains a task queue of user-submitted tasks.
  • Coordinator – assigns segments to historical nodes, handles data replication and ensures that segments are distributed evenly across the historical nodes.
  • Druid Router – routes queries to multiple broker nodes.
  • Druid Superset – if you know Superset, you know Druid Superset – data visualization tool.

Pre-work in metadata storage

As mentioned, my metadata storage is MySql. There are some objects that have to be created manually for the Druid installation to go through.

Log in to MySql as root.

Create druid database

CREATE DATABASE druid DEFAULT CHARACTER SET utf8;
CREATE USER 'druid'@'%' IDENTIFIED BY 'druid';
GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'%';
FLUSH PRIVILEGES;

Create superset database

The Superset objects in the database have to be created even though the documentation does not mention this. The installation will not go through unless it can connect to the superset database and create tables in the superset schema.

CREATE DATABASE superset DEFAULT CHARACTER SET utf8;
CREATE USER 'superset'@'%' IDENTIFIED BY 'druid';
GRANT ALL PRIVILEGES ON superset.* TO 'superset'@'%';
FLUSH PRIVILEGES;

Adding service

In Ambari, click on Add Service and check Druid service.

(Screenshot: adding the Druid service)

In the next step, you are asked to define which Druid service is going to be installed on which node in the cluster. Remember, you can always move/add services.

(Screenshot: assigning masters to nodes)

The Broker is on the Client node, since that service is the gateway to the external world.

In the next step – Assigning Slaves and Clients – you define where the following two services will be installed:

  • Druid Historical: Loads data segments.
  • Druid MiddleManager: Runs Druid indexing tasks.

Generally, you should select Druid Historical and Druid MiddleManager on multiple nodes. Both services are on the namenode to begin with.

The next step is the settings. Some passwords and the MySql server need to be defined, as does a secret key – a random string of characters will do the trick.

Be sure to create the objects in MySql before you proceed with the installation.

(Screenshot: installation settings)

!!! Superset Database port should be 3306, just like Metadata storage port.

The advanced tab (picture above) is mostly for the Superset parameters – entering a name, an email and a password is needed to proceed with the installation. These are later used in the visualization tool Superset.

Once you click OK, you are asked to double-check and change some recommended values. The following ones are related to the Druid installation and the recommended values should be accepted.

(Screenshot: dependency configuration)

In the Review step, check if everything is as it should be and click Deploy.

After the installation completes, all Druid services should be up and running. If any services need a restart, do so.

Tweaking MapReduce2

There is one detail not mentioned in the Hortonworks documentation on installing Druid: two parameters in MapReduce2 have to be tweaked in order for Druid to successfully load data. The explanation is at the bottom.

The parameters are:

  • mapreduce.map.java.opts
  • mapreduce.reduce.java.opts

The following should be added at the end of the existing values:

-Duser.timezone=UTC -Dfile.encoding=UTF-8
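For example, if mapreduce.map.java.opts was previously just a heap setting such as -Xmx1228m (the heap size here is only an illustration – keep whatever your cluster already uses), the value after the change would look like this:

-Xmx1228m -Duser.timezone=UTC -Dfile.encoding=UTF-8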

How it looks in Ambari:

(Screenshots: the map and reduce java heap size parameters)

The service MapReduce2 should now be restarted.

Explanation

Various error messages show up in the Druid console log files when the Druid job starts to load the data. The error messages vary depending on the data but, generally, they do not provide any useful information.
In my experience, one error complained about the first line of a valid csv file, while in another case the error was that no data could be indexed (log below).

Caused by: java.lang.RuntimeException: No buckets?? seems there is no data to index.
	at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:176) ~[druid-indexing-hadoop-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	at io.druid.indexer.JobHelper.runJobs(JobHelper.java:349) ~[druid-indexing-hadoop-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_111]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_111]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_111]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_111]
	at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.2.6.0.3-8.jar:0.9.2.2.6.0.3-8]
	... 7 more