Installing Spark on a Hortonworks cluster using Ambari

The environment

Ubuntu Trusty 14.04. Ambari is used to install the cluster, and MySQL is used for storing Ambari’s metadata.
Spark is installed on a client node.

 

Note!

My experience with administering Spark from Ambari has led me to install Spark manually – not from Ambari and not from the Hortonworks packages. I install Apache Spark manually on a client node, as described here.

Some reasons for that are:

  • A new Spark version comes out every quarter, and Hortonworks does not keep up
  • The possibility of running different Spark versions on the same client
  • Better control over the configuration files
  • Custom definition of Spark parameters for running multiple Spark contexts on the same client node (more in this post).

 

Installation process in Ambari

The Hortonworks distribution (version 2.3.4) is installed using Ambari.
Services installed first: HDFS, MapReduce, YARN, Ambari Metrics, ZooKeeper – I prefer to install these first in order to test whether the bare minimum is up and running.

In the next step, Hive, Tez and Pig are installed.

After these services are successfully installed, Spark is installed.

Spark versions

Now it is Spark’s turn. Hortonworks distribution 2.3.4 offers Spark 1.4.1 in the Choose Services menu:

[Screenshot: Spark 1.4.1 in Ambari’s Choose Services menu]

Running the spark-shell command on the Spark server reveals that version 1.5.2 was in fact installed:

[Screenshots: the spark-shell welcome banner and sc.version, both showing Spark 1.5.2]
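
The installed version can also be double-checked from the command line (a small sketch – spark-submit prints the same version banner as spark-shell):

# print the Spark banner, including the version number
spark-submit --version

# or ask the Spark context itself from a non-interactive spark-shell
echo 'println(sc.version)' | spark-shell --master local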

Spark’s HOME

Spark’s home directory ($SPARK_HOME) is /usr/hdp/current/spark-client. It is smart to export $SPARK_HOME, since it is referred to by services that build on top of Spark.
Spark’s conf directory ($SPARK_CONF_DIR) is /usr/hdp/current/spark-client/conf.

The folder current holds nothing but symbolic links to the installed Hortonworks version. This means that /usr/hdp/current/spark-client is a link to /usr/hdp/2.3.4.0-3485/spark/.
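
A minimal example, using the paths from this installation:

export SPARK_HOME=/usr/hdp/current/spark-client
export SPARK_CONF_DIR=$SPARK_HOME/conf

# confirm that spark-client is only a symbolic link into the versioned directory
readlink /usr/hdp/current/spark-client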

Comments on the installation

The Spark installation from Ambari has, among other things, created a Linux user spark and a directory on HDFS – /user/spark.

The following Spark commands were installed:
spark-class, spark-shell, spark-sql, spark-submit – these can be called from anywhere, since they are linked in /usr/bin.
Other Spark commands – beeline, pyspark and sparkR – are not linked in /usr/bin, but can be executed from $SPARK_HOME/bin.
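
Both the /usr/bin links and the new HDFS directory can be verified from the shell (output will of course vary with the installation):

# Spark commands linked into /usr/bin
ls -l /usr/bin | grep spark

# the HDFS directory created for the spark user
sudo -u hdfs hdfs dfs -ls /user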

Connection to Hive

A hive-site.xml file can be found in $SPARK_CONF_DIR. The file has the following content:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hive-server:9083</value>
  </property>
</configuration>

With this property, Spark connects to Hive. Here are two lines from the output when the spark-shell command is executed:

16/02/22 13:52:37 INFO metastore: Trying to connect to metastore with URI thrift://hive-server:9083
16/02/22 13:52:37 INFO metastore: Connected to metastore.
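
To go one step beyond the log lines, a Hive query can be run from spark-shell (a minimal sketch; in this HDP build sqlContext is a HiveContext, so it picks up hive-site.xml):

# list the Hive databases through Spark's SQL context
echo 'sqlContext.sql("show databases").collect().foreach(println)' | spark-shell --master yarn-client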

Logs

Spark’s log files are by default in /var/log/spark. This can be changed in Ambari under Spark -> Configs -> Advanced spark-env, property spark_log_dir.
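
For a quick look at what has been logged so far (assuming the default location):

ls -lt /var/log/spark | head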

Running Spark commands

Examples of how to execute Spark commands (taken from the Hortonworks Spark 1.6 Technical Preview).
These should be run as the spark user from $SPARK_HOME.
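
For example (a minimal sketch, assuming sudo rights on the client node):

sudo su - spark
cd /usr/hdp/current/spark-client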

spark-shell

spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

spark-submit

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

sparkR

Running sparkR ($SPARK_HOME/bin/sparkR) returns the following:

env: R: No such file or directory

R is not installed yet. How to install the R environment is described here.
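
A minimal sketch for Ubuntu Trusty, assuming the stock repositories are sufficient (the linked post covers the details):

sudo apt-get update
sudo apt-get install -y r-base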

Resources

Spark 1.6 Technical Preview

 
