Ubuntu Trusty 14.04. Ambari is used to install the cluster. MySQL is used for storing Ambari’s metadata.
Spark is installed on a client node.
My experience with administering Spark from Ambari has led me to install Spark manually, rather than from Ambari or from Hortonworks packages. I install Apache Spark manually on a client node – described here.
Some reasons for that are:
- New Spark version available every quarter – Hortonworks does not keep up
- Possibility of running different Spark versions on the same client
- Better control over configuration files
- Custom definition of Spark parameters for running multiple Spark contexts on the same client node (more in this post).
Installation process in Ambari
The Hortonworks distribution, version 2.3.4, is installed using Ambari.
Services installed first: HDFS, MapReduce, YARN, Ambari Metrics, Zookeeper – I prefer to install these first in order to test if the bare minimum is up and running.
In the next step, Hive, Tez and Pig are installed.
After these have installed successfully, Spark is installed.
Hortonworks distribution 2.3.4 offers Spark 1.4.1 in the Choose Services menu:
Running the spark-shell command on the Spark server, however, reveals that 1.5.2 was installed:
Spark’s home directory ($SPARK_HOME) is /usr/hdp/current/spark-client. It is smart to export $SPARK_HOME, since it is referred to by services that build on top of Spark.
Spark’s conf directory ($SPARK_CONF_DIR) is /usr/hdp/current/spark-client/conf.
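Both variables can be exported in the shell profile of the user that runs Spark. A minimal sketch using the paths given above; which profile file it goes into (for example ~/.bashrc) is an assumption and depends on your setup:

```shell
# Paths as laid out by the Ambari/HDP installation described above.
export SPARK_HOME=/usr/hdp/current/spark-client
export SPARK_CONF_DIR="$SPARK_HOME/conf"
```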
The current folder contains nothing but links to the installed Hortonworks version. This means that /usr/hdp/current/spark-client is linked to /usr/hdp/2.3.4.0-3485/spark/.
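To see where the link points on your own node, it can be resolved with readlink. A small sketch; the helper name resolve_link is hypothetical, and readlink -f assumes GNU coreutils, as found on Ubuntu:

```shell
# resolve_link is a hypothetical helper: print the final target of a symlink.
resolve_link() {
  readlink -f "$1"
}

# Only query the HDP link if it exists on this machine.
if [ -e /usr/hdp/current/spark-client ]; then
  resolve_link /usr/hdp/current/spark-client
fi
```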
Comments on the installation
The Spark installation from Ambari has, among other things, created a Linux user spark and a directory on HDFS – /user/spark.
The following Spark commands were installed:
spark-class, spark-shell, spark-sql, spark-submit – these can be called from anywhere, since they are linked in /usr/bin.
Other Spark commands, which are not linked in /usr/bin but can be executed from $SPARK_HOME/bin, are beeline, pyspark and sparkR.
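A quick way to check which of these commands are reachable from anywhere is to look them up on the PATH. A sketch only; the function name report_cmd is made up for illustration:

```shell
# report_cmd (hypothetical helper): say where a command resolves, if at all.
report_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: $(command -v "$1")"
  else
    echo "$1: not on PATH"
  fi
}

# Commands listed above; the last three are expected only in $SPARK_HOME/bin.
for cmd in spark-class spark-shell spark-sql spark-submit beeline pyspark sparkR; do
  report_cmd "$cmd"
done
```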
Connection to Hive
The hive-site.xml file can be found in $SPARK_CONF_DIR. It has the following content:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hive-server:9083</value>
  </property>
</configuration>
With this property, Spark connects to Hive. Here are two lines from the output when the spark-shell command is executed:
16/02/22 13:52:37 INFO metastore: Trying to connect to metastore with URI thrift://hive-server:9083
16/02/22 13:52:37 INFO metastore: Connected to metastore.
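To check from the command line which metastore Spark will try, the URI can be pulled out of hive-site.xml. A rough sketch using tr and sed rather than a real XML parser; it assumes the simple layout shown above, and metastore_uri is a hypothetical helper name:

```shell
# metastore_uri (hypothetical helper): print the hive.metastore.uris value
# from a hive-site.xml file. Plain text matching, not real XML parsing.
metastore_uri() {
  tr -d '\n' < "$1" |
    sed -n 's/.*<name>hive\.metastore\.uris<\/name>[[:space:]]*<value>\([^<]*\)<\/value>.*/\1/p'
}

# Fall back to the HDP conf directory if SPARK_CONF_DIR is not exported.
conf="${SPARK_CONF_DIR:-/usr/hdp/current/spark-client/conf}/hive-site.xml"
if [ -f "$conf" ]; then
  metastore_uri "$conf"
fi
```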
Spark’s log files are in /var/log/spark by default. This can be changed in Ambari: Spark -> Configs -> Advanced spark-env, property spark_log_dir.
Running Spark commands
Examples of how to execute the Spark commands (taken from the Hortonworks Spark 1.6 Technical Preview).
These should be run as the spark user from $SPARK_HOME.
spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
Running sparkR ($SPARK_HOME/bin/sparkR) returns the following:
env: R: No such file or directory
R is not installed yet. How to install the R environment is described here.