Networking in Spark: Configuring ports in Spark

A running Spark Context uses a number of ports. Most of them are chosen at random, which makes them difficult to control. This post describes how I control Spark's ports.

In my clusters, some nodes are dedicated client nodes: users can access them, store files under their respective home directories (defining home directories on an attached volume is described here), and run jobs on them.

Spark jobs can be run in different ways and from different interfaces: the Command Line Interface, Zeppelin, RStudio…

 

Links to Spark installation and configuration

Installing Apache Spark 1.6.0 on a multinode cluster

Building Apache Zeppelin 0.6.0 on Spark 1.5.2 in a cluster mode

Building Zeppelin-With-R on Spark and Zeppelin

What Spark Documentation says

Spark UI

The Spark User Interface, which shows the application's dashboard, uses port 4040 by default (link). The property name is

spark.ui.port

When a new Spark Context is submitted, Spark first tries to use port 4040. If that port is taken, 4041 is tried; if that one is taken as well, 4042 is tried, and so on until an available port is found (or the maximum number of attempts is reached).
Each unsuccessful attempt writes a WARN message to the log before the next port is tried. Example follows:

WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
INFO Utils: Successfully started service 'SparkUI' on port 4041.
INFO SparkUI: Started SparkUI at http://client-server:4041

According to the log, the Spark UI is now listening on port 4041.
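If the climb from 4040 upwards is not desirable, spark.ui.port can also be pinned explicitly when the context is created. A minimal PySpark sketch (4050 is just an example, any free port will do):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ui-port-example")
conf.set("spark.ui.port", "4050")  # example value, pick a port that is open on your node
sc = SparkContext(conf=conf)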

Not much randomizing for this port. That is not the case for the ports in the next chapter.

 

Networking

Looking at the Networking section of the Spark 1.6.x configuration documentation, this post focuses on the six properties whose default value is random, shown in the following picture:

[Image: the Spark networking properties with default value random]

When a Spark Context is being created, these properties receive random values:

spark.blockManager.port
spark.broadcast.port
spark.driver.port
spark.executor.port
spark.fileserver.port
spark.replClassServer.port

These are the properties that should be controlled. They can be controlled in different ways, depending on how the job is run.

 

Scenarios and solutions

If you do not care about the values assigned to these properties, no further steps are needed.

Configuring ports in spark-defaults.conf

If you are running one Spark application per node (for example: submitting Python scripts with spark-submit), you might want to define the properties in $SPARK_HOME/conf/spark-defaults.conf. Below is an example of what should be added to the configuration file.

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

If a test is run, for example spark-submit test.py, the Spark UI listens on the default port 4040 and the above-mentioned ports are used.
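Here test.py stands for any small PySpark script; a hypothetical minimal example, which keeps the context alive for a minute so the ports can be inspected, could look like this:

from pyspark import SparkConf, SparkContext
import time

sc = SparkContext(conf=SparkConf().setAppName("port-test"))
print(sc.parallelize(range(100)).count())  # a trivial job, should print 100
time.sleep(60)  # keep the application and its ports alive for a minute
sc.stop()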

Running the following command

sudo netstat -tulpn | grep 3800

returns the following output:

tcp6      0      0      :::38000                          :::*      LISTEN      25300/java
tcp6      0      0      10.0.173.225:38002     :::*      LISTEN      25300/java
tcp6      0      0      10.0.173.225:38003     :::*      LISTEN      25300/java
tcp6      0      0      :::38004                          :::*      LISTEN      25300/java
tcp6      0      0      :::38005                          :::*      LISTEN      25300/java

 

Configuring ports directly in a script

In my case, different users like to run Spark applications in different ways. Here is an example of how the ports are configured in a Python script.

"""Pi-estimation.py"""

from random import random, randint
from pyspark.context import SparkContext
from pyspark.conf import SparkConf

def sample(p):
    # draw a random point in the unit square and check whether it falls inside the unit circle
    x, y = random(), random()
    return 1 if x * x + y * y < 1 else 0

conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("Pi")

conf.set("spark.ui.port", "4042")

conf.set("spark.blockManager.port", "38020")
conf.set("spark.broadcast.port", "38021")
conf.set("spark.driver.port", "38022")
conf.set("spark.executor.port", "38023")
conf.set("spark.fileserver.port", "38024")
conf.set("spark.replClassServer.port", "38025")

conf.set("spark.driver.memory", "4g")
conf.set("spark.executor.memory", "4g")

sc = SparkContext(conf=conf)

NUM_SAMPLES = randint(5000000, 100000000)
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
    .reduce(lambda a, b: a + b)
print("NUM_SAMPLES is %i" % NUM_SAMPLES)
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
(The Pi estimation above is adapted from an example that ships with the Spark installation.)

The property values set in the script override the values from the spark-defaults.conf file. For the runtime of this script, port 4042 and ports 38020-38025 are used.
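The effective values can also be read back from the running context, which is a quick way to confirm which setting won. A small sketch, using the SparkContext created in the script above:

for key in ("spark.ui.port", "spark.driver.port", "spark.blockManager.port"):
    print("%s = %s" % (key, sc.getConf().get(key)))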

If the netstat command is run again for all ports that start with 380

sudo netstat -tulpn | grep 380

the following output is shown:

tcp6           0           0           :::38000                              :::*          LISTEN          25300/java
tcp6           0           0           10.0.173.225:38002         :::*          LISTEN          25300/java
tcp6           0           0           10.0.173.225:38003         :::*          LISTEN          25300/java
tcp6           0           0           :::38004                              :::*          LISTEN          25300/java
tcp6           0           0           :::38005                              :::*          LISTEN          25300/java
tcp6           0           0           :::38020                              :::*          LISTEN          27280/java
tcp6           0           0           10.0.173.225:38022         :::*          LISTEN          27280/java
tcp6           0           0           10.0.173.225:38023         :::*          LISTEN          27280/java
tcp6           0           0           :::38024                              :::*          LISTEN          27280/java

Two Java processes are running, each one a separate Spark application, on the ports that were defined beforehand.

 

Configuring ports in Zeppelin

Since my users use Apache Zeppelin, similar network management had to be done there. Zeppelin also submits jobs to a Spark Context through the spark-submit command, which means the properties can be configured in the same way, this time through the interpreter settings in Zeppelin:

Open the Interpreter menu and select the spark interpreter. Now it is all about adding the new properties and their respective values. Do not forget to click the plus sign when you are ready to add a new property.
At the very end, save everything and restart the spark interpreter.

Below is an example of how this is done:

[Image: the port properties added to the spark interpreter in Zeppelin]

Next time a Spark context is created in Zeppelin, the ports will be taken into account.
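Once the interpreter has been restarted, the values can be double-checked from a notebook paragraph, for example with PySpark (a quick sketch; sc is the SparkContext that Zeppelin provides):

%pyspark
print(sc.getConf().get("spark.driver.port"))
print(sc.getConf().get("spark.blockManager.port"))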

 

Conclusion

This can be useful if multiple users are running Spark applications on one machine and have separate Spark Contexts.

In the case of Zeppelin, this comes in handy when one Zeppelin instance is deployed per user.

 


SparkContext allocates random ports. How to control the port allocation.

When a SparkContext is being created, a bunch of random ports are allocated to run the Spark services. This can be annoying when you have security groups to think of.

Note!
A more detailed post on this topic is here.

Here is an example of how random ports are allocated when a Spark service is started:

[Image: randomly allocated ports when a Spark service starts]

The only sure bet is 4040 (or 404x, depending on how many Spark Web UIs have already been started).

On the Apache Spark website, under Configuration, in the Networking section, six port properties have the default value random. These are the properties that have to be tamed.

[Image: the only six properties with default value random among all Spark properties]

Solution

Open $SPARK_HOME/conf/spark-defaults.conf:

sudo -u spark vi conf/spark-defaults.conf

The following properties should be added:

spark.blockManager.port    38000
spark.broadcast.port       38001
spark.driver.port          38002
spark.executor.port        38003
spark.fileserver.port      38004
spark.replClassServer.port 38005

I have picked port range 38000-38005 for my Spark services.

If I run a Spark service now, the ports in use are as defined in the configuration file:

[Image: the Spark ports now bound as defined in the configuration file]
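The same can be confirmed from inside a running application as well, for example from PySpark (a quick check; sc is the active SparkContext):

print(sc.getConf().get("spark.blockManager.port"))
print(sc.getConf().get("spark.driver.port"))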

Installing Apache Spark 1.6.x on a multinode cluster

I am running an HDP 2.3.4 multinode cluster with Ubuntu Trusty 14.04 on all my nodes. The Spark in this post is installed on my client node. My cluster has HDFS and YARN, among other services, all installed from Ambari. That is not the case for Apache Spark 1.6, because Hortonworks does not offer Spark 1.6 for HDP 2.3.4.

The documentation on Spark version 1.6 is here.

My post on setting up Apache Spark 2.0.0.

Prerequisites

Java

Add the OpenJDK repository, update the package list and install Java

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y

Add JAVA_HOME to the system variables file

sudo vi /etc/environment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Spark user

Create user spark and add it to group hadoop

sudo adduser spark
sudo usermod -a -G hadoop spark

HDFS home directory for user spark

Create spark’s user folder in HDFS

sudo -u hdfs hadoop fs -mkdir -p /user/spark
sudo -u hdfs hadoop fs -chown -R spark:hdfs /user/spark

Spark installation and configuration

Install Spark

Create the directory where the Spark directory is going to reside. Hortonworks installs its services under /usr/hdp; I am following their lead and installing all Apache services under /usr/apache. Create the directory and step into it.

sudo mkdir /usr/apache
cd /usr/apache

Download Spark 1.6.0 from https://spark.apache.org/downloads.html. I have Hadoop 2.7.1, so the package pre-built for Hadoop 2.6 does the trick.

sudo wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz

Unpack the tar file

sudo tar -xvzf spark-1.6.0-bin-hadoop2.6.tgz

Remove the tar file after it has been unpacked

sudo rm spark-1.6.0-bin-hadoop2.6.tgz

Change the ownership of the folder and its contents

sudo chown -R spark:spark spark-1.6.0-bin-hadoop2.6

Update system variables

Step into the Spark 1.6.0 directory and run pwd to get the full path

cd spark-1.6.0-bin-hadoop2.6
pwd

Update the system environment file by adding SPARK_HOME and appending ${SPARK_HOME}/bin to the PATH

sudo vi /etc/environment

export SPARK_HOME=/usr/apache/spark-1.6.0-bin-hadoop2.6

At the end of PATH add

${SPARK_HOME}/bin

Refresh the system environment

source /etc/environment

Change the owner of $SPARK_HOME to spark

sudo chown -R spark:spark $SPARK_HOME

Log and pid directories

Create log and pid directories

sudo mkdir /var/log/spark
sudo chown spark:spark /var/log/spark
sudo -u spark mkdir $SPARK_HOME/run

Spark configuration files

Hive configuration

sudo -u spark vi $SPARK_HOME/conf/hive-site.xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Make sure that <value> points to the Hive Metastore URI in your cluster -->
    <value>thrift://hive-server:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
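Once Spark is installed and running, the metastore connection can be sanity-checked from PySpark, for example like this (a sketch; it assumes an existing SparkContext sc and a reachable metastore):

from pyspark.sql import HiveContext

hc = HiveContext(sc)  # sc is an existing SparkContext
hc.sql("SHOW DATABASES").show()  # should list the databases known to the Hive metastore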

Spark environment file

Create a new file under $SPARK_HOME/conf

sudo vi conf/spark-env.sh

Add the following lines and adjust accordingly.

export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=${SPARK_HOME}/run
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_SUBMIT_OPTIONS="--jars ${SPARK_HOME}/lib/spark-csv_2.10-1.4.0.jar"

The last line serves as an example of how to add external libraries to Spark. This particular package is quite common, and it is advisable to install it. The package can be downloaded from this site.
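To check that the library is actually picked up, a CSV file can be read from PySpark, roughly like this (a sketch; the file path is hypothetical):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is an existing SparkContext
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/user/spark/example.csv")  # hypothetical path in HDFS
df.printSchema()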

Spark default file

Fetch HDP version

hdp-select status hadoop-client | awk '{print $3;}'

Example output

2.3.4.0-3485

Create spark-defaults.conf file in $SPARK_HOME/conf

sudo -u spark vi conf/spark-defaults.conf

Add the following and adjust accordingly (some properties belong to the Spark History Server, whose configuration is explained in the post linked below)

spark.driver.extraJavaOptions -Dhdp.version=2.3.4.0-3485
spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.4.0-3485
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080

spark.history.kerberos.keytab none
spark.history.kerberos.principal none

spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address spark-server:18080
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3

spark.jars.packages com.databricks:spark-csv_2.10:1.4.0

spark.io.compression.codec lzf

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

The ports are defined in this configuration file. If they are not, Spark assigns random ports. More on ports and how to assign them in Spark can be found here.

JAVA OPTS

Create java-opts file in $SPARK_HOME/conf and add your HDP version

sudo -u spark vi conf/java-opts

-Dhdp.version=2.3.4.0-3485

Fixing links in Ubuntu

Since the Hadoop distribution is from Hortonworks and Spark is from Apache, a small workaround is needed. Remove the default link and create new ones

sudo rm /usr/hdp/current/spark-client
sudo ln -s /usr/apache/spark-1.6.0-bin-hadoop2.6 /usr/hdp/current/spark-client
sudo ln -s /usr/hdp/current/spark-client/bin/sparkR /usr/bin/sparkR

Spark 1.6.0 is now ready.
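A quick way to verify the installation is to submit a trivial PySpark job in yarn-client mode, for example by saving the sketch below as smoke_test.py (a hypothetical name) and running spark-submit --master yarn-client smoke_test.py:

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("smoke-test"))
print(sc.parallelize(range(1000)).sum())  # should print 499500
sc.stop()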

How Spark History Server is configured and brought to life is explained here.