Yarn application has already ended! It might have been killed or unable to launch application master.

If you are struggling with the error message in the title of this post, check whether you are controlling the ports Spark needs. In my experience, if the ports Spark is using cannot be reached, YARN terminates the application with the error message in the title. It is therefore best to take control of Spark's ports and open them, so that the YARN application can go through. More on Spark and networking here.

Spark chooses random ports, and unless you have ALL of them open, you might run into the seemingly endless

INFO Client: Application report for application_1470560331181_0013 (state: ACCEPTED)

which eventually fails

INFO Client: Application report for application_1470560331181_0013 (state: FAILED)

and the error message returned would be

ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

Adding something like this in spark-defaults.conf

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

could solve this issue.

My notes on installing Spark 2.0 are here.

And how to install Spark 1.6 is described here.

Configuring Apache Spark History Server

Prior to configuring and running Spark History Server, Spark should be installed.

How to install Apache Spark 1.6.0 is described here.

How to install Apache Spark 2.0 is described here.

Spark History Server

Check that $SPARK_HOME/conf/spark-defaults.conf has the History Server properties set:

spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080

spark.history.kerberos.keytab none
spark.history.kerberos.principal none

Create spark-history directory in HDFS

sudo -u hdfs hadoop fs -mkdir /spark-history

Change the owner of the directory

sudo -u hdfs hadoop fs -chown spark:hdfs /spark-history

Change permission (be more restrictive if necessary)

sudo -u hdfs hadoop fs -chmod 777 /spark-history

Add user spark to group hdfs on the instance where Spark History Server is going to run

sudo usermod -a -G hdfs spark

To view Spark jobs from other users
When you open the History Server and cannot see the Spark jobs you are expecting, check the Spark out file in the Spark log directory. If the error message “Permission denied” is present, the Spark History Server is trying to read a job log file but has no permission to do so.
The spark user should be added to the group of the Spark job's owner.
For example, user marko belongs to the group employee. If marko starts a Spark job, the log file for this job will have user and group marko:employee. For spark to be able to read the log file, the spark user should be added to the employee group. This is done in the following way:

sudo usermod -a -G employee spark

Checking spark’s groups

groups spark

should return group employee among spark’s groups.

Start Spark History server

sudo -u spark $SPARK_HOME/sbin/start-history-server.sh

Output:

starting org.apache.spark.deploy.history.HistoryServer, logging to /var/log/spark/spark-spark-org.apache.spark.deploy.history.HistoryServer-1-t-client01.out

The Spark History Server web UI can be reached at spark-server:18080. The following screen should load.

spark18080
A fresh Spark History Server installation has no applications to show (no applications in hdfs:///spark-history).

Spark History Server offers a great monitoring interface for Spark applications!
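
Besides the web UI, the History Server also exposes a REST API under /api/v1. Below is a minimal Python sketch (assuming the host spark-server and port 18080 from the configuration above) that lists the applications the server has registered:

# list_history_apps.py - a quick look at what the History Server knows about
import json
import urllib2

# adjust the host if your History Server runs elsewhere
url = "http://spark-server:18080/api/v1/applications"
applications = json.load(urllib2.urlopen(url))

for app in applications:
    # every entry carries at least an application id and a name
    print("%s  %s" % (app["id"], app["name"]))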

WARN ServletHandler: /api/v1/applications

If you happen to start the Spark History Server but see neither completed nor incomplete applications in the Web UI, check the log files. If you find something like the following

WARN ServletHandler: /api/v1/applications
java.lang.NullPointerException
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
        at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
        at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
        at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
        at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
        at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.spark_project.jetty.server.Server.handle(Server.java:499)
        at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
        at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
        at org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
        at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:745)

Take the jersey-bundle-*.jar file out of the $SPARK_HOME/jars directory. Hortonworks doesn't need it, and neither do you 🙂

Networking in Spark: Configuring ports in Spark

For a Spark Context to run, a number of ports are used. Most of them are chosen randomly, which makes them difficult to control. This post describes how I control Spark's ports.

In my clusters, some nodes are dedicated client nodes, which means users can access them, store files under their respective home directories (defining home on an attached volume is described here), and run jobs on them.

The Spark jobs can be run in different ways, from different interfaces – Command Line Interface, Zeppelin, RStudio…

 

Links to Spark installation and configuration

Installing Apache Spark 1.6.0 on a multinode cluster

Building Apache Zeppelin 0.6.0 on Spark 1.5.2 in a cluster mode

Building Zeppelin-With-R on Spark and Zeppelin

What Spark Documentation says

Spark UI

The Spark User Interface, which shows the application's dashboard, uses port 4040 by default (link). The property name is

spark.ui.port

When a new Spark Context is submitted, port 4040 is tried first. If it is taken, 4041 is tried; if that one is also taken, 4042 is tried, and so on until an available port is found (or the maximum number of attempts, controlled by spark.port.maxRetries, is reached).
Each unsuccessful attempt logs a WARN and moves on to the next port. For example:

WARN Utils: Service ‘SparkUI’ could not bind on port 4040. Attempting port 4041.
INFO Utils: Successfully started service ‘SparkUI’ on port 4041.
INFO SparkUI: Started SparkUI at http://client-server:4041

According to the log, the Spark UI is now listening on port 4041.

There is not much randomness to this port. That is not the case for the ports in the next section.
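
To see which of these UI ports are already taken on a client node, a quick local check can be done from Python (a simple sketch using only the standard library; the probed range is just an assumption matching the retry behaviour described above):

import socket

# probe the ports the Spark UI would try, in the order Spark tries them
for port in range(4040, 4050):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # connect_ex returns 0 when something is already listening
        taken = s.connect_ex(("127.0.0.1", port)) == 0
    finally:
        s.close()
    print("port %d: %s" % (port, "in use" if taken else "free"))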

 

Networking

Looking at the Networking section of the Spark 1.6.x configuration documentation, this post focuses on the six properties that have the default value random in the following picture:

spark networking.JPG

When a Spark Context is being created, these receive random values:

spark.blockManager.port
spark.broadcast.port
spark.driver.port
spark.executor.port
spark.fileserver.port
spark.replClassServer.port

These are the properties that should be controlled. They can be controlled in different ways, depending on how the job is run.

 

Scenarios and solutions

If you do not care which values are assigned to these properties, no further steps are needed.

Configuring ports in spark-defaults.conf

If you are running one Spark application per node (for example, submitting Python scripts with spark-submit), you might want to define the properties in $SPARK_HOME/conf/spark-defaults.conf. Below is an example of what should be added to the configuration file.

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

If a test is run, for example spark-submit test.py, the Spark UI defaults to port 4040 and the above-mentioned ports are used.

Running the following command

sudo netstat -tulpn | grep 3800

Returns the following output:

tcp6      0      0      :::38000                          :::*      LISTEN      25300/java
tcp6      0      0      10.0.173.225:38002     :::*      LISTEN      25300/java
tcp6      0      0      10.0.173.225:38003     :::*      LISTEN      25300/java
tcp6      0      0      :::38004                          :::*      LISTEN      25300/java
tcp6      0      0      :::38005                          :::*      LISTEN      25300/java

 

Configuring ports directly in a script

In my case, different users prefer different ways of running Spark applications. Here is an example of how the ports are configured inside a Python script.

"""Pi-estimation.py"""

from random import randint
from pyspark.context import SparkContext
from pyspark.conf import SparkConf

def sample(p):
x, y = randint(0,1), randint(0,1)
print(x)
print(y)
return 1 if x*x + y*y < 1 else 0

conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("Pi")

conf.set("spark.ui.port", "4042")

conf.set("spark.blockManager.port", "38020")
conf.set("spark.broadcast.port", "38021")
conf.set("spark.driver.port", "38022")
conf.set("spark.executor.port", "38023")
conf.set("spark.fileserver.port", "38024")
conf.set("spark.replClassServer.port", "38025")

conf.set("spark.driver.memory", "4g")
conf.set("spark.executor.memory", "4g")

sc = SparkContext(conf=conf)

NUM_SAMPLES = randint(5000000, 100000000)
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
.reduce(lambda a, b: a + b)
print("NUM_SAMPLES is %i" % NUM_SAMPLES)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
(The above Pi estimation is a Spark example that comes with Spark installation)

The property values set in the script override the properties in the spark-defaults.conf file. For the runtime of this script, port 4042 and ports 38020-38025 are used.

If netstat command is run again for all ports that start with 380

sudo netstat -tulpn | grep 380

The following output is shown:

tcp6           0           0           :::38000                              :::*          LISTEN          25300/java
tcp6           0           0           10.0.173.225:38002         :::*          LISTEN          25300/java
tcp6           0           0           10.0.173.225:38003         :::*          LISTEN          25300/java
tcp6           0           0           :::38004                              :::*          LISTEN          25300/java
tcp6           0           0           :::38005                              :::*          LISTEN          25300/java
tcp6           0           0           :::38020                              :::*          LISTEN          27280/java
tcp6           0           0           10.0.173.225:38022         :::*          LISTEN          27280/java
tcp6           0           0           10.0.173.225:38023         :::*          LISTEN          27280/java
tcp6           0           0           :::38024                              :::*          LISTEN          27280/java

Two processes are running, each with its own Spark application, on the ports that were defined beforehand.
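
Another way to confirm which values ended up being used is to ask the SparkContext itself for its effective configuration. Below is a small stand-alone sketch of the idea (it picks a different driver port so it does not clash with the examples above; getConf() returns a copy of the effective SparkConf):

"""verify-ports.py - print the port settings a SparkContext ends up with"""

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("verify-ports")
conf.set("spark.driver.port", "38032")   # overrides the value from spark-defaults.conf

sc = SparkContext(conf=conf)

for key in ("spark.ui.port", "spark.driver.port", "spark.blockManager.port"):
    # properties that were never set are reported as "not explicitly set"
    print("%s = %s" % (key, sc.getConf().get(key, "not explicitly set")))

sc.stop()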

 

Configuring ports in Zeppelin

Since my users use Apache Zeppelin, similar network management had to be done there. Zeppelin also sends jobs to the Spark Context through the spark-submit command, which means the properties can be configured in the same way, this time through the Spark interpreter in Zeppelin:

Choose the Interpreter menu and select the spark interpreter. Now it is all about adding the new properties and their respective values; do not forget to click the plus sign when you are ready to add a new property.
At the very end, save everything and restart the spark interpreter.

Below is an example of how this is done:

spark zeppelin ports

Next time a Spark context is created in Zeppelin, the ports will be taken into account.

 

Conclusion

This can be useful if multiple users are running Spark applications on one machine and have separate Spark Contexts.

In case of Zeppelin, this comes in handy when one Zeppelin instance is deployed per user.

 

SparkR and R – DataFrame and data.frame

I currently work as a Big Data Engineer at the University of St. Gallen. Many researchers work here and are using R to make their research easier. They are familiar with R’s limitations and workarounds.

One of my tasks is introducing SparkR to the researchers. This post gives a short introduction to SparkR and R and clears out any doubt about data frames.

R is a popular tool for statistics and data analysis. It uses data frames (data.frame), has rich visualization capabilities, and offers the many libraries the R community is developing.

The challenge with R is making it work on big data: using R on huge datasets and on big data clusters.

Spark is a fast and general engine for data processing. It is growing rapidly and has been adopted by many organizations for running faster calculations on big datasets.

SparkR is an R package that provides an interface for using Spark from R. It enables R users to run jobs on big data clusters with Spark. The SparkR 1.6.0 API is available here.

data.frame in R is a list of vectors with equal length.

DataFrame in Spark is a distributed collection of data organized into named columns.

When working with SparkR and R, it is very important to understand that there are two different data frames in question – R data.frame and Spark DataFrame. Proper combination of both is what gets the job done on big data with R.

In practice, the first step is to process the big data using SparkR and its DataFrames. As much as possible is done in this stage (cleaning, filtering, aggregation, various statistical operations).

When the dataset has been processed by SparkR, it can be collected into an R data.frame. This is done by calling collect(), which “transforms” a SparkR DataFrame into an R data.frame. When collect() is called, the elements of the SparkR DataFrame are gathered from all workers and pushed into an R data.frame on the client, where plain R functionality can be used.

SparkR R flow

Figure 1

A common question R users ask is “Can I run collect on all my big data and then do the R analysis?” The answer is no. If you do that, you are back where you were before looking into SparkR: doing all your processing, cleaning, wrangling and data science on the client.

Useful installation posts

How to manually install Spark 1.6.0 on a multinode Hadoop cluster is described here: Installing Apache Spark 1.6.0 on a multinode cluster

How to install SparkR on a multinode Hadoop cluster is described here: Installing R on Hadoop cluster to run sparkR

I am testing SparkR and Pyspark in Zeppelin and the Zeppelin installation process is here: Building Zeppelin-With-R on Spark and Zeppelin

Practical examples

Let us see how this works in practice:

I have a file in Hadoop (HDFS): a 1.9 GB CSV file with a little over 20 million rows. Looking at Figure 1, this file is in the blue box.

I run a read.df() command to load the data from the data source into a DataFrame (orange box in Figure 1).

crsp <- read.df(sqlContext, "/tmp/crsp/AllCRSP.csv", source = "com.databricks.spark.csv", inferSchema = "true", header = "true")

Object crsp is a SparkR object, not an R object, which means I can run SparkR commands on it.

If I run str(crsp) command, which is an R command I get the following:

Formal class ‘DataFrame’ [package “SparkR”] with 2 slots
..@ env:<environment: 0x3b44be0>
..@ sdf:Class ‘jobj’ <environment: 0x2aaf5f8>

This does not look familiar to an R user. Data frame crsp is a SparkR DataFrame.

SparkR DataFrame to R data.frame

Since the DataFrame crsp has over 20 million rows, I am going to take a small sample to create a new DataFrame for this example:

df_sample <- sample(crsp, withReplacement = FALSE, fraction = 0.0001)

I have created a new DataFrame called df_sample. It has a bit over 2000 rows, and SparkR functionality can be used to manipulate it.

I am going to create an R data.frame out of the df_sample DataFrame:

rdf_sample <- collect(df_sample)

I now have two data frames: SparkR DataFrame called df_sample and R data.frame called rdf_sample.

Running str() on both objects gives me the following outputs:

str(df_sample)

Formal class ‘DataFrame’ [package “SparkR”] with 2 slots
..@ env:<environment: 0xdb19d40>
..@ sdf:Class ‘jobj’ <environment: 0xd9b4500>

str(rdf_sample)

‘data.frame’: 2086 obs. of 16 variables:
$ PERMNO : int 15580 59248 90562 15553 11499 61241 14219 14868 14761 11157 …
$ Date : chr “20151120” “20111208” “20061213” “20110318” …
$ SICCD : chr “6320” “2082” “2082” “6719” …
$ TICKER : chr “AAME” “TAP” “TAP” “AGL” …
$ NAICS : int 524113 312120 312120 551112 333411 334413 541511 452910 211111 334511 …
$ PERMCO : int 5 33 33 116 176 211 283 332 352 376 …
$ HSICMG : int 63 20 20 49 35 36 73 54 29 38 …
$ BIDLO : num 4.51 40.74 74 38.75 4.03 …
$ ASKHI : num 4.89 41.26 75 39.4 4.11 …
$ PRC : num 4.69 41.01 -74.5 38.89 4.09 …
$ VOL : int 1572 1228200 0 449100 20942 10046911 64258 100 119798 19900 …
$ SHROUT : int 20547 175544 3073 78000 14289 778060 24925 2020 24165 3728 …
$ CFACPR : num 1 1 2 1 1 1 0.2 1 1 1 …
$ CFACSHR: num 1 1 2 1 1 1 0.2 1 1 1 …
$ OPENPRC: num 4.88 41.23 NA 39.07 4.09 …
$ date : chr “20151120” “20111208” “20061213” “20110318” …

Datatype conversion example

Running a SparkR command on the DataFrame:

df_sample$Date <- cast(df_sample$Date, "string")

Is going to convert the datatype from Integer to String.

Running an R command on it will not be effective:

df_sample$Date <- as.character(df_sample$Date)

Output:

Error in as.character.default(df_sample$Date) :
no method for coercing this S4 class to a vector

Since I have converted the DataFrame df_sample to data.frame rdf_sample, I can now put my R hat on and use R functionalities:

rdf_sample$Date <- as.character(rdf_sample$Date)

R data.frame to SparkR DataFrame

In some cases, you have to go the other way – converting an R data.frame to SparkR DataFrame. This is done by using createDataFrame() method

new_df_sample <- createDataFrame(sqlContext, rdf_sample)

If I run str(new_df_sample) I get the following output:

Formal class ‘DataFrame’ [package “SparkR”] with 2 slots
..@ env:<environment: 0xdd9ddb0>
..@ sdf:Class ‘jobj’ <environment: 0xdd99f40>

Conclusion

Knowing which data frame you are about to manipulate saves you a great deal of trouble. It is important to understand what SparkR can do for you so that you can take the best from both worlds.

SparkR is a newer addition to the Spark family. The community is still small, so patience is a key ingredient.

Zeppelin thrift error “Can’t get status information”

I have multiple users on one client node who are going to use and test ZeppelinR. For every Zeppelin user I create a copy of the built Zeppelin folder in the user's home directory and dedicate a port to that user (8080 is reserved for my own testing); my first user, for example, got port 8082. This is set in the user's $ZEPPELIN_HOME/conf/zeppelin-site.xml.

Example for one user:

<property>
  <name>zeppelin.server.port</name>
  <value>8082</value>
  <description>Server port.</description>
</property>

Running Zeppelin as root is not a big problem, and running ZeppelinR as root is also not particularly problematic. Running it as a normal Linux user, however, can bring some challenges.

There is an error message that can surprise you when starting a new Spark context from the Zeppelin Web UI.

Taken from Zeppelin log file (zeppelin-user_running_zeppelin-t-client01.log):

ERROR [2016-03-18 08:10:47,401] ({Thread-20} RemoteScheduler.java[getStatus]:270) – Can’t get status information
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getStatus(RemoteInterpreterService.java:355)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getStatus(RemoteInterpreterService.java:342)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:256)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.run(RemoteScheduler.java:205)
ERROR [2016-03-18 08:11:47,347] ({pool-1-thread-2} RemoteScheduler.java[getStatus]:270) – Can’t get status information
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getStatus(RemoteInterpreterService.java:355)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getStatus(RemoteInterpreterService.java:342)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:256)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:335)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

The zeppelin out file (zeppelin-user_running_zeppelin-t-client01.out) gives a more concrete description of the problem:

Exception in thread "Thread-80" org.apache.zeppelin.interpreter.InterpreterException: java.lang.RuntimeException: Could not find rzeppelin - it must be in either R/lib or ../R/lib
 at org.apache.zeppelin.interpreter.ClassloaderInterpreter.getScheduler(ClassloaderInterpreter.java:146)
 at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getScheduler(LazyOpenInterpreter.java:115)
 at org.apache.zeppelin.interpreter.Interpreter.destroy(Interpreter.java:124)
 at org.apache.zeppelin.interpreter.InterpreterGroup$2.run(InterpreterGroup.java:115)
Caused by: java.lang.RuntimeException: Could not find rzeppelin - it must be in either R/lib or ../R/lib
 at org.apache.zeppelin.rinterpreter.RContext$.apply(RContext.scala:353)
 at org.apache.zeppelin.rinterpreter.RInterpreter.rContext$lzycompute(RInterpreter.scala:43)
 at org.apache.zeppelin.rinterpreter.RInterpreter.rContext(RInterpreter.scala:43)
 at org.apache.zeppelin.rinterpreter.RInterpreter.getScheduler(RInterpreter.scala:80)
 at org.apache.zeppelin.rinterpreter.RRepl.getScheduler(RRepl.java:93)
 at org.apache.zeppelin.interpreter.ClassloaderInterpreter.getScheduler(ClassloaderInterpreter.java:144)
 ... 3 more

The way I solved it was by running the Zeppelin service from $ZEPPELIN_HOME. So that users can start the Zeppelin service themselves, I have created a script:

#!/bin/bash
# start the Zeppelin daemon from its home directory so that R/lib can be found
export ZEPPELIN_HOME=/home/${USER}/Zeppelin-With-R
cd ${ZEPPELIN_HOME}
${ZEPPELIN_HOME}/bin/zeppelin-daemon.sh start

Now I can start and stop the Zeppelin service and start new Spark contexts with no problem.

Here is an example of my YARN applications:

zeppelin services in YARN

And here are the outputs from Zeppelin when Scala, SparkR and Hive are tested:

zeppelin user test results

 

Building Zeppelin-With-R on Spark and Zeppelin

For the needs of my employer, I am working on setting up different environments for researchers to do their statistical analyses using distributed systems.
In this post, I am going to describe how Zeppelin with R was installed using this GitHub project.

Ubuntu 14.04 Trusty is my Linux flavour. Spark 1.6.0 is manually installed on the cluster (link), which runs on OpenStack; the Hadoop (2.7.1) distribution is Hortonworks, and SparkR has been installed earlier (link).

One of the nodes in the cluster is a dedicated client node. On this node Zeppelin with R is installed.

Prerequisites

  • Spark should be installed (1.6.0 in  this case, my earlier version 1.5.2 also worked well).
  • Java – my version is java version “1.7.0_95”
  • Maven and git (how to install them)
  • The user running the Zeppelin service has to have a folder in HDFS under /user. If the user has, for example, run Spark earlier, then this folder was already created; otherwise the Spark jobs could not have run.
    Example on how to create an HDFS folder under /user and change owner:

    sudo -u hdfs hadoop fs -mkdir /user/user1
    sudo -u hdfs hadoop fs -chown user1:user1 /user/user1
    
  • Create zeppelin user
    sudo adduser zeppelin

R installation process

From the shell as root

In order to have ZeppelinR running properly some R packages have to be installed. Installing the R packages has proven to be problematic if some packages are not installed as root user first.

  1. Install Node.js package manager
    sudo apt-get install npm -y
  2. The following packages need to be installed for the R package devtools installation to go through properly.
    sudo apt-get install libcurl4-openssl-dev -y
    sudo apt-get install libxml2-dev -y
  3. Later on, when R package dplyr is being installed, some warnings pop out. Just to be on the safe side these two packages should be installed.
    sudo apt-get install r-cran-rmysql -y
    sudo apt-get install libpq-dev -y
  4. For successfully installing Cairo package in R, the following two should be installed.
    sudo apt-get install libcairo2-dev -y
    sudo apt-get install libxt-dev libxaw7-dev -y
  5. To install IRkernel/repr later the following package needs to be installed.
    sudo apt-get install libzmq3-dev -y

From R as root

  1. Run R as root.
    sudo R
  2. Install the following packages in R:
    install.packages("evaluate", dependencies = TRUE, repos='http://cran.us.r-project.org')
    install.packages("base64enc", dependencies = TRUE, repos='http://cran.us.r-project.org')
    install.packages("devtools", dependencies = TRUE, repos='http://cran.us.r-project.org')
    install.packages("Cairo", dependencies = TRUE, repos='http://cran.us.r-project.org')

    (The reason why I am running one package at a time is to control what is going on when package is being installed)

  3. Load devtools for github command
    library(devtools)
  4. Install IRkernel/repr package
    install_github('IRkernel/repr')
  5. Install these packages
    install.packages("dplyr", dependencies = TRUE, repos='http://cran.us.r-project.org')
    install.packages("caret", dependencies = TRUE, repos='http://cran.us.r-project.org')
    install.packages("repr", dependencies = TRUE, repos='http://irkernel.github.io/')
  6. Install R interface to Google Charts API
    install.packages('googleVis', dependencies = TRUE, repos='http://cran.us.r-project.org')
  7. Exit R
    q()

Zeppelin installation process

Hortonworks installs its Hadoop under /usr/hdp. I decided to follow the pattern and install Apache services under /usr/apache.

  1. Create log folder for Zeppelin log files
    sudo mkdir /var/log/zeppelin
    sudo chown zeppelin:zeppelin /var/log/zeppelin
  2. Go to /usr/apache (or wherever your home to ZeppelinR is going to be) and clone the github project.
    sudo git clone https://github.com/elbamos/Zeppelin-With-R

    Zeppelin-With-R folder is now created. This is going to be ZeppelinR’s home. In my case this would be /usr/apache/Zeppelin-With-R.

  3. Change the ownership of the folder
    sudo chown -R zeppelin:zeppelin Zeppelin-With-R
  4. Adding global variable ZEPPELIN_HOME
    Open the environment file

    sudo vi /etc/environment

    And add the variable

    export ZEPPELIN_HOME=/usr/apache/Zeppelin-With-R

    Save and exit the file and do not forget to reload it.

    source /etc/environment
  5. Change user to zeppelin (or whoever is going to build the Zeppelin-With-R)
    su zeppelin
  6. Make sure you are in $ZEPPELIN_HOME and build Zeppelin-With-R
    mvn clean package -Pspark-1.6 -Dspark.version=1.6.0 -Dhadoop.version=2.7.1 -Phadoop-2.6 -Pyarn -Ppyspark -DskipTests
  7. Initial information before build starts
    zeppelin with r build start
    R interpreter is on the list.
  8. Successful build
    zeppelin with r build end
    ZeppelinR is now installed. The next step is configuration.

Configuring ZeppelinR

  1. Copy and modify hive-site.xml (as root): from Hive’s conf folder, copy the hive-site.xml file to $ZEPPELIN_HOME/conf.
    sudo cp /etc/hive/conf/hive-site.xml $ZEPPELIN_HOME/conf/
  2. Change the owner of the file to zeppelin:zeppelin.
    sudo chown zeppelin:zeppelin $ZEPPELIN_HOME/conf/hive-site.xml
  3. Log in as zeppelin and modify the hive-site.xml file.
    vi $ZEPPELIN_HOME/conf/hive-site.xml

    Remove the “s” from the values of the properties hive.metastore.client.connect.retry.delay and
    hive.metastore.client.socket.timeout to avoid a number format exception.

  4. Create folder for Zeppelin pid.
    mkdir $ZEPPELIN_HOME/run
  5. Create zeppelin-site.xml and zeppelin-env.sh from respective templates.
    cp $ZEPPELIN_HOME/conf/zeppelin-site.xml.template $ZEPPELIN_HOME/conf/zeppelin-site.xml
    cp $ZEPPELIN_HOME/conf/zeppelin-env.sh.template $ZEPPELIN_HOME/conf/zeppelin-env.sh
  6. Open $ZEPPELIN_HOME/conf/zeppelin-env.sh and add:
    export SPARK_HOME=/usr/apache/spark-1.6.0-bin-hadoop2.6
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.4.0-3485"
    export ZEPPELIN_LOG_DIR=/var/log/zeppelin
    export ZEPPELIN_PID_DIR=${ZEPPELIN_HOME}/run

    The -Dhdp.version parameter should match your Hortonworks distribution version.
    Save and exit the file.

Zeppelin-With-R is now ready for use. Start it by running

$ZEPPELIN_HOME/bin/zeppelin-daemon.sh start

How to configure Zeppelin interpreters is described in this post.

 

Note!
My experience shows that if you run ZeppelinR as the zeppelin user, you will not be able to use the spark.r functionality. The error message I am getting is unable to start device X11cairo. The reason is a lack of permissions on certain files within the R installation. This is something I still have to figure out; for now, running as root does the trick.

http://zeppelin-server:8080 takes you to the Zeppelin Web UI. How to configure the Spark and Hive interpreters is described in this post.

When the interpreters are configured, a notebook called RInterpreter is available for testing.

 

SparkContext allocates random ports. How to control the port allocation.

When a SparkContext is being created, a bunch of random ports are allocated to run the Spark services. This can be annoying when you have security groups to think of.

Note!
A more detailed post on this topic is here.

Here is an example of how random ports are allocated when Spark service is started:

spark ports random

The only sure bet is 4040 (or 404x, depending on how many Spark Web UIs have already been started).

On the Apache Spark website, under Configuration, section Networking, six port properties have the default value random. These are the properties that have to be tamed.

spark port random printscreen

(The only 6 properties with default value random among all Spark properties)

Solution

Open $SPARK_HOME/conf/spark-defaults.conf:

sudo -u spark vi conf/spark-defaults.conf

The following properties should be added:

spark.blockManager.port    38000
spark.broadcast.port       38001
spark.driver.port          38002
spark.executor.port        38003
spark.fileserver.port      38004
spark.replClassServer.port 38005

I have picked port range 38000-38005 for my Spark services.

If I run a Spark service now, the ports in use are as defined in the configuration file:

spark ports tamed

Building Apache Zeppelin 0.6.0 on Spark 1.5.2 & 1.6.0 in a cluster mode

I have Ubuntu 14.04 Trusty and a multinode Hadoop cluster. The Hadoop distribution is Hortonworks 2.3.4. Spark is installed through the Ambari Web UI and the running version is 1.5.2 (later upgraded to 1.6.0).

I am going to explain how I built and set up Apache Zeppelin 0.6.0 on Spark 1.5.2 and 1.6.0.

Prerequisites

Non root account

The Apache Zeppelin creators recommend not using the root account. For this service, I have created a new user, zeppelin.

Java 7

Zeppelin uses Java 7. My system has Java 8, so I have installed Java 7 just for Zeppelin. The installation, done as user zeppelin, is in the following directory.

/home/zeppelin/prerequisities/jdk1.7.0_79

JAVA_HOME is added to the user’s bashrc.

export JAVA_HOME=/home/zeppelin/prerequisities/jdk1.7.0_79

Zeppelin log directory

Create zeppelin log directory.

sudo mkdir /var/log/zeppelin

Change ownership.

sudo chown zeppelin:zeppelin /var/log/zeppelin

If this is not done, Zeppelin's log files are written to a logs folder inside the current directory.

Clone and Build

Log in as user zeppelin and go to the user's home directory.

/home/zeppelin

Clone the source code from github.

git clone https://github.com/apache/incubator-zeppelin.git incubator-zeppelin

Zeppelin has a home now.

/home/zeppelin/incubator-zeppelin

Go into Zeppelin home and build Zeppelin

mvn clean package -Pspark-1.5 -Dspark.version=1.5.2 -Dhadoop.version=2.7.1 -Phadoop-2.6 -Pyarn -DskipTests

Build order.

zeppelin build start

7:31 minutes later, Zeppelin is successfully built.

zeppelin build success

Note!

If you try with something like the following 2 examples:

mvn clean package -Pspark-1.5 -Dspark.version=1.5.0 -Dhadoop.version=2.7.1 -Phadoop-2.7 -Pyarn -DskipTests
mvn clean package -Pspark-1.5 -Dspark.version=1.5.2 -Dhadoop.version=2.7.1 -Phadoop-2.7 -Pyarn -DskipTests

Build will succeed, but this warning will appear at the bottom of Build report:

[WARNING] The requested profile “hadoop-2.7” could not be activated because it does not exist.

The Hadoop profile passed to Maven must be hadoop-2.6, even though the actual Hadoop version is 2.7.x.

hive-site.xml

Copy hive-site.xml from the Hive conf folder (this is done on the Hortonworks distribution; users of other distributions should check where the file is located).

sudo cp /etc/hive/conf/hive-site.xml $ZEPPELIN_HOME/conf

Change ownership of the file.

sudo chown zeppelin:zeppelin $ZEPPELIN_HOME/conf/hive-site.xml

zeppelin-env.sh

Go to Zeppelin home and create zeppelin-env.sh by using the template in conf directory.

cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh

Open it and add the following variables:

export JAVA_HOME=/home/zeppelin/prerequisities/jdk1.7.0_79
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.4.0-3485"
export ZEPPELIN_LOG_DIR=/var/log/zeppelin

The variable in the third line depends on the Hortonworks build. Find your hdp version by executing

hdp-select status hadoop-client

If your Hortonworks version is 2.3.4, the output is:

hadoop-client – 2.3.4.0-3485

Zeppelin daemon

Start Zeppelin from Zeppelin home

./bin/zeppelin-daemon.sh start

Status after starting the daemon:

zeppelin start

One can check if service is up:

./bin/zeppelin-daemon.sh status

Status:

zeppelin status

Zeppelin can be restarted in the following way:

./bin/zeppelin-daemon.sh restart

Status:

zeppelin restart

Stopping Zeppelin:

./bin/zeppelin-daemon.sh stop

Status:

zeppelin stop

Configuring interpreters in Zeppelin

Apache Zeppelin comes with many default interpreters. It is also possible to create your own interpreters. How to configure default Spark and Hive interpreters is covered in this post.

Installing Apache Spark 1.6.x on a multinode cluster

I am running an HDP 2.3.4 multinode cluster with Ubuntu Trusty 14.04 on all my nodes. The Spark in this post is installed on my client node. My cluster has HDFS and YARN, among other services, all installed from Ambari. This is not the case for Apache Spark 1.6, because Hortonworks does not offer Spark 1.6 on HDP 2.3.4.

The documentation on Spark version 1.6 is here.

My post on setting up Apache Spark 2.0.0.

Prerequisites

Java

Update and upgrade the system and install Java

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk -y

Add JAVA_HOME in the system variables file

sudo vi /etc/environment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Spark user

Create user spark and add it to group hadoop

sudo adduser spark
sudo usermod -a -G hadoop spark

HDFS home directory for user spark

Create spark’s user folder in HDFS

sudo -u hdfs hadoop fs -mkdir -p /user/spark
sudo -u hdfs hadoop fs -chown -R spark:hdfs /user/spark

Spark installation and configuration

Install Spark

Create the directory where the Spark directory is going to reside. Hortonworks installs its services under /usr/hdp; I am following their lead, so I am installing all Apache services under /usr/apache. Create the directory and step into it.

sudo mkdir /usr/apache
cd /usr/apache

Download Spark 1.6.0 from https://spark.apache.org/downloads.html. I have Hadoop 2.7.1; the package pre-built for Hadoop 2.6 does the trick.

sudo wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz

Unpack the tar file

sudo tar -xvzf spark-1.6.0-bin-hadoop2.6.tgz

Remove the tar file after it has been unpacked

sudo rm spark-1.6.0-bin-hadoop2.6.tgz

Change the ownership of the folder and its elements

sudo chown -R spark:spark spark-1.6.0-bin-hadoop2.6

Update system variables

Step into the spark 1.6.0 directory and run pwd to get full path

cd spark-1.6.0-bin-hadoop2.6
pwd

Update the system environment file by adding SPARK_HOME and adding $SPARK_HOME/bin to the PATH

sudo vi /etc/environment

export SPARK_HOME=/usr/apache/spark-1.6.0-bin-hadoop2.6

At the end of PATH add

${SPARK_HOME}/bin

Refresh the system environments

source /etc/environment

Change the owner of $SPARK_HOME to spark

sudo chown -R spark:spark $SPARK_HOME

Log and pid directories

Create log and pid directories

sudo mkdir /var/log/spark
sudo chown spark:spark /var/log/spark
sudo -u spark mkdir $SPARK_HOME/run

Spark configuration files

Hive configuration

sudo -u spark vi $SPARK_HOME/conf/hive-site.xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Make sure that <value> points to the Hive Metastore URI in your cluster -->
    <value>thrift://hive-server:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>

Spark environment file

Create a new file under $SPARK_HOME/conf

sudo vi conf/spark-env.sh

Add the following lines and adjust accordingly.

export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=${SPARK_HOME}/run
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_SUBMIT_OPTIONS="--jars ${SPARK_HOME}/lib/spark-csv_2.10-1.4.0.jar"

The last line serves as an example of how to add external libraries to Spark. This particular package is quite common, and it is advisable to install it. The package can be downloaded from this site.
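
As a quick check that the package is actually available, a short PySpark sketch can read a CSV file through it. The HDFS path below is only a placeholder; when submitting a script instead of using a shell, pass the jar with --jars or use the spark.jars.packages property shown in spark-defaults.conf below.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-check")
sqlContext = SQLContext(sc)

# com.databricks.spark.csv comes from the external library added above
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/tmp/example.csv")

df.printSchema()
sc.stop()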

Spark default file

Fetch HDP version

hdp-select status hadoop-client | awk '{print $3;}'

Example output

2.3.4.0-3485

Create spark-defaults.conf file in $SPARK_HOME/conf

sudo -u spark vi conf/spark-defaults.conf

Add the following and adjust accordingly (some properties belong to Spark History Server whose configuration is explained in the post in the link below)

spark.driver.extraJavaOptions -Dhdp.version=2.3.4.0-3485
spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.4.0-3485
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080

spark.history.kerberos.keytab none
spark.history.kerberos.principal none

spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address spark-server:18080
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3

spark.jars.packages com.databricks:spark-csv_2.10:1.4.0

spark.io.compression.codec lzf

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

The ports are defined in this configuration file. If they are not, then Spark assigns random ports. More on ports and assigning them in Spark can be found here.

JAVA OPTS

Create java-opts file in $SPARK_HOME/conf and add your HDP version

sudo -u spark vi conf/java-opts

-Dhdp.version=2.4.0.0-169

Fixing links in Ubuntu

Since the Hadoop distribution is Hortonworks and Spark comes from Apache, a small workaround is needed. Remove the default link and create new ones:

sudo rm /usr/hdp/current/spark-client
sudo ln -s /usr/apache/spark-1.6.0-bin-hadoop2.6 /usr/hdp/current/spark-client
sudo ln -s /usr/hdp/current/spark-client/bin/sparkR /usr/bin/sparkR

Spark 1.6.0 is now ready.
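
A short PySpark script can serve as a smoke test of the new installation (a minimal sketch; it can be submitted with, for example, spark-submit --master yarn-client smoke-test.py):

"""smoke-test.py - minimal check that the installation can run a job"""
from pyspark import SparkContext

sc = SparkContext(appName="smoke-test")
# distribute a small range across the cluster and count it
print(sc.parallelize(range(1000)).count())
sc.stop()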

How Spark History Server is configured and brought to life is explained here.

Installing R on Hadoop cluster to run sparkR

The environment

Ubuntu 14.04 Trusty is the operating system. Hortonworks Hadoop distribution is used to install the multinode cluster.

The following process of installation has been successful for the following versions:

  • Spark 1.4.1
  • Spark 1.5.2
  • Spark 1.6.0

Prerequisites

This post assumes Spark is already installed on the system. How this can be done is explained in one of my posts here.

Setting up sparkR

If command sparkR ($SPARK_HOME/bin/sparkR) is run right after Spark installation is complete, the following error is returned:

env: R: No such file or directory

R packages have to be installed on all nodes in order for sparkR to work properly. The Spark 1.6 Technical Preview on the Hortonworks website points to instructions on how to set up R on Linux (Installing R on Linux). In my experience the process was somewhat different.

Again, this process should be done on all nodes.

  1. The Ubuntu archives on CRAN are signed with the key of “Michael Rutter marutter@gmail.com” with key ID E084DAB9 (link)
    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
  2. Now add the key to apt.
    gpg -a --export E084DAB9 | sudo apt-key add -
  3. Fetch Linux codename
    LVERSION=`lsb_release -c | cut -c 11-`

    In this case, it is trusty.

  4. From Cran – Mirrors or CRAN Mirrors US, find a suitable CRAN mirror and use it in the next step. In this case, a CRAN mirror from Austria is used – cran.wu.ac.at.
  5. Append the repository to the system’s sources.list file
    echo deb https://cran.wu.ac.at/bin/linux/ubuntu $LVERSION/ | sudo tee -a /etc/apt/sources.list
    

    The line added to the sources.list should look something like this:

    deb https://cran.wu.ac.at/bin/linux/ubuntu trusty/

  6. Update the system
    sudo apt-get update
  7. Install r-base package
    sudo apt-get install r-base -y
  8. Install r-base-dev package
    sudo apt-get install r-base-dev -y
  9. In order to avoid the following warning when a Spark service is started:

    WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.

    Add the following export to the system environment file:

    export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native
  10. Test the installation by starting R
    R

    Something like this should show up and R should start.

    R version 3.2.3 (2015-12-10) — “Wooden Christmas-Tree”

  11. Exit R.
    q()

R is now installed on the first node. Steps 1 to 7 should be repeated on all nodes in the cluster.

Testing sparkR

With R installed on all nodes, we can test sparkR from the node where the Spark client is installed:

cd $SPARK_HOME
./bin/sparkR

Hello message from Spark invites the user to the sparkR environment.

sparkR-hello-window

Output when starting SparkR

R version 3.2.3 (2015-12-10) — “Wooden Christmas-Tree”
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type ‘license()’ or ‘licence()’ for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type ‘contributors()’ for more information and
‘citation()’ on how to cite R or R packages in publications.

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or
‘help.start()’ for an HTML browser interface to help.
Type ‘q()’ to quit R.

Launching java with spark-submit command /usr/hdp/2.3.4.0-3485/spark/bin/spark-submit “sparkr-shell” /tmp/RtmpN1gWPL/backend_port18f039dadb86
16/02/22 22:12:40 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:41 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:41 INFO SparkContext: Running Spark version 1.5.2
16/02/22 22:12:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
16/02/22 22:12:41 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:41 INFO SecurityManager: Changing view acls to: ubuntu
16/02/22 22:12:41 INFO SecurityManager: Changing modify acls to: ubuntu
16/02/22 22:12:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
16/02/22 22:12:42 INFO Slf4jLogger: Slf4jLogger started
16/02/22 22:12:42 INFO Remoting: Starting remoting
16/02/22 22:12:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.x.xxx.xxx:59112]
16/02/22 22:12:42 INFO Utils: Successfully started service ‘sparkDriver’ on port 59112.
16/02/22 22:12:42 INFO SparkEnv: Registering MapOutputTracker
16/02/22 22:12:42 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:42 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:42 INFO SparkEnv: Registering BlockManagerMaster
16/02/22 22:12:42 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-aeca25e3-dc7e-4750-95c6-9a21c6bc60fd
16/02/22 22:12:42 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/02/22 22:12:42 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:42 INFO HttpFileServer: HTTP File server directory is /tmp/spark-009d52a3-de7e-4fa1-bea3-97f3ef34ebbe/httpd-c9182234-dec7-438d-8636-b77930ed5f62
16/02/22 22:12:42 INFO HttpServer: Starting HTTP Server
16/02/22 22:12:42 INFO Server: jetty-8.y.z-SNAPSHOT
16/02/22 22:12:42 INFO AbstractConnector: Started SocketConnector@0.0.0.0:51091
16/02/22 22:12:42 INFO Utils: Successfully started service ‘HTTP file server’ on port 51091.
16/02/22 22:12:42 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/22 22:12:43 INFO Server: jetty-8.y.z-SNAPSHOT
16/02/22 22:12:43 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/02/22 22:12:43 INFO Utils: Successfully started service ‘SparkUI’ on port 4040.
16/02/22 22:12:43 INFO SparkUI: Started SparkUI at http://10.x.x.108:4040
16/02/22 22:12:43 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:43 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:43 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.
16/02/22 22:12:43 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/02/22 22:12:43 INFO Executor: Starting executor ID driver on host localhost
16/02/22 22:12:43 INFO Utils: Successfully started service ‘org.apache.spark.network.netty.NettyBlockTransferService’ on port 41193.
16/02/22 22:12:43 INFO NettyBlockTransferService: Server created on 41193
16/02/22 22:12:43 INFO BlockManagerMaster: Trying to register BlockManager
16/02/22 22:12:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:41193 with 530.0 MB RAM, BlockManagerId(driver, localhost, 41193)
16/02/22 22:12:43 INFO BlockManagerMaster: Registered BlockManager
16/02/22 22:12:43 WARN SparkConf: The configuration key ‘spark.yarn.applicationMaster.waitTries’ has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key ‘spark.yarn.am.waitTime’ instead.

Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ ‘_/
/___/ .__/\_,_/_/ /_/\_\ version 1.5.2
/_/
Spark context is available as sc, SQL context is available as sqlContext
>

 

If it happens you do not get the prompt displayed:

16/02/22 22:52:12 INFO ShutdownHookManager: Shutdown hook called
16/02/22 22:52:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-009d52a3-de7e-4fa1-bea3-97f3ef34ebbe

Just press Enter.

 

Possible errors

No public key

W: GPG error: https://cran.wu.ac.at trusty/ Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 51716619E084DAB9

Step 1 was not performed.

Hash Sum mismatch

When updating Ubuntu the following error message shows up:

Hash Sum mismatch

Solution:

sudo rm -rf /var/lib/apt/lists/*
sudo apt-get clean
sudo apt-get update

Resources

1 – Spark 1.6 Technical Preview

2 – Installing R on Linux

3 – CRAN – Mirrors

4 – CRAN Secure APT