Installing R on a Hadoop cluster to run sparkR

The environment

The operating system is Ubuntu 14.04 (Trusty). The multinode cluster runs the Hortonworks Hadoop distribution (HDP).

This installation process has been tested successfully with the following Spark versions:

  • Spark 1.4.1
  • Spark 1.5.2
  • Spark 1.6.0

Prerequisites

This post assumes Spark is already installed on the system; how to do that is explained in one of my earlier posts.
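A quick sanity check (a minimal sketch; assumes SPARK_HOME is set for the current user):

echo $SPARK_HOME                        # should print the Spark installation directory
$SPARK_HOME/bin/spark-submit --version  # should print the Spark version, e.g. 1.5.2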

Setting up sparkR

If the sparkR command ($SPARK_HOME/bin/sparkR) is run right after the Spark installation completes, the following error is returned:

env: R: No such file or directory

R has to be installed on all nodes for sparkR to work properly. The Spark 1.6 Technical Preview on the Hortonworks website links to instructions for setting up R on Linux (Installing R on Linux), but in my experience the process differs.

Again, this process should be done on all nodes.
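If passwordless SSH (and passwordless sudo) is set up from the first node to the others, the commands can be pushed to every node in one go. This is only a sketch: node1 to node3 are placeholders for the actual host names, and install_r.sh is a consolidated script sketched after the steps below.

# Run the consolidated install script on each node over SSH.
# Assumes passwordless SSH and passwordless sudo on every node.
for NODE in node1 node2 node3; do
  ssh "$NODE" 'bash -s' < install_r.sh
done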

  1. The Ubuntu archives on CRAN are signed with the key of "Michael Rutter" (marutter@gmail.com), key ID E084DAB9. Fetch the key:
    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
  2. Now add the key to apt.
    gpg -a --export E084DAB9 | sudo apt-key add -
  3. Fetch the Ubuntu release codename
    LVERSION=$(lsb_release -cs)

    In this case, it is trusty.

  4. From the CRAN Mirrors list (or CRAN Mirrors US), pick a suitable CRAN mirror and use it in the next step. In this case, a mirror from Austria is used: cran.wu.ac.at.
  5. Append the repository to the system’s sources.list file
    echo deb https://cran.wu.ac.at/bin/linux/ubuntu $LVERSION/ | sudo tee -a /etc/apt/sources.list
    

    The line added to the sources.list should look something like this:

    deb https://cran.wu.ac.at/bin/linux/ubuntu trusty/

  6. Update the package lists
    sudo apt-get update
  7. Install r-base package
    sudo apt-get install r-base -y
  8. Install r-base-dev package
    sudo apt-get install r-base-dev -y
  9. To avoid the following warning when the Spark service is started:

    WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.

    Add the following export to the system-wide environment file (for example, /etc/profile):

    export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native
  10. Test the installation by starting R
    R

    R should start and print something like this:

    R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"

  11. Exit R.
    q()

R is now installed on the first node. Steps 1 to 9 should be repeated on all the other nodes in the cluster.
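For convenience, here is a consolidated sketch of Steps 1 to 9 that can be saved as install_r.sh and run on each remaining node (for example, with the SSH loop shown earlier). The file name, the Austrian mirror, and the choice of /etc/profile for the export are assumptions to adjust to your own cluster.

#!/bin/bash
# install_r.sh - consolidated Steps 1 to 9 from this post (sketch)

# Steps 1 and 2: import the CRAN signing key and hand it to apt
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -

# Step 3: fetch the Ubuntu release codename (trusty on Ubuntu 14.04)
LVERSION=$(lsb_release -cs)

# Steps 4 and 5: append the CRAN repository (mirror used in this post)
echo "deb https://cran.wu.ac.at/bin/linux/ubuntu $LVERSION/" | sudo tee -a /etc/apt/sources.list

# Steps 6 to 8: refresh the package lists and install R
sudo apt-get update
sudo apt-get install r-base r-base-dev -y

# Step 9: point Spark at the native Hadoop libraries (assumes the HDP layout)
echo 'export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native' | sudo tee -a /etc/profile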

Testing sparkR

Once R is installed on all nodes, sparkR can be tested from the node where the Spark client is installed:

cd $SPARK_HOME
./bin/sparkR

A hello message from Spark invites the user into the sparkR environment.


Output when starting sparkR:

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Launching java with spark-submit command /usr/hdp/2.3.4.0-3485/spark/bin/spark-submit "sparkr-shell" /tmp/RtmpN1gWPL/backend_port18f039dadb86
16/02/22 22:12:40 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/02/22 22:12:41 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/02/22 22:12:41 INFO SparkContext: Running Spark version 1.5.2
16/02/22 22:12:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/22 22:12:41 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/02/22 22:12:41 INFO SecurityManager: Changing view acls to: ubuntu
16/02/22 22:12:41 INFO SecurityManager: Changing modify acls to: ubuntu
16/02/22 22:12:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
16/02/22 22:12:42 INFO Slf4jLogger: Slf4jLogger started
16/02/22 22:12:42 INFO Remoting: Starting remoting
16/02/22 22:12:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.x.xxx.xxx:59112]
16/02/22 22:12:42 INFO Utils: Successfully started service 'sparkDriver' on port 59112.
16/02/22 22:12:42 INFO SparkEnv: Registering MapOutputTracker
16/02/22 22:12:42 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/02/22 22:12:42 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/02/22 22:12:42 INFO SparkEnv: Registering BlockManagerMaster
16/02/22 22:12:42 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-aeca25e3-dc7e-4750-95c6-9a21c6bc60fd
16/02/22 22:12:42 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/02/22 22:12:42 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/02/22 22:12:42 INFO HttpFileServer: HTTP File server directory is /tmp/spark-009d52a3-de7e-4fa1-bea3-97f3ef34ebbe/httpd-c9182234-dec7-438d-8636-b77930ed5f62
16/02/22 22:12:42 INFO HttpServer: Starting HTTP Server
16/02/22 22:12:42 INFO Server: jetty-8.y.z-SNAPSHOT
16/02/22 22:12:42 INFO AbstractConnector: Started SocketConnector@0.0.0.0:51091
16/02/22 22:12:42 INFO Utils: Successfully started service 'HTTP file server' on port 51091.
16/02/22 22:12:42 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/22 22:12:43 INFO Server: jetty-8.y.z-SNAPSHOT
16/02/22 22:12:43 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/02/22 22:12:43 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/02/22 22:12:43 INFO SparkUI: Started SparkUI at http://10.x.x.108:4040
16/02/22 22:12:43 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/02/22 22:12:43 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/02/22 22:12:43 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/02/22 22:12:43 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/02/22 22:12:43 INFO Executor: Starting executor ID driver on host localhost
16/02/22 22:12:43 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41193.
16/02/22 22:12:43 INFO NettyBlockTransferService: Server created on 41193
16/02/22 22:12:43 INFO BlockManagerMaster: Trying to register BlockManager
16/02/22 22:12:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:41193 with 530.0 MB RAM, BlockManagerId(driver, localhost, 41193)
16/02/22 22:12:43 INFO BlockManagerMaster: Registered BlockManager
16/02/22 22:12:43 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/
Spark context is available as sc, SQL context is available as sqlContext
>

 

If it happens that the prompt is not displayed and you see something like this instead:

16/02/22 22:52:12 INFO ShutdownHookManager: Shutdown hook called
16/02/22 22:52:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-009d52a3-de7e-4fa1-bea3-97f3ef34ebbe

Just press Enter.
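Beyond reaching the prompt, a small batch job confirms that R and Spark actually talk to each other. The sketch below assumes the SparkR API as shipped with Spark 1.4 to 1.6 (sparkR.init, sparkRSQL.init); the file name /tmp/sparkr_smoke.R is arbitrary. Run it from the node with the Spark client:

# Write a tiny SparkR job and submit it in batch mode
cat > /tmp/sparkr_smoke.R <<'EOF'
library(SparkR)
sc <- sparkR.init(appName = "sparkr-smoke-test")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, data.frame(x = 1:10))
print(count(df))   # should print 10
sparkR.stop()
EOF
$SPARK_HOME/bin/spark-submit /tmp/sparkr_smoke.R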

 

Possible errors

No public key

W: GPG error: https://cran.wu.ac.at trusty/ Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 51716619E084DAB9

This means Step 1 (importing the CRAN signing key) was not performed.
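The fix is to import the key and refresh the package lists, exactly as in Steps 1 and 6:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update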

Hash Sum mismatch

When updating the package lists, the following error message may show up:

Hash Sum mismatch

Solution:

sudo rm -rf /var/lib/apt/lists/*
sudo apt-get clean
sudo apt-get update

Resources

1 – Spark 1.6 Technical Preview

2 – Installing R on Linux

3 – CRAN – Mirrors

4 – CRAN Secure APT
