I have installed R and SparkR on my Hadoop/Spark cluster. That is described in this post. I have also installed Apache Zeppelin with R to use SparkR with Zeppelin (here).
So far, I can offer my users SparkR through CLI and Apache Zeppelin. But they all want one interface – RStudio. This post describes how to install RStudio Server and configure it to work with Apache Spark.
On my cluster, I am running Apache Spark 1.6.0, manually installed (installation process). Underneath is a multinode Hadoop cluster from Hortonworks.
RStudio Server is installed on one client node in the cluster:
- Update the Ubuntu system
sudo apt-get update
- Download the RStudio Server package (make sure you are downloading RStudio Server, not the desktop client!)
sudo wget https://download2.rstudio.org/rstudio-server-0.99.893-amd64.deb
- Install gdebi (about gdebi)
sudo apt-get install gdebi-core -y
- Install package libjpeg62
sudo apt-get install libjpeg62 -y
- In case you get the following error:
You might want to run 'apt-get -f install' to correct these:
The following packages have unmet dependencies:
 rstudio : Depends: libgstreamer0.10-0 but it is not going to be installed
           Depends: libgstreamer-plugins-base0.10-0 but it is not going to be installed
E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).
Run:
sudo apt-get -f install
- Install RStudio Server
sudo gdebi rstudio-server-0.99.893-amd64.deb
- During the installation, the following prompt appears. Type "y" and press Enter.
RStudio is a set of integrated tools designed to help you be more productive with R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, and workspace management.
Do you want to install the software package? [y/N]:
- Find your path to $SPARK_HOME
echo $SPARK_HOME
- Set the SPARK_HOME environment variable in Rprofile.site
The file should be located at /usr/lib/R/etc/Rprofile.site. Open the Rprofile.site file and append the following line to it (adjust the path to wherever your Spark home is):
Sys.setenv(SPARK_HOME="/usr/apache/spark-1.6.0-bin-hadoop2.6")
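To quickly verify that R picks the variable up, the value can be read back from a new R session (a minimal check, using the same path as above):
Sys.getenv("SPARK_HOME")
# should return "/usr/apache/spark-1.6.0-bin-hadoop2.6"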
- Restart RStudio Server
sudo rstudio-server restart
- RStudio Server with Spark is now installed and can be accessed in a browser (by default on port 8787 of the node it is installed on).
- Log in with a Unix user (if you do not have one, run sudo adduser user1). The user cannot be root or have a user ID lower than 100.
- Load the SparkR library
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
- Define the SparkContext environment values (used for the sparkEnvir parameter when creating the SparkContext in the next step). These can be adjusted according to the cluster and the users' needs.
spark_env <- list('spark.executor.memory' = '4g',    # memory per executor
                  'spark.executor.instances' = '4',  # number of executors requested from YARN
                  'spark.executor.cores' = '4',      # cores per executor
                  'spark.driver.memory' = '4g')      # memory for the driver process
- Creating SparkContext
sc <- sparkR.init(master = "yarn-client", appName = "RStudio", sparkEnvir = spark_env, sparkPackages="com.databricks:spark-csv_2.10:1.4.0")
- Creating an SQLContext
sqlContext <- sparkRSQL.init(sc)
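As a quick sanity check (not part of the required steps), the SQLContext can be used to turn a local R data.frame into a Spark DataFrame; the CSV path in the last line is only an illustrative example for the spark-csv package loaded above:
df <- createDataFrame(sqlContext, faithful)  # distribute the built-in faithful data.frame as a Spark DataFrame
head(df)                                     # first rows, collected back as a local data.frame
count(df)                                    # row count, computed on the cluster
# Reading a CSV file from HDFS with the spark-csv package (hypothetical path):
csv <- read.df(sqlContext, "/user/user1/data.csv", source = "com.databricks.spark.csv", header = "true")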
- In case the SparkContext has to be initialized all over again, stop it first and then repeat the previous two steps (the full sequence is shown below).
sparkR.stop()
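For completeness, the whole re-initialization sequence is simply the following (same values as in the previous steps):
sparkR.stop()
sc <- sparkR.init(master = "yarn-client", appName = "RStudio", sparkEnvir = spark_env, sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")
sqlContext <- sparkRSQL.init(sc)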
The running application can be monitored and controlled in the YARN Resource Manager web console.
SparkR in RStudio is now ready for use. In order to get a better understanding of how SparkR works with R, check this post: DataFrame vs data.frame.