Setting up RStudio Server to run with Apache Spark

I have installed R and SparkR on my Hadoop/Spark cluster; that is described in this post. I have also installed Apache Zeppelin with R to use SparkR from Zeppelin (here).
So far, I can offer my users SparkR through the CLI and Apache Zeppelin, but what they all want is one interface: RStudio. This post describes how to install RStudio Server and configure it to work with Apache Spark.

On my cluster, I am running Apache Spark 1.6.0, installed manually (installation process). Underneath it is a multi-node Hadoop cluster from Hortonworks.

RStudio Server is installed on one client node in the cluster:

  1. Update the Ubuntu package lists
    sudo apt-get update
  2. Download the RStudio Server package (make sure you are downloading RStudio Server, not the desktop client!)
    sudo wget https://download2.rstudio.org/rstudio-server-0.99.893-amd64.deb
  3. Install gdebi (about gdebi)
    sudo apt-get install gdebi-core -y
  4. Install the libjpeg62 package
    sudo apt-get install libjpeg62 -y
  5. In case you get the following error:

    You might want to run ‘apt-get -f install’ to correct these:
    The following packages have unmet dependencies:
    rstudio : Depends: libgstreamer0.10-0 but it is not going to be installed
              Depends: libgstreamer-plugins-base0.10-0 but it is not going to be installed
    E: Unmet dependencies. Try ‘apt-get -f install’ with no packages (or specify a solution).

    Run:

    sudo apt-get -f install
  6. Install RStudio Server
    sudo gdebi rstudio-server-0.99.893-amd64.deb
  7. During the installation, the following prompt appears. Type “y” and press Enter.

    RStudio is a set of integrated tools designed to help you be more productive with R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, and workspace management.
    Do you want to install the software package? [y/N]:

  8. Find your path to $SPARK_HOME
    echo $SPARK_HOME
  9. Set the SPARK_HOME environment variable in Rprofile.site
    The file should be located at /usr/lib/R/etc/Rprofile.site. Open it and append the following line (adjusting the path to wherever your Spark home is):

    Sys.setenv(SPARK_HOME="/usr/apache/spark-1.6.0-bin-hadoop2.6")
  10. Restart RStudio Server
    sudo rstudio-server restart
  11. RStudio Server with Spark is now installed and can be accessed at

    http://rstudio-server:8787

  12. Log in with a Unix user (if you do not have one, run sudo adduser user1). The user cannot be root or have a UID lower than 100.
  13. Load the SparkR library
    library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
  14. Define the SparkContext environment values (passed as the sparkEnvir parameter when creating the SparkContext in the next step). They can be adjusted to the needs of the cluster and its users.
    spark_env <- list('spark.executor.memory' = '4g',
                      'spark.executor.instances' = '4',
                      'spark.executor.cores' = '4',
                      'spark.driver.memory' = '4g')
  15. Create the SparkContext
    sc <- sparkR.init(master = "yarn-client",
                      appName = "RStudio",
                      sparkEnvir = spark_env,
                      sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")
  16. Create an sqlContext
    sqlContext <- sparkRSQL.init(sc)
  17. In case the SparkContext has to be initialized again, stop it first, then repeat the previous two steps (a quick sanity check of the finished setup follows this list).
    sparkR.stop()
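
Before handing the environment over to users, it is worth verifying the setup from the RStudio console. The following is a minimal sanity check, assuming steps 13 to 16 completed successfully; it uses R's built-in faithful data set, and the operations run as Spark jobs on the cluster.

    # Confirm that SPARK_HOME was picked up from Rprofile.site
    Sys.getenv("SPARK_HOME")

    # Convert the built-in 'faithful' data set into a Spark DataFrame
    df <- createDataFrame(sqlContext, faithful)
    count(df)   # row count, computed on the cluster
    head(df)    # first rows, returned as a local R data.frame

    # Register the DataFrame as a temporary table and query it with SQL
    registerTempTable(df, "faithful")
    long_eruptions <- sql(sqlContext, "SELECT * FROM faithful WHERE eruptions > 4")
    head(long_eruptions)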

The running application can be monitored and controlled from the YARN Resource Manager console:

[Screenshot: RStudio application status in the YARN Resource Manager]

SparkR in RStudio is now ready for use. In order to get a better understanding of how SparkR works with R, check this post: DataFrame vs data.frame.
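
As a quick illustration of that distinction: a Spark DataFrame is a distributed object living on the cluster, while collect() pulls its contents into an ordinary local R data.frame. The sketch below reads a CSV file through the spark-csv package loaded in step 15; the HDFS path is a placeholder, so substitute a file that exists on your cluster.

    # Read a CSV file into a distributed Spark DataFrame
    # (the path is a placeholder)
    csv_df <- read.df(sqlContext, "hdfs:///tmp/sample.csv",
                      source = "com.databricks.spark.csv",
                      header = "true", inferSchema = "true")
    class(csv_df)    # "DataFrame": distributed, stays on the cluster

    # collect() brings all rows into local R memory
    local_df <- collect(csv_df)
    class(local_df)  # "data.frame": held entirely in the local R session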
