For my employer, I am setting up different environments in which researchers can run their statistical analyses on distributed systems.
In this post, I am going to describe how Zeppelin with R was installed using this github project.
Ubuntu 14.04 Trusty is my Linux flavour. Spark 1.6.0 was installed manually on the cluster (link), which runs on OpenStack. The Hadoop (2.7.1) distribution is Hortonworks, and SparkR was installed earlier (link).
One of the nodes in the cluster is a dedicated client node. On this node Zeppelin with R is installed.
Prerequisites
- Spark should be installed (1.6.0 in this case, my earlier version 1.5.2 also worked well).
- Java – my version is java version “1.7.0_95”
- Maven and git (how to install them)
- The user running the Zeppelin service must have a folder in HDFS under /user. If the user has, for example, already run Spark, this folder exists, because Spark services cannot run without it.
Example of how to create an HDFS folder under /user and change its owner:
sudo -u hdfs hadoop fs -mkdir /user/user1
sudo -u hdfs hadoop fs -chown user1:user1 /user/user1
- Create zeppelin user
sudo adduser zeppelin
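Before going further, it may help to confirm that the command-line prerequisites are actually on the PATH. Below is a minimal sketch, assuming the tools are invoked as java, mvn and git. The function name missing_tools is made up for this post and is not part of Zeppelin.

```shell
#!/bin/sh
# Print the name of every given command that cannot be found on the PATH.
# missing_tools is a hypothetical helper name used only in this post.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Prints nothing when java, mvn and git are all installed.
missing_tools java mvn git
```

Any name that gets printed still has to be installed before the build step.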
R installation process
From the shell as root
In order to have ZeppelinR running properly some R packages have to be installed. Installing the R packages has proven to be problematic if some packages are not installed as root user first.
- Install Node.js package manager
sudo apt-get install npm -y
- The following packages need to be installed for the R package devtools installation to go through properly.
sudo apt-get install libcurl4-openssl-dev -y
sudo apt-get install libxml2-dev -y
- Later on, when the R package dplyr is being installed, some warnings pop up. Just to be on the safe side, these two packages should be installed.
sudo apt-get install r-cran-rmysql -y
sudo apt-get install libpq-dev -y
- To install the Cairo package in R successfully, the following packages should be installed.
sudo apt-get install libcairo2-dev -y
sudo apt-get install libxt-dev libxaw7-dev -y
- To install IRkernel/repr later the following package needs to be installed.
sudo apt-get install libzmq3-dev -y
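The system packages from the steps above can be collected in one place, which also makes it easy to check that every -y is typed with a plain ASCII hyphen. The sketch below only prints the commands and runs nothing. The names SYSTEM_PACKAGES and print_install_commands are my own and purely illustrative.

```shell
#!/bin/sh
# System packages named in the steps above, collected in one list.
SYSTEM_PACKAGES="npm libcurl4-openssl-dev libxml2-dev r-cran-rmysql libpq-dev libcairo2-dev libxt-dev libxaw7-dev libzmq3-dev"

# Print one apt-get command per package. Nothing is installed here.
# print_install_commands is a hypothetical helper name.
print_install_commands() {
    for pkg in $SYSTEM_PACKAGES; do
        echo "sudo apt-get install $pkg -y"
    done
}

print_install_commands
```

Printing one command per package keeps the one-at-a-time approach, so it stays clear which package a failure belongs to. To actually run them, the output could be piped to sh.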
From R as root
- Run R as root.
sudo R
- Install the following packages in R:
install.packages("evaluate", dependencies = TRUE, repos='http://cran.us.r-project.org')
install.packages("base64enc", dependencies = TRUE, repos='http://cran.us.r-project.org')
install.packages("devtools", dependencies = TRUE, repos='http://cran.us.r-project.org')
install.packages("Cairo", dependencies = TRUE, repos='http://cran.us.r-project.org')
(I install one package at a time so that I can follow what happens during each installation.)
- Load devtools for github command
library(devtools)
- Install IRkernel/repr package
install_github('IRkernel/repr')
- Install these packages
install.packages("dplyr", dependencies = TRUE, repos='http://cran.us.r-project.org')
install.packages("caret", dependencies = TRUE, repos='http://cran.us.r-project.org')
install.packages("repr", dependencies = TRUE, repos='http://irkernel.github.io/')
- Install R interface to Google Charts API
install.packages('googleVis', dependencies = TRUE, repos='http://cran.us.r-project.org')
- Exit R
q()
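After leaving R, it can be worth checking from the shell that the packages installed above can be found. This is a rough sketch, assuming Rscript (which ships with R) is on the PATH. The helper name r_package_installed and the RSCRIPT variable are my own additions.

```shell
#!/bin/sh
# Path to the Rscript binary; override RSCRIPT if it lives elsewhere (assumption).
RSCRIPT="${RSCRIPT:-Rscript}"

# Succeed (exit status 0) if R can find the named package, fail otherwise.
# r_package_installed is a hypothetical helper name used only in this post.
r_package_installed() {
    "$RSCRIPT" -e "quit(status = if (requireNamespace('$1', quietly = TRUE)) 0 else 1)" >/dev/null 2>&1
}

# Report every package from the steps above that R cannot find.
for pkg in evaluate base64enc devtools Cairo repr dplyr caret googleVis; do
    r_package_installed "$pkg" || echo "missing: $pkg"
done
```

If the loop prints nothing, all packages were found. Run it as the same user that will run Zeppelin, since library paths can differ between users.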
Zeppelin installation process
Hortonworks installs its Hadoop under /usr/hdp. I decided to follow the pattern and install Apache services under /usr/apache.
- Create log folder for Zeppelin log files
sudo mkdir /var/log/zeppelin
sudo chown zeppelin:zeppelin /var/log/zeppelin
- Go to /usr/apache (or wherever your home to ZeppelinR is going to be) and clone the github project.
sudo git clone https://github.com/elbamos/Zeppelin-With-R
Zeppelin-With-R folder is now created. This is going to be ZeppelinR’s home. In my case this would be /usr/apache/Zeppelin-With-R.
- Change the ownership of the folder
sudo chown -R zeppelin:zeppelin Zeppelin-With-R
- Adding global variable ZEPPELIN_HOME
Open the environment file
sudo vi /etc/environment
And add the variable
export ZEPPELIN_HOME=/usr/apache/Zeppelin-With-R
Save and exit the file and do not forget to reload it.
source /etc/environment
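If you would rather not edit the file by hand, the same line can be appended from a script. The sketch below adds the export line only when the file does not already set the variable, so running it twice does no harm. The helper name set_env_var is hypothetical, and writing to /etc/environment requires root.

```shell
#!/bin/sh
# Append "export NAME=VALUE" to a file unless the file already sets NAME.
# set_env_var is a hypothetical helper name used only in this post.
set_env_var() {
    file=$1
    name=$2
    value=$3
    if grep -q "^export $name=" "$file" 2>/dev/null; then
        return 0
    fi
    echo "export $name=$value" >> "$file"
}

# Example, commented out because it has to be run as root:
# set_env_var /etc/environment ZEPPELIN_HOME /usr/apache/Zeppelin-With-R
```

The file still has to be reloaded afterwards, exactly as with the manual edit.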
- Change user to zeppelin (or whoever is going to build the Zeppelin-With-R)
su zeppelin
- Make sure you are in $ZEPPELIN_HOME and build Zeppelin-With-R
mvn clean package -Pspark-1.6 -Dspark.version=1.6.0 -Dhadoop.version=2.7.1 -Phadoop-2.6 -Pyarn -Ppyspark -DskipTests
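The versions in the build command have to agree with what is installed on the cluster. Here is a small sketch that assembles the same command from variables, so the versions are stated only once. Deriving the Spark profile from the first two parts of the version works for 1.6.0, but whether a matching profile exists for other versions is an assumption to check against the project's pom.xml. The variable and function names are mine.

```shell
#!/bin/sh
# Versions used in this post; adjust them to your cluster.
SPARK_VERSION=1.6.0
HADOOP_VERSION=2.7.1
HADOOP_PROFILE=hadoop-2.6

# Print the Maven command line without running it.
# The Spark profile is derived from the version, for example 1.6.0 -> spark-1.6.
# zeppelin_build_command is a hypothetical helper name.
zeppelin_build_command() {
    spark_profile="spark-$(echo "$SPARK_VERSION" | cut -d. -f1,2)"
    echo "mvn clean package -P$spark_profile -Dspark.version=$SPARK_VERSION -Dhadoop.version=$HADOOP_VERSION -P$HADOOP_PROFILE -Pyarn -Ppyspark -DskipTests"
}

zeppelin_build_command
```

With the values above, the printed line is the command used in this post.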
- Initial information before build starts
R interpreter is on the list.
- Successful build
ZeppelinR is now installed. The next step is configuration.
Configuring ZeppelinR
- Copying and modifying hive-site.xml (as root). From Hive’s conf folder, copy the hive-site.xml file to $ZEPPELIN_HOME/conf.
sudo cp /etc/hive/conf/hive-site.xml $ZEPPELIN_HOME/conf/
- Change the owner of the file to zeppelin:zeppelin.
sudo chown zeppelin:zeppelin $ZEPPELIN_HOME/conf/hive-site.xml
- Log in as zeppelin and modify the hive-site.xml file.
vi $ZEPPELIN_HOME/conf/hive-site.xml
Remove the “s” from the values of the properties hive.metastore.client.connect.retry.delay and hive.metastore.client.socket.timeout to avoid a number format exception.
- Create folder for Zeppelin pid.
mkdir $ZEPPELIN_HOME/run
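The hive-site.xml edit described above (removing the trailing “s” from the two metastore properties) can also be scripted with sed instead of vi. This is a sketch under one assumption: the name and value tags of each property sit on separate lines, as in the file I copied. If a whole property is written on a single line, edit it by hand instead. The function name strip_seconds_suffix is made up.

```shell
#!/bin/sh
# Strip a trailing "s" from the numeric value of the two metastore properties.
# Assumes <name> and <value> are on separate lines inside each <property>.
# Reads XML on stdin and writes the adjusted XML to stdout.
# strip_seconds_suffix is a hypothetical helper name.
strip_seconds_suffix() {
    sed -E '/<name>hive\.metastore\.client\.(connect\.retry\.delay|socket\.timeout)<\/name>/,/<\/property>/ s#<value>([0-9]+)s</value>#<value>\1</value>#'
}

# Example, commented out; it keeps a backup of the original file:
# cp "$ZEPPELIN_HOME/conf/hive-site.xml" "$ZEPPELIN_HOME/conf/hive-site.xml.bak"
# strip_seconds_suffix < "$ZEPPELIN_HOME/conf/hive-site.xml.bak" > "$ZEPPELIN_HOME/conf/hive-site.xml"
```

Other properties are left untouched, even when their values end in “s”.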
- Create zeppelin-site.xml and zeppelin-env.sh from respective templates.
cp $ZEPPELIN_HOME/conf/zeppelin-site.xml.template $ZEPPELIN_HOME/conf/zeppelin-site.xml
cp $ZEPPELIN_HOME/conf/zeppelin-env.sh.template $ZEPPELIN_HOME/conf/zeppelin-env.sh
- Open $ZEPPELIN_HOME/conf/zeppelin-env.sh and add:
export SPARK_HOME=/usr/apache/spark-1.6.0-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.4.0-3485"
export ZEPPELIN_LOG_DIR=/var/log/zeppelin
export ZEPPELIN_PID_DIR=${ZEPPELIN_HOME}/run
The value of the -Dhdp.version parameter should match your Hortonworks distribution version.
Save and exit the file.
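If you are not sure which value to use for hdp.version, the directory names under /usr/hdp are a reasonable place to look. The sketch below assumes that the Hortonworks installation keeps one directory per installed version there, named like 2.3.4.0-3485. That matches my cluster but is not guaranteed elsewhere. The function name list_hdp_versions is my own.

```shell
#!/bin/sh
# List directory names that look like an HDP version (for example 2.3.4.0-3485).
# Assumes one directory per installed version under the base directory.
# list_hdp_versions is a hypothetical helper name used only in this post.
list_hdp_versions() {
    base=${1:-/usr/hdp}
    ls "$base" 2>/dev/null | grep -E '^[0-9]+(\.[0-9]+)+-[0-9]+$'
}

list_hdp_versions || echo "no HDP version directory found under /usr/hdp"
```

If more than one version is listed, use the one your cluster is currently running.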
Zeppelin-With-R is now ready for use. Start it by running
$ZEPPELIN_HOME/bin/zeppelin-daemon.sh start
How to configure Zeppelin interpreters is described in this post.
Note!
My experience shows that if you run ZeppelinR as the zeppelin user, you will not be able to use the spark.r functionality. The error message I get is “unable to start device X11cairo”. The reason is a lack of permissions on certain files within the R installation. This is something I still have to figure out. For now, running as root does the trick.
http://zeppelin-server:8080 takes you to the Zeppelin Web UI. How to configure the Spark and Hive interpreters is described in this post.
When the interpreters are configured, a notebook called RInterpreter is available for testing.