My Work cluster in detail

Services on the cluster

This chapter briefly describes the user-facing and administrative services in the cluster.


Spark

Spark was the key service offered in the cluster. With Spark 2.0 it became even clearer that Spark would be the cluster's computational engine. It offers Java, Scala, R and Python APIs. Spark SQL is available, Spark Streaming for data streaming has been improved, the machine learning libraries (MLlib) have been updated and extended, and GraphX is available for network analysis.
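To give an idea of how jobs reach the cluster, a Spark application is submitted with spark-submit. The sketch below runs the SparkPi example application that ships with Spark; the jar path and resource settings are assumptions and depend on the installation:

```shell
# Sketch only: the examples jar path varies by Spark version and distribution.
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  /usr/hdp/current/spark2-client/examples/jars/spark-examples*.jar 100
```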

Ambari Views

Ambari Views was offered for manual file uploads to HDFS and other file manipulation. It was a common situation that a cluster user had a large file on their disk that they wanted to run a computation on – Ambari Views has proven to do the job.
Hive was also accessible from this interface.


RStudio

After installing the needed R packages and SparkR, RStudio could be used to work with SparkR. The tool was already familiar to R users and was also more stable than Zeppelin.


Zeppelin

Zeppelin is an open-source tool that works on the principle of notebooks, like Jupyter or IPython. It is still quite new in the ASF family, but it has great potential in combination with Spark. It is great for learning and testing purposes, and it is also possible to run a cron job via Zeppelin. In the background, the interface sends spark-submit commands.


Storm

Storm was used for streaming data and near-real-time analytics. Java was the main language used for defining topologies. There were not many Java developers among the researchers, so this tool was mostly used by myself to ingest data into HDFS, MySQL and Redis.
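A topology packaged as a jar is submitted to the cluster with the storm command. The jar name, main class and topology name below are placeholders for illustration, not the actual ingestion topology used on the cluster:

```shell
# Placeholder jar, class and topology name.
storm jar ingest-topology.jar com.example.IngestTopology ingest-to-hdfs
```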


Sqoop

Sqoop was used to transfer data between HDFS and MySQL (it works with other major RDBMSs as well).
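A typical Sqoop invocation looks roughly like the sketch below; the connection string, credentials, table names and HDFS paths are placeholders:

```shell
# Pull a MySQL table into HDFS (placeholder host, database, user and table).
sqoop import \
  --connect jdbc:mysql://dbhost/researchdb \
  --username analyst -P \
  --table measurements \
  --target-dir /user/analyst/measurements \
  --num-mappers 4

# And back the other way: push HDFS results into a MySQL table.
sqoop export \
  --connect jdbc:mysql://dbhost/researchdb \
  --username analyst -P \
  --table results \
  --export-dir /user/analyst/results
```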


Command line interface

The command line interface was an alternative for more advanced users who were also familiar with Linux. The client node was the entry point to the cluster; users got a private key and their own space on the client instance.
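For users on the client node, day-to-day work was mostly a handful of hdfs dfs commands; the hostname, key path and file names below are purely illustrative:

```shell
# Log in to the client node with the private key handed out to each user.
ssh -i ~/.ssh/cluster_key user@client-node.example.org

# Typical HDFS file operations from the client node:
hdfs dfs -mkdir -p /user/$USER/input      # create a working directory
hdfs dfs -put bigfile.csv /user/$USER/input/   # upload a local file
hdfs dfs -ls /user/$USER/input            # verify the upload
```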


Ambari

Ambari is the tool for monitoring and administrating the cluster – only administrators had access to it. A few users got read-only access to become familiar with it (from Ambari 2.4, those users are no longer read-only users but have the Cluster User role).


The tool was used for defining access rights to folders in HDFS.
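At the command line, the same kind of folder-level access rights can be granted directly with HDFS ACLs; the paths and user names in this sketch are examples:

```shell
# Grant another user read/execute access to a project folder.
hdfs dfs -setfacl -m user:colleague:r-x /user/analyst/project

# Inspect the resulting ACL entries.
hdfs dfs -getfacl /user/analyst/project
```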


Grafana

Grafana is a tool for data visualization. It automatically creates a data source for the Ambari Metrics database, and by default plenty of dashboards for monitoring the cluster are available.
Grafana can also be used for visualizing users' data.
