Services on the cluster
This chapter briefly describes the user-facing and administrative services on the cluster.
Spark
Spark was the key service offered on the cluster. With Spark 2.0 it became even more obvious that Spark would be the cluster's computational engine. It offers Java, Scala, R and Python APIs. Spark SQL is available, Spark Streaming has been improved, the machine learning libraries (MLlib) have been updated and new ones added, and GraphX is available for network analysis.
Ambari Views
Ambari Views was offered for manual file upload to HDFS and for other file manipulation tasks. Cluster users commonly had a large file on their local disk that they wanted to run calculations on, and Ambari Views proved adequate for that job.
Hive was also accessible from this interface.
RStudio
Once the required R packages and SparkR were installed, RStudio could be used to work with SparkR. R users already knew the tool, and it was also more stable than Zeppelin.
Zeppelin
An open source tool that works on the notebook principle, like Jupyter or IPython. It is still quite new in the ASF family but has great potential in combination with Spark. It is well suited to learning and testing, and it is also possible to run a cron job via Zeppelin. The interface sends spark-submit commands in the background.
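To make the last point more concrete, the sketch below assembles the kind of spark-submit command line that a front end could issue on a user's behalf. It is only an illustration: the exact command Zeppelin generates is not documented here, and the application file, input path, job name and memory setting are hypothetical. The flags themselves (--master, --deploy-mode, --name, --conf) are standard spark-submit options.

```python
import shlex


def build_spark_submit(app_path, app_args=(), master="yarn",
                       deploy_mode="client", name=None, conf=None):
    """Return a spark-submit command as a list of arguments."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if name:
        cmd += ["--name", name]
    # Each Spark property is passed as a separate --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    # Hypothetical application and input path, for illustration only.
    command = build_spark_submit(
        "wordcount.py",
        ["/user/alice/input.txt"],
        name="wordcount",
        conf={"spark.executor.memory": "2g"},
    )
    # Print the command instead of running it, so no cluster is needed.
    print(shlex.join(command))
```

Building the command as a list rather than a single string keeps arguments with spaces intact if the list is later handed to subprocess.run.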
Storm
A tool for streaming data and near-real-time analytics. Topologies were defined mainly in Java. There were few Java developers among the researchers, so I mostly used this tool myself to ingest data into HDFS, MySQL and Redis.
Sqoop
Sqoop was used to transfer data between HDFS and MySQL (it works with other major RDBMSs as well).
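A transfer in each direction might look roughly like the commands assembled below. This is a minimal sketch, not the configuration used on the cluster: the JDBC URL, user name, password file, table names and HDFS directories are all hypothetical. The options shown (--connect, --username, --password-file, --table, --target-dir, --export-dir, --num-mappers) are standard Sqoop 1 arguments.

```python
import shlex


def _connection_args(jdbc_url, username, password_file):
    """Arguments shared by import and export."""
    return ["--connect", jdbc_url,
            "--username", username,
            "--password-file", password_file]


def build_sqoop_import(jdbc_url, username, password_file,
                       table, target_dir, num_mappers=1):
    """MySQL table -> HDFS directory."""
    return (["sqoop", "import"]
            + _connection_args(jdbc_url, username, password_file)
            + ["--table", table,
               "--target-dir", target_dir,
               "--num-mappers", str(num_mappers)])


def build_sqoop_export(jdbc_url, username, password_file,
                       table, export_dir, num_mappers=1):
    """HDFS directory -> existing MySQL table."""
    return (["sqoop", "export"]
            + _connection_args(jdbc_url, username, password_file)
            + ["--table", table,
               "--export-dir", export_dir,
               "--num-mappers", str(num_mappers)])


if __name__ == "__main__":
    # All connection details below are placeholders.
    url = "jdbc:mysql://db.example.org/research"
    print(shlex.join(build_sqoop_import(
        url, "alice", "/user/alice/.mysql-pass",
        "measurements", "/user/alice/measurements", num_mappers=4)))
    print(shlex.join(build_sqoop_export(
        url, "alice", "/user/alice/.mysql-pass",
        "results", "/user/alice/results")))
```

Reading the password from a file instead of passing it on the command line keeps it out of the shell history and the process list.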
CLI
The command line interface was an alternative for more advanced users who were also familiar with Linux. The Client node was the users' entry point to the cluster; each user received a private key and storage space on the Client instance.
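As an illustration of this route, the sketch below lists the commands a user might run to log in to the Client node and copy a local file into HDFS. The host name, user name, key file and paths are hypothetical; ssh -i and the hdfs dfs subcommands -mkdir -p, -put and -ls are standard.

```python
import posixpath
import shlex


def ssh_login_command(user, host, key_file):
    """Log in to the Client node with a private key."""
    return ["ssh", "-i", key_file, f"{user}@{host}"]


def hdfs_upload_commands(local_path, hdfs_dir):
    """Commands to run on the Client node to copy one file into HDFS."""
    target = posixpath.join(hdfs_dir, posixpath.basename(local_path))
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],    # create the target folder
        ["hdfs", "dfs", "-put", local_path, target],  # copy local file to HDFS
        ["hdfs", "dfs", "-ls", hdfs_dir],             # check the result
    ]


if __name__ == "__main__":
    # Placeholder user, host and key file.
    print(shlex.join(ssh_login_command(
        "alice", "client.example.org", "/home/alice/.ssh/cluster_key")))
    for command in hdfs_upload_commands("/home/alice/big.csv",
                                        "/user/alice/data"):
        print(shlex.join(command))
```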
Ambari
A tool for monitoring and administering the cluster; only administrators had full access to it. A few users were given read-only access so that they could become familiar with it (from Ambari 2.4, the Read-Only role is called Cluster User).
Ranger
The tool was used for defining access rights to folders in HDFS.
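Such access rights are expressed in Ranger as policies, which can be created in the web interface or submitted as JSON through Ranger's REST API. The sketch below builds a policy document for one HDFS folder. It is an assumption-laden illustration: the service name, policy name, folder and user are hypothetical, and the field names follow the Ranger policy format as commonly documented, so they should be checked against the Ranger version in use.

```python
import json


def hdfs_folder_policy(service, name, path, users,
                       access_types=("read", "execute")):
    """Build a Ranger-style policy granting users access to one HDFS folder."""
    return {
        "service": service,   # name of the HDFS service registered in Ranger
        "name": name,
        "isEnabled": True,
        "resources": {
            # isRecursive extends the policy to everything below the folder.
            "path": {"values": [path], "isRecursive": True},
        },
        "policyItems": [{
            "users": list(users),
            "accesses": [{"type": t, "isAllowed": True}
                         for t in access_types],
        }],
    }


if __name__ == "__main__":
    # Placeholder service, folder and user names.
    policy = hdfs_folder_policy(
        "cluster_hadoop", "alice-project-data",
        "/data/project", ["alice"])
    print(json.dumps(policy, indent=2))
```

Read plus execute is the usual minimum for browsing a folder and reading its files; write would be added for users who also need to store results there.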
Grafana
A tool for data visualization. A data source for the Ambari metadata database is created automatically, and plenty of dashboards for monitoring the cluster are available by default.
Grafana can also be used to visualize users' data.