Provision Apache Spark in AWS with Hashistack and Ansible

Automation is the key word when it comes to using cloud services, and pay-as-you-go is the philosophy behind them.

In this post, I explain how I automate the provisioning of an Apache Spark cluster on Amazon. The cluster is configured prior to provisioning using Jinja2 file templates, so once provisioning completes, the cluster is ready to use immediately.
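As an illustration, such a Jinja2 template might render part of Spark's environment configuration from values stored in Consul. The file name and variable names below are hypothetical, not the ones from the repository:

```jinja
{# Hypothetical spark-env.sh.j2 fragment; the variable names are assumptions. #}
export SPARK_MASTER_HOST={{ spark_master_ip }}
export SPARK_WORKER_CORES={{ spark_worker_cores }}
export SPARK_WORKER_MEMORY={{ spark_worker_memory }}
```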

The technologies and services used:

  • AWS EC2 (CentOS 7)
  • Terraform for provisioning the infrastructure in AWS
  • Consul for the cluster’s configuration settings
  • Ansible for software installation on the cluster
  • GitHub for version control
  • Docker as the test and development environment
  • PowerShell for running Docker and the provisioning
  • Visual Studio Code for writing code and running PowerShell

In order to use a service like EC2 in AWS, a Virtual Private Cloud must first be established. This is something I have automated using Terraform and Consul and described here. This is a “long-lived” provision, since the services needed to build a VPC cost practically nothing to run.
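For illustration, reading configuration values from Consul into Terraform could look like the following sketch. The key paths and resource names are assumptions, not the ones from the repository:

```hcl
# Hypothetical sketch: pulling VPC settings from Consul before provisioning.
provider "consul" {
  address = "localhost:8500"
}

data "consul_keys" "vpc" {
  # Example key paths; the real repository may use different ones.
  key {
    name = "vpc_cidr"
    path = "aws/vpc/cidr"
  }
}

resource "aws_vpc" "spark" {
  cidr_block = data.consul_keys.vpc.var.vpc_cidr

  tags = {
    Name = "spark-vpc"
  }
}
```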


I will not go into details of how to install all the technologies and services from the list. However, this GitHub repository does build a Docker container with installed local Consul and latest Terraform. I suggest investing some time and creating a Consul with connection to your own GitHub repository that stores the configuration.

Repository on GitHub

The repository can be found at this address.

Repository Structure

There are two modules used in this project: instance and provision-spark. The module instance is pure Terraform code and provisions the instances (Spark’s master and workers) in AWS. The output (DNS names and IP addresses) of this module is the input for the module provision-spark, which is more complex: it is written in Terraform, Ansible and Jinja2. Below is the structure of the Ansible part of the module.
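Conceptually, wiring the output of one module into the other could look like this sketch; the variable and output names are assumptions, not the repository's exact ones:

```hcl
# Hypothetical sketch of how the two modules could be wired together.
module "instance" {
  source       = "./instance"
  worker_count = 5
}

module "provision-spark" {
  source = "./provision-spark"

  # The instance module's outputs (DNS names and IP addresses)
  # become the inputs of the provisioning module.
  master_dns = module.instance.master_public_dns
  worker_ips = module.instance.worker_private_ips
}
```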

The roles java and spark are applied to all instances. The java role takes care of the prerequisites, while the spark role downloads and installs Spark and creates the Linux objects needed for Spark to work. The start_spark_master role applies only to the master instance, and start_spark_workers only to the worker instances. The path to the YAML file that executes the roles is available here.
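A playbook applying these roles could look roughly like the following sketch; the inventory group names are assumptions, not necessarily those used in the repository:

```yaml
# Hypothetical sketch of a spark.yml playbook; group names are assumptions.
- hosts: all
  become: yes
  roles:
    - java
    - spark

- hosts: spark_master
  become: yes
  roles:
    - start_spark_master

- hosts: spark_workers
  become: yes
  roles:
    - start_spark_workers
```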

Running the code

Running Terraform’s command “terraform apply -auto-approve” from the provision-spark folder starts the provisioning. The configuration is taken from Consul to populate the variables in Terraform. The Ansible inventory file is created by Terraform after the EC2 instances are launched and started. Once the inventory file exists, Terraform executes the spark.yml file and the rest is in the hands of Ansible. If everything goes well, the output is similar to the following:
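The step of generating the inventory and handing over to Ansible could be sketched in Terraform as follows. The resource names, instance references and file paths are assumptions for illustration only:

```hcl
# Hypothetical sketch: write the Ansible inventory, then run the playbook.
resource "local_file" "inventory" {
  filename = "ansible/inventory"
  content  = <<-EOT
    [spark_master]
    ${aws_instance.master.public_dns}

    [spark_workers]
    %{ for w in aws_instance.worker ~}
    ${w.public_dns}
    %{ endfor ~}
  EOT
}

resource "null_resource" "run_ansible" {
  depends_on = [local_file.inventory]

  provisioner "local-exec" {
    command = "ansible-playbook -i ansible/inventory ansible/spark.yml"
  }
}
```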

This is a Terraform output as defined in the file.

Checking whether the Spark master is available, by opening the public IP on port 8080, should return an interface similar to this one:

Five workers were set up in the configuration file, and we can see a cluster with six instances: one is the Spark master, the other five are the workers.

The Spark cluster is now ready.

Automatic Code Execution

The Spark cluster is now ready for use. Full automation is achieved when the Spark code is executed automatically from the Terraform code once the cluster is available. In the repository, under resources, there is a folder execute_on_spark which executes either Python or Scala code on the provisioned Spark cluster.

If the parameters in Consul are set up correctly and execute_on_spark/execute.yml is added as a new resource at the end of the Terraform script, the GitHub repository is downloaded to the master instance and executed.
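Adding such a final resource could look like this sketch; the dependency name and playbook invocation are assumptions, not the repository's exact code:

```hcl
# Hypothetical sketch: trigger the execute_on_spark playbook once the
# cluster provisioning has finished.
resource "null_resource" "execute_on_spark" {
  # Assumed name of the resource that runs the main Spark provisioning.
  depends_on = [null_resource.run_ansible]

  provisioner "local-exec" {
    command = "ansible-playbook -i ansible/inventory resources/execute_on_spark/execute.yml"
  }
}
```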

Showing how this is done is left as a small TO-DO project at the end of this post.