Automating access from Apache Spark to S3 with Ansible

According to the Apache Spark documentation, Spark jobs must authenticate with S3 to be able to read or write data in the object storage. There are different ways of achieving that:

  • When Spark is running in a cloud infrastructure, the credentials are usually automatically set up.
  • spark-submit reads the AWS_ACCESS_KEY, AWS_SECRET_KEY and AWS_SESSION_TOKEN environment variables and sets the associated authentication options for the s3n and s3a connectors to Amazon S3.
  • In a Hadoop cluster, settings may be set in the core-site.xml file.
  • Authentication details may be manually added to the Spark configuration in spark-defaults.conf.
  • Alternatively, they can be programmatically set in the SparkConf instance used to configure the application’s SparkContext.

Honestly, I wouldn’t know much about the first option. It might have something to do with running Databricks on AWS.

The second option requires to set environment variables on all servers of the Spark cluster. If using Ansible, this can be done but only on a level of a task or role. This means that if you run a long-live Spark cluster, the variables will not be available once you start using the cluster.

The fourth option is the one that will receive the attention in this post. The spark-defaults.conf is the default configuration file and proper configuration in the file tunes your Spark cluster.

There are five configuration tuples needed to manipulate S3 data with Apache Spark. They are explained below.

Getting environmental variables into Docker

The following approach is suitable for a proof of concept or a testing. An enterprise solution should use service like Hashicorp Vault, Ansible Vault, AWS IAM or similar.

I am using Docker on Windows 10. The folder where DockerFile resides also has a file called aws_cred.env. Make sure this file is added to the .gitignore file so that it is not checked into source code repository! The env file holds the AWS key and secret key needed to authenticate with S3. The file structure is like this:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

When running the docker container with option –env-file the environmental variables in the file get exported to the Docker container.

In the Ansible code, they can both be looked-up in the following way:

{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}
{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}

These can be used in the Jinja2 template file spark-defaults.conf.j2 to generate a Spark configuration file. The configuration tuples relevant in this case are these two:

spark.hadoop.fs.s3a.access.key {{ lookup('env', 'AWS_ACCESS_KEY_ID') }}
spark.hadoop.fs.s3a.secret.key {{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}

This now gives you the access to the S3 buckets, never mind if they are public or private.

The JAR files

First, the following tuple is mandatory for the Spark configuration:

spark.hadoop.fs.s3a.impl      org.apache.hadoop.fs.s3a.S3AFileSystem

This tells Spark what kind of file system it is dealing with. The JAR files are the library sources for this configuration.

Two libraries must be added to the instances of the Spark cluster:

  • aws-java-sdk-1.7.4
  • hadoop-aws-2.7.3

The above mentioned Jinja2 file also holds two configuration tuples relevant for these JAR files:

spark.driver.extraClassPath   /usr/spark-s3-jars/aws-java-sdk-1.7.4.jar:/usr/spark-s3-jars/hadoop-aws-2.7.3.jar
spark.executor.extraClassPath /usr/spark-s3-jars/aws-java-sdk-1.7.4.jar:/usr/spark-s3-jars/hadoop-aws-2.7.3.jar

Be careful with the versions because they must match the Spark version. The above combination has proven to work on Spark installation packages that support Hadoop 2.7. Last two tasks in this main.yml do the job for the Spark cluster.

Once the files are downloaded (for example, I download them to /usr/spark-s3-jars) Apache Spark can start reading and writing to the S3 object storage.

Provision Apache Spark in AWS with Hashistack and Ansible

Provision Apache Spark in AWS with Hashistack and Ansible

Automation is the key word when it comes to using cloud services. Pay-as-you-go is the philosophy behind it.

In this post, I explain how I provision Apache Spark cluster on Amazon. The configuration of the cluster is done prior to the provisioning using the Jinja2 file templates. The cluster, once provisioning is completed, is therefor ready to use immediately.

One of the points with automation is to make data scientists more independant of data engineers: data engineer builds the solution and data scientist uses it without having the need for engineering experience.
In this case, the data scientist hast to configure the cluster using the YAML file and prepare a GitHub repository.

There are two ways of using this solution:

  • A long-live Spark cluster
    Spark cluster serves as a solution for running various jobs. The cluster is always available.
  • One-time job execution
    Spark cluster is provisioned for a specific job which is executed and then the cluster is destroyed. Data Scientist is responsible for data input and data storage in the code (example).

Technologies and services

  • AWS EC2 (Centos7)
  • Terraform for provisioning the infrastructure in AWS
  • Consul for cluster’s configuration settings
  • Ansible for software installation on the cluster
  • GitHub for version control
  • Docker as test and development environment
  • Powershell for running Docker and provision
  • Visual Studio Code for software development, running Powershell

In order to use a service like EC2 in AWS, the Virtual Private Cloud must be established. This is something I have automized using Terraform and Consul and described here. This provision is a “long-live” provision since VPC has practically no cost.

Prerequisites

I will not go into details of how to install all the technologies and services from the list. However, this GitHub repository does build a Docker container with Consul and latest Terraform. Consul in the Docker is an agent which connects to a global Consul server in Amazon. Documenting the global Consul is on my TO-DO list.

I suggest investing some time and creating a Consul with connection to your own GitHub repository that stores the configuration.

Repository on GitHub

The repository can be found at this address.

Repository Structure

There are two modules used in this project: instance and provision-spark. The module instance is pure Terraform code and does the provisioning of the instances (Spark’s master and workers) in the AWS. The output (DNS and IP addresses) of this module is the input for the module provision-spark which is more complex. It is written in Terraform, Ansible and Jinja2.

Ansible Roles

Below is the structure of the Ansible part of the module.

Roles prereq and spark are applied to all instances. The prereq role takes care of the prerequisites (java, anaconda) and the spark role downloads and installs Spark, and creates Linux objects needed for Spark to work. The start_spark_master applies only to the master instance and start_spark_workers to the worker instances. The role execute_on_spark automatically executes a job on Spark cluster (more on that later).

The path to the YAML file that executes the roles is available here.

Cluster Configuration

Cluster is configured in YAML format and the configuration is sent to the global Consul server. One configuration block servers one cluster. Example for cluster lr_iris can be found here.

Running the code

Provisioning starts in module provision-spark where the line

terraform apply -auto-approve

starts provisioning the cluster. Configuration is taken from Consul to populate the variables in Terraform. Ansible (inventory) file is created by Terraform after the EC2 instances are launched and started. After the inventory file is created, Terraform executes the spark.yml file and the rest is in the hands of Ansible. If everything goes well, the output is similar to the following:

This is a Terraform output as defined in the output.tf file.

The Spark cluster is now ready.

View in AWS Console

The instances in the Spark cluster look like this in AWS console:

Spark as a Service

Spark services running on master and workers are handled as services using systemctl. The services are created and started using Ansible: Spark workers start a service called spark-worker whose Ansible code can be seen here.
Spark Master has two services: spark-master and sparkhs (Spark History Server). Ansible code for both services is here.

Spark Master

Checking if Spark Master is available by using the public IP address and port 8080 should return an interface similar to this one:

Five workers were set up in the configuration file. This means we have a cluster with six instances: one is the Spark Master, the other five are the workers.

Spark History Server

Spark History Server, just like Spark Master become significant if long-live cluster is used. It helps monitoring and debugging the jobs (applications in Spark language).
Spark History Server can be reached at port 18080 on Spark Master.

Above is an example of an application that was executed on the Spark cluster. Note Event log directory – it is pointing to a local directory which will be removed once the cluster is destroyed. This is not an issue if we are running a long-live cluster, but if we want to keep logs for one-time clusters it is advised to store the logs externally. In this case, since Amazon is used, storing to S3 would be the best option.

Automatic Code Execution

The Spark cluster is now ready to use. Full automation process is achieved when the Spark code is automatically executed from the Terraform code once the cluster is available. In the repository, one of the Ansible roles is execute_on_spark which executes either a Python or a Scala code on the provisioned Spark cluster.

Which Spark code will be executed depends on the configuration in the YAML file. A path to a GitHub repository is part of the configuration and that repository is cloned to the Spark Master and executed.

An example mentioned above can be found here. The example is one of Hello Worlds in data science – Logistic Regression on Iris dataset. In this case, the Data Scientist is responsible for the input data and storing the results outside of the Spark cluster.

    input_file = "s3a://hdp-hive-s3/test/iris.csv"
    output_dir = "s3a://hdp-hive-s3/test/git_iris_out"

When the cluster is ready, the repository is cloned and the code is executed. Inside the code, the Data Scientist defines input and output. In this case, object storage S3 is used to do a one-time job, save the results and the Spark cluster is of no use anymore.

Service configuration tools and files

The previous post mentions Consul and git2consul which are storing the parameters and fetching data from GitHub to Consul. It is only fair to gain more in-depth knowledge about them.

Consul is a product of a company called HashiCorp and together with Terraform (and other I do not mention yet) forms a group of tools called HashiStack. Consul is a tool for service configuration we build with our scripts. The scripts are the general presentation of the to-be state, while the configuration in Consul personalizes the infrastructure we plan to build (provision).

Service git2consul “mirrors the contents of a git repository into Consul KVs”. With other words, the service reads a git repository and creates/updates key-value pairs in Consul.

git2consul_Consul_cooperation

High level presentation of git2consul and Consul cooperation – git2consul periodically reads from a given GitHub repository and updates the Consul server

The previous post describes how local Consul server is started when the Docker container is ran. Local Consul is acceptable for testing purposes, it is possible and advised to build a distributed Consul service which offers High Availability (avoids single point of failure).

Configuration in YAML

A dedicated GitHub repository for the configuration parameters for my IaC projects can be found here. One YAML file for one project. For example configuration for the VPC architecture defines the services built in VPC that serve as the foundation for clusters built on top, for example configurations (I write in plural since there are/can be more than one) for a Spark cluster.

The git2consul configuration file inside the Docker container holds parameters, among them also the URL to the GitHub repository that serves as configuration repository. The file I am using for my git2consul service is here. It is copied over in the container when the image is created.

The following graphic show the same configuration parameters in three ways. First image is YAML file as seen in GitHub (I use Atom for development), second picture shows the same parameters as seen using a consul API from the command line in the Docker and the third picture shows a print screen of the same parameters in Consul web server.

configuration_example_aws

Three views of same key-value pairs – GitHub, command line and Consul web server

Another example, this one of Machine Learning in Spark shows how two different machine learning projects are configured in spark.yml. The prerequisite to run either of this is the VPC infrastructure and the input files (key spark_job_args). This example show that the scripts used to build the Spark cluster are untouched while the configuration in Consul personalizes the use case. If a new Spark job should be run, it is best to copy an existing block of key-value pairs and change to fit the needs.

A more complex example is the hdp.yml file which holds key-value pairs for five different Hadoop clusters. All can be provisioned using the same Terraform and Ansible scripts.

Writing to Consul at runtime

As mentioned a couple of times, the VPC in AWS is prerequisite and the established VPC is where all the following solutions are built in. This requires saving some values of the VPC so that they can be picked up at the provisioning of the next solution. These values are saved in Consul and are NOT pushed to GitHub – git2consul works one way only.

Once the VPC is provisioned, Terraform writes to Consul in a path defined by the user. In my example, everything starting with generated under the aws is coming from Consul.

consul_generated

Key-value pairs generated when VPC is provisioned. Observe the last line – it specifies the name which should be used to gather all generated key-value pairs under.

These values are further picked up in other Terraform scripts so that the infrastructure that is being build knows where to fit in. I mentioned Spark and Hadoop earlier – the instances launched in AWS need the generated key-value pairs for successful launch.

Introduction to Automation in the Cloud

An attempt to explain how open source tools for automation are used for minimizing costs and maximizing control over infrastructure in the cloud.

Introduction

Automation or Infrastracture-as-Code (IaC) is the idea where all the infrastructure is written in scripts and the scripts are executed when needed. In the “old days” (and some vital parts of organization’s solutions) the infrastructure represented physical servers in the basement with software installed and maintained by the in-house engineers with the help of vendor’s consultants. With the Infrastructure-as-Code the only thing maintained are the scripts while the basement is housing the table-tennis table. The scripts are maintained by data engineers (so called DevOps engineers) and broader audience can now build, maintain and destroy the infrastructure. It does help that the cloud vendors have simplified the services that were once the domain of the network engineers, for example.

The tendency in the areas of data storage and data processing (or everywhere in the IT fields) is to move to a cloud. A private cloud, a public cloud or a hybrid. Those are the options. Moving everything to a public cloud (big three: AWS, Google Cloud Platform or Azure) will make you a smart consumer of those services moneywise. Your goal is to pay-as-you-go, meaning run your applications when needed on the infrastructure you need and destroy the infrastructure when results are saved.

“Pay-as-you-go in cloud”

For succeeding in pay-as-you-go concept, two things have emerged on the market:

  • cheap object storage (S3 on AWS, Blob Storage on Azure and Cloud Storage on Google Cloud Platform).
  • tools for Infrastructure-as-Code (IaC)

Cheap object storage is exactly that: low cost storage of files in all form, shapes and types. This allows to store data cheap and build infrastructure for processing when needed. This follows the idea of dividing storage and processing.

“Division of storage and processing resources”

Days of having Hadoop just to have Hadoop are over and a company needs a good reason to justify having and maintaining a Hadoop cluster. The division of storage and processing works if the infrastructure is dynamic, rather, if the infrastructure-as-code can fulfil user’s needs. The responsibility falls on the DevOp engineers and the tools.

There is no doubt that the tools are there, plenty to choose from already from the open source community. Since I am following the philosophy where companies pay less for licence and more for knowledge I focus on open source technologies in cloud.

“Organizations will pay less for licenses and more for knowledge”

Infrastructure-as-code should offer a robust and general solution where the infrastructure is configured through input parameters. With other words, users define the input parameters, run the code and get the customized solution. This is what I attempt to demonstrate in a few of my GitHub repositories. I will come to this in my later posts.

Choosing the tools to do the job is not simple. As it is not simple to pick the most suitable cloud distributor. Here in Norway, Azure is the most popular cloud solution, in my opinion, not because of quality but because of the market position and good sales people at Microsoft.

Myself, I have experience mostly with AWS (a reader might observe that I write Amazon Web Services as AWS while Google Cloud Platform is not GCP) and some with OpenStack and VMWare. Choosing a cloud vendor is not as problematic as it is choosing the architecture in your cloud. Using services provided by the cloud vendor results in a possible risk to be locked to one technology or vendor. Migration to another, similar, solution might be costly. And this should be an option always when working with new technologies where there are uncertainties if the proposed architecture will deliver.

“Locking yourself to one distributor can be risky”

The technology stack I use in my examples is the following:

Cloud vendor: AWS

Cloud Vendor’s services: S3 (object storage), VPC (virtual private cloud – mandatory for launching instances in AWS), EC2 (instances in the cloud – Linux servers)

aws

Object storage S3 for storing data is separated from the processing resources (made up from EC2 instances) which are in the mandatory VPC. Any other storage can be used if it has connectors, as well as S3 can be accessed externally.

Infrastructure as Code tools: Terraform (automation of services in the cloud), Consul (configuration of infrastructure to be created) and Ansible (software installation and administration of instances built in the cloud)

IaC tools.JPG

Symbiosis between the IaC tools: user stores configuration of desired infrastructure to Consul, Terraform reads the configuration at provisioning, saves new parameters back to Consul and at the same time executes the Ansible scripts which install and setup the software for the desired solution.

Work environment: Docker for Windows (container with Linux environment on local machine), PowerShell (for Docker creation and development and test of scripts)

Version control: GitHub and GitHub Desktop (for pulling and pushing to the repositories)

work-environment

Repository with files for Docker container creation is cloned from GitHub using GitHub Desktop. PowerShell is used to create the Docker image and start the Docker container. This Docker container represents the entry point to the Infrastructure-as-Code development and testing.

IDEs for coding: Atom (for Terraform, Ansible and Consul configuration), PyCharm and Jupyter (for Python scripts) and Intellij IDEA (for Scala scripts)

Next post goes in depth on the DevOps environment.