Docker, AWS, Python3 and boto3

The idea behind is to have an independent environment to integrate Amazon Web Services’ objects and services with Python applications.

The GitHub repository with example can be found here. The README.md will probably serve you better than this blog post if you just want to get started.

The environment is offered in a form of a Docker container, which I am running on Windows 10. The above repository has a DockerFile available so the container can be build wherever.

Python 3 is the language of choice to work against the AWS and for that a library boto3 is needed. This is an AWS SDK for Python and it is used to integrate Python applications with AWS services.

Bare minimum

To get started, all is needed is access key and secret key (which requires an IAM user with assigned policies), Python and installed boto3.

The policies the user gets assigned are going to reflect in the Python code. It can be frustrating at the beginning to assign the right policies so maybe for the purpose of testing, give the user all rights to a service and narrow it down later.

Where to begin

The best service to begin with is object data storage AWS S3 where you can manipulate with buckets (folders) and objects (files). And you also see immediate results in AWS console. Costs are also minimal and there are no services running “under” S3 that need attention first. My repository has a simple Python package which lists all available buckets.

Credentials and sessions

To integrate Python application and AWS services, an IAM user is needed and users access key and service key. They can be provided in different ways, in this case, I have used sessions – which allow users (dev, test, prod…) to change at runtime. This example of credentials file with sessions gives the general idea about how to create multiple sessions.

The Python test file shows how to initialize a session.

Exception handling

Handling exceptions in Python3 and with boto3 is demonstrated in the test package. Note that the excpetion being caught is a boto3 exception.

Further work

The environment is set up, PyCharm can be used for software development while Docker can execute the tests.

There is nothing stopping you from developing a Python application.

After gaining some confidence, it would be smart to check the policies and create policies that allow a user or group excatly what they need to be allowed.

Dilemma

How far will boto3 take one organization? Is it smart to consider using, for example, Terraform when building VPC and launching EC2 instances?

It is worth making that decision and use an Infrastructure-as-Code tool on a higher level to automate faster. And prehaps use boto3 to do more granular work like manipulating objects in S3 or dealinh with users and policies.

Nginx, Gunicorn and Dash on CentOS

Challenge to solve

I am building a website for analyses of basketball games based on the play-by-play data publicly available after endgame. My logic (parsing, fetching from the internet, algorithms, etc) is written in Python and I wanted to continue using Python all the way, also when building front end. To do that, I have chosen Dash which builds on top of Flask.

My plan was to publish the web-based analytic app called Hubie behind a domain with port 80 and run it on a Linux server in the cloud. Gunicorn is the server of choice for the web application and Nginx is a web server which in this case serves as a reverse proxy.

The web application deployed is using a DNS name created in Azure.

Virtual environment is used to test the web application from port 8080, and to execute python3 and gunicorn commands suitable for Python3.

One of the challenges CentOS 7 has is that it still used Python 2 as its default Python. Installing Python3 and changing paths in /usr/bin might seem a good solution, but it will come back and haunt you. That is why it is best to create a virtual environment with the desired Python version.

Environment

Service used to host the server is Virtual Machine on Azure. The Linux server is using image CentOS-based 7.7 and the instance size is Standard D2s v3.

Install packages

Preparing CentOS environment by installing the necessary packages:

sudo yum update -y
sudo yum install -y epel-release
sudo yum -y install python3-pip nginx git 
sudo  yum install --enablerepo="epel" ufw -y
sudo yum install -y policycoreutils-{python,devel}
sudo pip3 install virtualenv

Create www-data group

sudo groupadd www-data
sudo usermod -a -G www-data centos

Create virtual environment

This command creates a new folder inside /home/$USER/ with the same name as the virtual environment. In this case, the path to the virtual environment home is /home/centos/hubievenv.

virtualenv hubievenv

Activating the virtual environment will enforce Python installed in the virtual environment.

source hubievenv/bin/activate

Executing the above command makes a change to the command line:
(hubievenv) [centos@hubie4 ~]$

The virtual environment can be exited by typing deactivate command. Before that, the virtual environment needs to be prepared.

Install packages with pip in virtual environment

pip install gunicorn flask dash plotly pandas boto3

If not using dash, only flask, remove the dash package from the install list. No need to install flask if you are only using dash. Package boto3 is installed because my data source is AWS S3.

If you get the following error:

ERROR: botocore 1.13.33 has requirement python-dateutil<2.8.1,>=2.1; python_version >= "2.7", but you'll have python-dateutil 2.8.1 which is incompatible.

Downgrade python-dateutil:

pip install python-dateutil==2.8.0

Any other Python package needed should be installed in the virtual environment. Deactivate the virtual environment when done installing.

Create home directory for git repository

This step is not needed to make nginx and gunicorn work.

My source code for the web app is in a GitHub repository.

mkdir git
git clone https://github.com/markokole/hubie.git

Executing above commands means home to my web app project is /home/centos/git/hubie. This will come in handy later on.

Test web application

I am still in the virtual environment for the testing purpose.

Since the application I am using as an example connects to AWS S3, credentials are needed.

export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
 export AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>

Stepping into the git repository and executing the following command:

gunicorn --chdir logic -b 0.0.0.0:8080 hubie:server

Should load the website in a browser once you enter IP_ADDRESS:8080 or DNS_NAME:8080

Make sure you open the port 8080!

The page loads successfully, and now we work towards loading the page with port 80.

Exit virtual environment:

deactivate

Create gunicorn service

To create a Gunicorn service, two services will be created, one depending on the other.

Create gunicorn.socket file

First service creates a socket file which listens for connections.

sudo vi /etc/systemd/system/gunicorn.socket

[Unit]
Description=gunicorn socket
[Socket]
ListenStream=/run/gunicorn.sock
[Install]
WantedBy=sockets.target

No need to start this service since it is a dependence of the service described below.

Create gunicorn.service file

This file creates the Gunicorn service and prior to that starts the above mentioned socket service. Make sure both files have the same name.

sudo vi /etc/systemd/system/gunicorn.service

[Unit]
Description=gunicorn daemon
Requires=gunicorn.socket
After=network.target

[Service]
User=centos
Group=www-data
WorkingDirectory=/home/centos/git/hubie
Environment="AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>"
Environment="AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>"
Environment="PATH=/home/centos/hubievenv/bin/gunicorn"

ExecStart=/home/centos/hubievenv/bin/gunicorn --workers 3 --chdir /home/centos/git/hubie/logic --bind unix:/run/gunicorn.sock hubie:server

[Install]
WantedBy=multi-user.target

The group www-data has to exist before this service is started. Alter the parameters accordingly.

Start the gunicorn service

When the service file is created, start the service.

sudo systemctl start gunicorn

Enable the service so that it starts automatically after server restart.

sudo systemctl enable gunicorn

Check for status of the service with below command.

sudo systemctl status gunicorn

If the gunicorn service does not start add execute right to the world to the gunicorn.sock file.

sudo chmod 667 /run/gunicorn.sock

Configure Nginx and Gunicorn

Gunicorn configuration file

First, create two folders in the nginx home (/etc/nginx), folder sites-available will store the gunicorn configuration file, sites-enabled will store the symbolic link of the file.

sudo mkdir /etc/nginx/{sites-available,sites-enabled}

Create the configuration file. Keep in mind the file has to be of type *.conf.

sudo vi /etc/nginx/sites-available/gunicorn.conf

server {
    listen 80;
    server_name mydomain.com www.mydomain.com;

    location = /favicon.ico { access_log off; log_not_found off; }
    location /hubie/ {
        root /home/centos/git;
    }
    location / {
        proxy_pass http://unix:/run/gunicorn.sock;
    }
}

Server name is the DNS name or IP address of the server.

First location ignores the error of missing favicon. ico file.
Second location defines the project name with root as the home directory of the repository.

Create symbolic link

Create a symbolic link of the file in the sites-available folder.

sudo ln -s /etc/nginx/sites-available/gunicorn.conf /etc/nginx/sites-enabled

Nginx configuration file

The nginx configuration file should be changed as well.

sudo mv /etc/nginx/nginx.conf /etc/nginx/nginx.conf.default
sudo vi /etc/nginx/nginx.conf

user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 1024;
}

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;

    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;


    include /etc/nginx/sites-enabled/*.conf;
    server_names_hash_bucket_size 64;
}

The file is pretty much similar to the default file, except the last two lines.

Check validity of nginx.conf

sudo nginx -t

Restart the nginx service.

sudo systemctl restart nginx

Create nginx.ini for ufw

Last file we create is a nginx.ini file to fix the firewall issues. Linux package ufw is required for the job.

sudo vi /etc/ufw/applications.d/nginx.ini

[Nginx HTTP]
title=Web Server
description=Enable NGINX HTTP traffic
ports=80/tcp

[Nginx HTTPS] \
title=Web Server (HTTPS) \
description=Enable NGINX HTTPS traffic
ports=443/tcp

[Nginx Full]
title=Web Server (HTTP,HTTPS)
description=Enable NGINX HTTP and HTTPS traffic
ports=80,443/tcp

sudo ufw enable

Answer “y” to the question.

sudo ufw allow 'Nginx Full'

Execute the following two commands and the Dash web app will be ready to use.

sudo grep nginx /var/log/audit/audit.log | audit2allow -M nginx
sudo semodule -i nginx.pp

If you check the browser, the page with the server’s DNS or IP loads on port 80.

Some error messages

502 bad gateway

connect() to unix:/run/gunicorn.sock failed (13: Permission denied) while connecting to upstream

When you run into this error, and believe me you will, make sure user nginx has access to the *.sock file in the above mentioned error message. Even though service nginx is not owned by nginx, nginx is still accessing the socket file.
With below command, it is possible to monitor the nginx error messages:

sudo tail -f var/log/nginx/error.log

504 Gateway Time-out

upstream timed out (110: Connection timed out) while reading response header from upstream

In the file that defines the service – in this example gunicorn.service – add the following option:

--timeout 120

remember to restart the service. And for more details regarding this solution, check out this stackoverflow post.

Automating access from Apache Spark to S3 with Ansible

According to the Apache Spark documentation, Spark jobs must authenticate with S3 to be able to read or write data in the object storage. There are different ways of achieving that:

When Spark is running in a cloud infrastructure, the credentials are usually automatically set up.
spark-submit reads the AWS_ACCESS_KEY, AWS_SECRET_KEY and AWS_SESSION_TOKEN environment variables and sets the associated authentication options for the s3n and s3a connectors to Amazon S3.
In a Hadoop cluster, settings may be set in the core-site.xml file.
Authentication details may be manually added to the Spark configuration in spark-defaults.conf.
Alternatively, they can be programmatically set in the SparkConf instance used to configure the application’s SparkContext.

Honestly, I wouldn’t know much about the first option. It might have something to do with running Databricks on AWS.

The second option requires to set environment variables on all servers of the Spark cluster. If using Ansible, this can be done but only on a level of a task or role. This means that if you run a long-live Spark cluster, the variables will not be available once you start using the cluster.

The fourth option is the one that will receive the attention in this post. The spark-defaults.conf is the default configuration file and proper configuration in the file tunes your Spark cluster.

There are five configuration tuples needed to manipulate S3 data with Apache Spark. They are explained below.

Getting environmental variables into Docker

The following approach is suitable for a proof of concept or a testing. An enterprise solution should use service like Hashicorp Vault, Ansible Vault, AWS IAM or similar.

I am using Docker on Windows 10. The folder where DockerFile resides also has a file called aws_cred.env. Make sure this file is added to the .gitignore file so that it is not checked into source code repository! The env file holds the AWS key and secret key needed to authenticate with S3. The file structure is like this:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

When running the docker container with option –env-file the environmental variables in the file get exported to the Docker container.

In the Ansible code, they can both be looked-up in the following way:

{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}
{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}

These can be used in the Jinja2 template file spark-defaults.conf.j2 to generate a Spark configuration file. The configuration tuples relevant in this case are these two:

spark.hadoop.fs.s3a.access.key {{ lookup('env', 'AWS_ACCESS_KEY_ID') }}
spark.hadoop.fs.s3a.secret.key {{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}

This now gives you the access to the S3 buckets, never mind if they are public or private.

The JAR files

First, the following tuple is mandatory for the Spark configuration:

spark.hadoop.fs.s3a.impl      org.apache.hadoop.fs.s3a.S3AFileSystem

This tells Spark what kind of file system it is dealing with. The JAR files are the library sources for this configuration.

Two libraries must be added to the instances of the Spark cluster:

aws-java-sdk-1.7.4
hadoop-aws-2.7.3

The above mentioned Jinja2 file also holds two configuration tuples relevant for these JAR files:

spark.driver.extraClassPath   /usr/spark-s3-jars/aws-java-sdk-1.7.4.jar:/usr/spark-s3-jars/hadoop-aws-2.7.3.jar
spark.executor.extraClassPath /usr/spark-s3-jars/aws-java-sdk-1.7.4.jar:/usr/spark-s3-jars/hadoop-aws-2.7.3.jar

Be careful with the versions because they must match the Spark version. The above combination has proven to work on Spark installation packages that support Hadoop 2.7. Last two tasks in this main.yml do the job for the Spark cluster.

Once the files are downloaded (for example, I download them to /usr/spark-s3-jars) Apache Spark can start reading and writing to the S3 object storage.

Zealpath and Trivago: case for AWS Cloud Engineer position

Tl;dr: https://github.com/markokole/trivago-cicd-pipeline-aws

Trivago uses Zealpath to find potential engineers to join their team. Zealpath is a website which hosts challenges that everyone can solve and submit, and with that apply for a job.

This is my first time using Zealpath and approach seems very practical. In worst case you learn about company’s technology stack (or some of it) and the way they think and solve problems. I have “applied” for the position AWS Cloud Engineer and 72 hours were given to submit the solution. My intention, honestly, was not to apply for Trivago job but to learn something new about automation and pipelines in AWS.

The case is described here. I am aware that they might remove the link at some point so I copied the text to the GitHub repository where the solution is.

Once you apply, the clock start ticking. You download a data.zip file and follow the instructions.

The confusion

The zip file itself is a bit confusing since all the files in the top directory appear to be in the two folders as well. I have removed all the duplicates from the home directory which left me with only README file.

The technology stack

The AWS services making up the pipeline are:

Athena
Cloudformation
Glue
S3

A DockerFile has been created to automate the provision of the pipeline.

The solution

My solution is in a GitHub repository. Hopefully it is well enough documented for anyone to understand it. It should be quite simple once you have an AWS account and Docker on Windows 10 installed. I have not tested it on Linux system.

All one needs to do is copy the DockerFile to a folder on a local machine, add a file called aws_cred.env and build the container.

But! Before all that is done, the variable s3_bucket in the Jupyter Notebook needs to be updated to the bucket name you plan to use. I really didn’t understand why the duplicates in the zip file. That is also the reason why I created the tar.gz file with the code from Zealpath’s zip file. I have also taken out the files I assume are duplicates.

markobigdata

Big Data documentation in a blog

Category: S3