Nginx, Gunicorn and Dash on CentOS


Challenge to solve

I am building a website for analyses of basketball games based on the play-by-play data publicly available after the game ends. My logic (parsing, fetching from the internet, algorithms, etc.) is written in Python, and I wanted to keep using Python all the way through, including the front end. To do that, I have chosen Dash, which builds on top of Flask.

My plan was to publish the web-based analytic app, called Hubie, behind a domain on port 80 and run it on a Linux server in the cloud. Gunicorn is the application server of choice for the web application, and Nginx is a web server which in this case serves as a reverse proxy.

The web application deployed is using a DNS name created in Azure.

A virtual environment is used to test the web application on port 8080, and to execute the python3 and gunicorn commands suitable for Python 3.

One of the challenges with CentOS 7 is that it still uses Python 2 as its default Python. Installing Python 3 and changing paths in /usr/bin might seem like a good solution, but it will come back to haunt you. That is why it is best to create a virtual environment with the desired Python version.

Environment

The server is hosted as a Virtual Machine on Azure. The Linux image is CentOS-based 7.7 and the instance size is Standard D2s v3.

Install packages

Prepare the CentOS environment by installing the necessary packages:

sudo yum update -y
sudo yum install -y epel-release
sudo yum -y install python3-pip nginx git 
sudo yum install --enablerepo="epel" ufw -y
sudo yum install -y policycoreutils-{python,devel}
sudo pip3 install virtualenv

Create www-data group

sudo groupadd www-data
sudo usermod -a -G www-data centos

Create virtual environment

The following command creates a new folder inside /home/$USER/ with the same name as the virtual environment. In this case, the path to the virtual environment home is /home/centos/hubievenv.

virtualenv hubievenv

Activating the virtual environment ensures that the Python interpreter and packages installed inside it are the ones being used.

source hubievenv/bin/activate

Executing the above command makes a change to the command line:
(hubievenv) [centos@hubie4 ~]$
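
To confirm that the virtual environment's interpreter is the one in use, a quick check from inside the activated environment (the exact version printed depends on the installed python3 package):

which python
python --version

The python command should resolve to /home/centos/hubievenv/bin/python.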

The virtual environment can be exited by typing the deactivate command. Before doing that, though, the virtual environment needs to be prepared.

Install packages with pip in virtual environment

pip install gunicorn flask dash plotly pandas boto3

If you are using only Flask and not Dash, remove the dash package from the install list. There is no need to install flask explicitly if you are using Dash, since Dash pulls it in as a dependency. The boto3 package is installed because my data source is AWS S3.
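
Since boto3 is only needed because the data lives in S3, here is a minimal sketch of how it can be combined with pandas to load an object into a DataFrame. The bucket and key names are hypothetical and only meant for illustration:

import boto3
import pandas as pd
from io import BytesIO

# credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
s3 = boto3.client("s3")

# hypothetical bucket and object key
obj = s3.get_object(Bucket="my-basketball-data", Key="play_by_play/game_001.csv")
df = pd.read_csv(BytesIO(obj["Body"].read()))
print(df.head())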

If you get the following error:

ERROR: botocore 1.13.33 has requirement python-dateutil<2.8.1,>=2.1; python_version >= "2.7", but you'll have python-dateutil 2.8.1 which is incompatible.

Downgrade python-dateutil:

pip install python-dateutil==2.8.0

Any other Python packages needed should be installed in the virtual environment as well. Deactivate the virtual environment when you are done installing.

Create home directory for git repository

This step is not needed to make nginx and gunicorn work.

My source code for the web app is in a GitHub repository.

mkdir git
git clone https://github.com/markokole/hubie.git

Executing the above commands means the home of my web app project is /home/centos/git/hubie. This path will come in handy later on.

Test web application

I am still in the virtual environment for testing purposes.

Since the application I am using as an example connects to AWS S3, credentials are needed.

export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>

Step into the git repository and execute the following command:

gunicorn --chdir logic -b 0.0.0.0:8080 hubie:server

The website should load in a browser once you enter IP_ADDRESS:8080 or DNS_NAME:8080.

Make sure you open the port 8080!
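
The hubie:server part tells Gunicorn to look for a module named hubie (inside the logic folder, because of --chdir) and to serve the WSGI object called server. With Dash, that object is the Flask instance underneath the app. A minimal sketch of such a module – the layout of my actual app is more involved – could look like this:

import dash
import dash_html_components as html  # with newer Dash versions: from dash import html

app = dash.Dash(__name__)

# the Flask instance that Gunicorn serves, referenced as hubie:server
server = app.server

app.layout = html.Div([
    html.H1("Hubie"),
    html.P("Basketball play-by-play analyses"),
])

if __name__ == "__main__":
    # local development only; Gunicorn is used when deployed
    app.run_server(host="0.0.0.0", port=8080)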

The page loads successfully; now we work towards serving it on port 80.

Exit virtual environment:

deactivate

Create gunicorn service

To run Gunicorn as a service, two systemd units will be created, one depending on the other.

Create gunicorn.socket file

The first unit creates a socket file which listens for connections.

sudo vi /etc/systemd/system/gunicorn.socket
[Unit]
Description=gunicorn socket
[Socket]
ListenStream=/run/gunicorn.sock
[Install]
WantedBy=sockets.target

There is no need to start this unit explicitly, since it is a dependency of the service described below.

Create gunicorn.service file

This file creates the Gunicorn service and, prior to that, starts the socket unit mentioned above. Make sure both files have the same base name (gunicorn).

sudo vi /etc/systemd/system/gunicorn.service
[Unit]
Description=gunicorn daemon
Requires=gunicorn.socket
After=network.target

[Service]
User=centos
Group=www-data
WorkingDirectory=/home/centos/git/hubie
Environment="AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>"
Environment="AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>"
Environment="PATH=/home/centos/hubievenv/bin"

ExecStart=/home/centos/hubievenv/bin/gunicorn --workers 3 --chdir /home/centos/git/hubie/logic --bind unix:/run/gunicorn.sock hubie:server

[Install]
WantedBy=multi-user.target

The group www-data has to exist before this service is started. Alter the parameters (user, paths, environment variables) according to your setup.
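
After creating or changing systemd unit files, reload systemd so that it picks up the new definitions:

sudo systemctl daemon-reload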

Start the gunicorn service

When the service file is created, start the service.

sudo systemctl start gunicorn

Enable the service so that it starts automatically after server restart.

sudo systemctl enable gunicorn

Check the status of the service with the command below.

sudo systemctl status gunicorn

If the gunicorn service does not start, add execute rights for the world to the gunicorn.sock file.

sudo chmod 667 /run/gunicorn.sock
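
An alternative to chmod-ing the socket by hand is to let systemd set the ownership and mode when it creates the socket. This is a sketch of how the [Socket] section of gunicorn.socket could look (the values are only an example and must match your setup):

[Socket]
ListenStream=/run/gunicorn.sock
# let systemd create the socket with this owner, group and mode
SocketUser=centos
SocketGroup=www-data
SocketMode=0660

With a 0660 mode, the nginx user has to be a member of the www-data group (sudo usermod -a -G www-data nginx), otherwise the 502 error described further down will appear.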

Configure Nginx and Gunicorn

Gunicorn configuration file

First, create two folders in the Nginx home (/etc/nginx): sites-available will store the Gunicorn configuration file, and sites-enabled will store a symbolic link to that file.

sudo mkdir /etc/nginx/{sites-available,sites-enabled}

Create the configuration file. Keep in mind that the file name has to end with .conf, since that is what the include directive added to nginx.conf later will match.

sudo vi /etc/nginx/sites-available/gunicorn.conf
server {
    listen 80;
    server_name mydomain.com www.mydomain.com;

    location = /favicon.ico { access_log off; log_not_found off; }
    location /hubie/ {
        root /home/centos/git;
    }
    location / {
        proxy_pass http://unix:/run/gunicorn.sock;
    }
}

The server_name is the DNS name or IP address of the server.

The first location block suppresses log entries for the missing favicon.ico file.
The second location block maps the project path to the repository: with root set to /home/centos/git, a request under /hubie/ is resolved inside /home/centos/git/hubie.
The third location block passes all remaining requests to Gunicorn through the Unix socket.

Create symbolic link

Create a symbolic link in the sites-enabled folder that points to the configuration file in sites-available.

sudo ln -s /etc/nginx/sites-available/gunicorn.conf /etc/nginx/sites-enabled

Nginx configuration file

The main Nginx configuration file should be changed as well. Back up the default file and create a new one:

sudo mv /etc/nginx/nginx.conf /etc/nginx/nginx.conf.default
sudo vi /etc/nginx/nginx.conf
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 1024;
}

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;

    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;


    include /etc/nginx/sites-enabled/*.conf;
    server_names_hash_bucket_size 64;
}

The file is pretty much the same as the default one, except for the last two lines: the include of the sites-enabled folder and the server_names_hash_bucket_size setting.

Check validity of nginx.conf

sudo nginx -t

Restart the nginx service.

sudo systemctl restart nginx

Create nginx.ini for ufw

The last file we create is an nginx.ini file for ufw, which will handle the firewall rules. The Linux package ufw is required for the job.

sudo vi /etc/ufw/applications.d/nginx.ini
[Nginx HTTP]
title=Web Server
description=Enable NGINX HTTP traffic
ports=80/tcp

[Nginx HTTPS]
title=Web Server (HTTPS)
description=Enable NGINX HTTPS traffic
ports=443/tcp

[Nginx Full]
title=Web Server (HTTP,HTTPS)
description=Enable NGINX HTTP and HTTPS traffic
ports=80,443/tcp

Save the file and enable ufw:

sudo ufw enable

Answer “y” to the question.

sudo ufw allow 'Nginx Full'
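
To verify that the firewall rules are active:

sudo ufw status

The output should list Nginx Full as allowed.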

Execute the following two commands: the first generates an SELinux policy module named nginx from the denials recorded in the audit log, and the second installs it, allowing Nginx to connect to the Gunicorn socket. After that, the Dash web app is ready to use.

sudo grep nginx /var/log/audit/audit.log | audit2allow -M nginx
sudo semodule -i nginx.pp

If you check the browser, the page now loads on port 80 using the server’s DNS name or IP address.

Some error messages

502 bad gateway

connect() to unix:/run/gunicorn.sock failed (13: Permission denied) while connecting to upstream

When you run into this error – and believe me, you will – make sure the nginx user has access to the *.sock file mentioned in the error message. Even though the socket is not owned by the nginx user, the Nginx worker processes still need to access it.
With the command below, it is possible to monitor the nginx error messages:

sudo tail -f /var/log/nginx/error.log
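
A quick way to check whether the nginx user can actually reach the socket is to inspect the permissions along the path and test access directly (namei is part of util-linux and available on a default CentOS install):

# show the permissions of every path component leading to the socket
namei -l /run/gunicorn.sock

# check whether the nginx user can read and write the socket file
sudo -u nginx test -r /run/gunicorn.sock && echo "read ok" || echo "no read access"
sudo -u nginx test -w /run/gunicorn.sock && echo "write ok" || echo "no write access"

Keep in mind that even with correct file permissions, SELinux can still block the connection – that is what the audit2allow step above takes care of.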

504 Gateway Time-out

upstream timed out (110: Connection timed out) while reading response header from upstream

In the file that defines the service – in this example gunicorn.service – add the following option to the gunicorn command in ExecStart:

--timeout 120

Remember to restart the service afterwards. For more details regarding this solution, check out this Stack Overflow post.


Capturing messages in Event Hubs to Blob Storage

This post builds on the post Streaming messages from Kafka to EventHub with MirrorMaker, where messages are streamed into Event Hubs; those messages are the data source for this post.

Messages in Event Hubs

Messages in Event Hubs are stored in Avro format. If we wish to capture them in Blob Storage, they will be stored in the same format – Avro. Apache Avro is a binary data serialization system.

The whole pipeline runs from the Scala script that produces the messages, through Kafka and Event Hubs, to the Blob Storage where they are stored.
Important: Avro is not a tool for capturing messages, but a file format!

Installing avro-tools

The simplest way to manipulate an Avro file is by using avro-tools. Download the avro-tools jar file.

wget https://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.9.0/avro-tools-1.9.0.jar -P /opt

Reading Avro file from Blob Storage

For the purpose of testing whether the messages produced in Kafka landed in Blob Storage, one file is manually downloaded and checked. It is easiest to copy the URL from Azure Storage accounts -> Storage Explorer -> BLOB CONTAINERS -> EVENTHUB_NAMESPACE. On the right side, drill down until the Avro file is visible.

Choose the file and click on the Copy URL button. Download the file to the local computer by executing the below command (alter the URL link accordingly):

wget https://STORAGE_ACC_NAME.blob.core.windows.net/STORAGE_CONTAINER_NAME/EVENTHUB_NAMESPACE/prod.test1/0/2019/08/07/15/17/19.avro -P /tmp

The file is saved to the /tmp folder and can now be read using avro-tools.
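
Before dumping the records, it can be useful to look at the schema that Event Hubs Capture writes. avro-tools offers a getschema command for that:

java -jar /opt/avro-tools-1.9.0.jar getschema /tmp/19.avro

The schema contains fields such as SequenceNumber, Offset, EnqueuedTimeUtc, SystemProperties, Properties and Body, which match the JSON output shown below.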

Comparing messages in Kafka with messages in Blob Storage

100 messages have been produced and distributed across 5 different topics in Kafka. If topic prod.test1 is taken as an example, the messages can be listed from the CLI by executing the following command:

$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic prod.test1 --from-beginning

The output (first three and last three messages):

This is message number 2
This is message number 32
This is message number 52
.
.
.
This is message number 95
This is message number 98
This is message number 100

The Avro file that corresponds to topic prod.test1 is checked by running the following command:

java -jar /opt/avro-tools-1.9.0.jar tojson /tmp/19.avro

It returns the following rows:

{"SequenceNumber":0,"Offset":"0","EnqueuedTimeUtc":"8/8/2019 9:02:31 AM","SystemProperties":{"x-opt-kafka-key":{"bytes":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001"}},"Properties":{},"Body":{"bytes":"This is message number 2"}}
{"SequenceNumber":1,"Offset":"560","EnqueuedTimeUtc":"8/8/2019 9:02:31 AM","SystemProperties":{"x-opt-kafka-key":{"bytes":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001"}},"Properties":{},"Body":{"bytes":"This is message number 32"}}
{"SequenceNumber":2,"Offset":"608","EnqueuedTimeUtc":"8/8/2019 9:02:31 AM","SystemProperties":{"x-opt-kafka-key":{"bytes":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001"}},"Properties":{},"Body":{"bytes":"This is message number 52"}}
.
.
.
{"SequenceNumber":8,"Offset":"896","EnqueuedTimeUtc":"8/8/2019 9:02:31 AM","SystemProperties":{"x-opt-kafka-key":{"bytes":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001"}},"Properties":{},"Body":{"bytes":"This is message number 95"}}
{"SequenceNumber":9,"Offset":"944","EnqueuedTimeUtc":"8/8/2019 9:02:31 AM","SystemProperties":{"x-opt-kafka-key":{"bytes":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001"}},"Properties":{},"Body":{"bytes":"This is message number 98"}}
{"SequenceNumber":10,"Offset":"992","EnqueuedTimeUtc":"8/8/2019 9:02:31 AM","SystemProperties":{"x-opt-kafka-key":{"bytes":"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001"}},"Properties":{},"Body":{"bytes":"This is message number 100"}}

The message numbers match. The messages that were initially produced in Kafka are now stored in Avro format in Azure Blob Storage.
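
The same check can be done programmatically. Below is a short sketch using the avro Python package (installed with pip install avro-python3 for Python 3 at the time of writing) that prints the message bodies from the downloaded file; the field names follow the tojson output above:

from avro.datafile import DataFileReader
from avro.io import DatumReader

# read the capture file downloaded from Blob Storage
reader = DataFileReader(open("/tmp/19.avro", "rb"), DatumReader())
for record in reader:
    body = record["Body"]
    # Body holds the original Kafka message as bytes
    if body is not None:
        print(body.decode("utf-8"))
reader.close()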

There are services in Azure that work on top of Event Hubs or Blob Storage. That topic is covered in the next blog post.

Streaming messages from Kafka to EventHub with MirrorMaker

Idea

The idea is to replicate messages from Apache Kafka to Azure Event Hubs using Kafka’s MirrorMaker.

Introduction

Apache Kafka has become one of the most popular open-source streaming and messaging tools. Many organizations have implemented it on premises or in a public cloud, and many are content with Kafka's performance and hesitant to migrate to a Kafka-like service in the cloud. One such service in Azure is Event Hubs, a "simple, secure and scalable real-time data ingestion" service.

LinkedIn developed Kafka and donated it to the open-source community in 2011. Microsoft cannot deny the popularity Kafka has gained and therefore offers the possibility to use the Kafka API to work against Event Hubs. This allows Kafka developers to continue using the Kafka APIs without disruption; only the configuration changes.

Architecture

MirrorMaker is run on the consumer side – Kafka, in this case. However, it is generally advised to run MirrorMaker on the producer side, which on this occasion would be a server in Azure.

For the sake of simplicity, 5 Kafka topics are defined: prod.test[1-5]. Kafka and MirrorMaker configurations are standard.

Consumer side: Kafka

First, a working Kafka is needed.

In the GitHub repository, cloud and local alternatives are available:

  • Provisioned to AWS using Terraform. This way you can create a Kafka cluster; the README.md files give more details about provisioning Kafka in AWS.
  • Created using a Docker container. This gives you a Kafka service suitable for development and testing on your local computer. Keep in mind that Kafka needs to be started manually inside the container by executing the script /opt/startall.sh.

Configuring server.properties

If Kafka in a Docker container is used, the server.properties file is very standard:

broker.id=0
log.dirs=/var/logs/kafka-logs
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
listeners=PLAINTEXT://localhost:9092
advertised.listeners=PLAINTEXT://localhost:9092
offsets.topic.replication.factor=1
advertised.host.name=127.0.0.1

With the above configuration, the value for the bootstrap-server parameter is localhost:9092. Alternatively, 127.0.0.1 can be used, since the advertised.host.name parameter is defined in the file.

Once either of the Kafka alternatives is up and running, test messages can be produced.

Message producer: Scala

Kafka should be up and running, and the DNS name of the Kafka server(s) – or localhost – is the input parameter when initializing an instance of the class.

The Scala script is executed from a client and produces numerous messages that are randomly assigned to the topics mentioned above. The client can be any server with a Scala installation, since DNS names are used to communicate with Kafka; all you need is to be able to reach Kafka's DNS names.

The script produces messages to one of the topics at random. Ten is the default number of messages produced, but this can be adjusted since it is the method's input parameter.

I am generating messages from the Scala CLI. Step into the folder where RandomMessage.scala is located and start Scala before executing the following commands:

:load RandomMessage.scala
val rm = new RandomMessage("localhost", "test") //second parameter is name of topic with prefix prod. and suffix [1-5]
rm.CreateMessages(10000) //10 is default parameter

Output should be something like the following:

Topic: prod.test1. Message: This is message number 9701
Topic: prod.test1. Message: This is message number 9702
Topic: prod.test2. Message: This is message number 9703
Topic: prod.test3. Message: This is message number 9704
Topic: prod.test1. Message: This is message number 9705
Topic: prod.test4. Message: This is message number 9706
Topic: prod.test4. Message: This is message number 9707
Topic: prod.test5. Message: This is message number 9708
Topic: prod.test1. Message: This is message number 9709
Topic: prod.test4. Message: This is message number 9710

This should create 5 topics in Kafka and add some messages to the topics.

As soon as the script is executed, it is possible to check in Kafka whether the topics have been created:

$KAFKA_HOME/bin/kafka-topics.sh --bootstrap-server 127.0.0.1:9092 --list

This should return a list of the available topics:

__consumer_offsets
prod.test0
prod.test1
prod.test2
prod.test3
prod.test4
prod.test5

Reading from one topic:

$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic prod.test3 --from-beginning

should return lines like these:

This is message number 6207
This is message number 6215
This is message number 6217
This is message number 6219
This is message number 6221

The topics and the messages are now “sitting” in Kafka. The next step is to get them into Azure Event Hubs.

Producer side: Event Hubs

Kafka's MirrorMaker replicates the messages from Kafka to Event Hubs. This post does not go into the details of MirrorMaker configuration; it just prepares the configuration files needed to produce a working example.

In this repository, Terraform is used to provision Event Hubs. If you already have an existing Event Hubs namespace, that works too; just make sure it does not contain event hubs that match the names of the topics used in this proof of concept.

Configuration files

There are two files needed for MirrorMaker to work: one for the consumer side and one for the producer side (for a better illustration, check the graphic at the top).

Below is an example of consumer.config file for Kafka running locally:

bootstrap.servers=localhost:9092
group.id=example-mirrormaker-group
exclude.internal.topics=true
client.id=mirror_maker_consumer
partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
auto.offset.reset=earliest

The last parameter makes the consumer start from the earliest offset, so all existing messages are fetched from Kafka. This is good for testing purposes, because you can verify that replication works without needing a live producer writing to Kafka. It does, however, respect committed offsets like any other Kafka consumer.

To configure producer.config, what is needed from Event Hubs is the Connection string–primary key, which can be found in the settings of the Event Hubs namespace, under Shared access policies. Clicking on the policy opens the connection strings on the right side. Copy the string. Below is an example of the file.

Replace NAMESPACE with the unique namespace of your choice. Replace CONNECTION_STRING_PRIMARY_KEY with the string from Event Hubs.

bootstrap.servers=NAMESPACE.servicebus.windows.net:9093
client.id=mirror_maker_producer
sasl.mechanism=PLAIN
security.protocol=SASL_SSL
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="CONNECTION_STRING_PRIMARY_KEY";

Now that the configuration files are in place, the MirrorMaker can be started:

$KAFKA_HOME/bin/kafka-mirror-maker.sh --consumer.config $KAFKA_HOME/config/consumer.config --num.streams 1 --producer.config $KAFKA_HOME/config/producer.config --whitelist="prod.*"

The last parameter defines which topics should be replicated to Event Hubs. The Scala example above generated 10,000 messages to Kafka, so 10,000 messages are also expected in Event Hubs.
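
To double-check from the command line, the same Kafka console consumer can be pointed at the Event Hubs Kafka endpoint. The SASL settings are the same as in producer.config; here they are assumed to be saved in a hypothetical file eventhub.config:

$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server NAMESPACE.servicebus.windows.net:9093 --topic prod.test1 --from-beginning --consumer.config $KAFKA_HOME/config/eventhub.config

The file needs the same security.protocol, sasl.mechanism and sasl.jaas.config lines as producer.config, and the command should list the replicated messages.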

In Azure, the metrics chart makes it obvious that all messages have been consumed by Event Hubs – the blue line. The chart can be interpreted as all messages having been consumed within the same minute.

Note: the green colour represents messages that were captured – saved to storage available in Azure. This is out of scope for this post.

The messages are now in Event Hubs. It is up to the retention parameter how long they will be available there. In the repository above, the value is set to 1 – one day.

Next step

The messages are in the Event Hubs now. It would make sense to save them to a permanent storage so that they can be used for analysis. This is covered in this blog post.