Using Python 3 with Apache Spark on CentOS 7 with the help of virtualenv

Python 2.7.5 ships with CentOS 7, and more and more software on CentOS 7 either no longer supports Python 2 or recommends using Python 3.

Alternative solutions

I have come across the same challenge, and earlier my two approaches to solving it were:

1. installing Python 3.6 from Software Collections (rh-python36) next to the system Python, and
2. pointing the python symbolic link at a Python 3 interpreter.

The first option has proven to be confusing because one has to enable the Python 3 version by executing

scl enable rh-python36 bash

which can easily lead to confusion about whether Python 2 or Python 3 is being used at a given moment.
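
A quick comparison makes the confusion obvious: outside the subshell the system Python answers, inside it the Software Collections Python does, and leaving the subshell silently drops you back to Python 2:

python --version              # Python 2.7.5, the system interpreter
scl enable rh-python36 bash   # opens a new bash with a modified PATH
python --version              # now the Python 3.6 interpreter from the collection
exit                          # back in the original shell, python is 2.7.5 again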

The second option is even worse – it gives you a false sense of how easy it is to just change a symbolic link, but problems start occurring as soon as you do anything even slightly more serious with Python – for example: pip will crash, and fixing that is a world of pain. Don’t go there.
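
The reason it goes so wrong is that core CentOS 7 tooling, yum included, consists of Python 2 scripts that run whatever /usr/bin/python points to, so repointing that link breaks them. The shebang check below illustrates this (assuming a stock CentOS 7 install):

head -n 1 /usr/bin/yum   # prints #!/usr/bin/python, i.e. the very symbolic link you just changed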

pip or pip3

Python 2 uses pip, Python 3 uses pip3 – or does it? A simple rule of thumb: pip == pip3. I was in a dilemma about the two pips until I read about this “equation” in a blog post, and I follow it in the solution below. Ansible solves this issue in its own way, as you will soon see.
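
The quickest way to see which interpreter a given pip belongs to is to ask it directly – every pip prints the Python it installs packages for. Inside a Python 3 virtual environment the plain pip is already the Python 3 one, which is the “pip == pip3” rule in practice. The paths below are only illustrative:

pip --version                       # prints which Python this pip installs for (often Python 2 on a stock CentOS 7)
pip3 --version                      # the Python 3 pip installed alongside python3
/path/to/virtualenv/bin/pip --version   # inside a Python 3 virtualenv, pip always targets Python 3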

Preparing virtualenv

The whole idea is very simple: create a virtual environment that uses Python 3, then either use the virtual environment itself or point directly to the files installed in it (the python and pip executables are probably the most useful). Below is an Ansible code snippet showing how to install virtualenv, create a virtual environment and install some pip packages into it.

# install the virtualenv package with the system Python 3 pip
- name: Pip install virtualenv
  pip:
    name: virtualenv
    executable: pip3

# create the virtual environment in the spark user's home directory
- name: Create virtualenv
  shell: /usr/local/bin/virtualenv /home/spark/sparkenv
  become: true
  become_user: spark

# install the Python packages needed by the Spark jobs into the virtual environment
- name: Pip install in virtualenv
  shell: /home/spark/sparkenv/bin/pip install numpy pandas
  become: true
  become_user: spark
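
For reference, and for readers not provisioning with Ansible, the three tasks above boil down to roughly these commands run on the node (the spark user and the /home/spark/sparkenv path are simply the values used in this example):

pip3 install virtualenv                                            # as root: install virtualenv with the Python 3 pip
sudo -u spark /usr/local/bin/virtualenv /home/spark/sparkenv       # create the environment as the spark user
sudo -u spark /home/spark/sparkenv/bin/pip install numpy pandas    # install packages into the environment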

The whole task can be found here.

The virtual environment is now created and its Python 3 path can be used for Spark. Other Python modules can be added in the last task, just as numpy and pandas are.
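
A quick sanity check on the node shows what Spark will actually get:

/home/spark/sparkenv/bin/python --version   # should report a Python 3.x interpreter
/home/spark/sparkenv/bin/pip list           # should list numpy and pandas among the packages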

Spark and Python

It is recommended to tell Spark where the Python executable is. This is done in spark-env.sh, which resides in $SPARK_HOME/conf. Below is an example telling Spark to use the Python executable from the folder where the virtual environment sparkenv was created.

PYSPARK_PYTHON=/home/spark/sparkenv/bin/python

The whole spark-env.sh template file (Jinja2) can be found here.

With that in place, running pyspark is going to use Python 3.
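
A simple way to verify this is to start the pyspark shell and ask the SparkContext which Python it is running on; the small map job also checks the Python version on the executor side:

$SPARK_HOME/bin/pyspark
>>> sc.pythonVer                                                    # e.g. '3.6'
>>> import sys
>>> sc.parallelize(range(1)).map(lambda x: sys.version).first()    # version reported by an executor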

Here is the link to my repository with code that automates the provisioning of a Spark cluster in AWS using Terraform and Consul. The repository uses the solution described above.