Python 2.7.5 ships with CentOS 7, and more and more software on CentOS 7 either no longer supports Python 2 or recommends using Python 3.
Alternative solutions
I have come across this challenge myself, and my two earlier approaches to solving it were:
- using Software Collections (SCL) – https://linuxize.com/post/how-to-install-python-3-on-centos-7/
- changing the symbolic link /usr/bin/python to point to Python 3 (unfortunately, I have no link to share for this one)
The first option has proven to be confusing because one has to enable the Python 3 version by executing
scl enable rh-python36 bash
which makes it easy to lose track of whether Python 2 or Python 3 is being used at any given moment.
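For illustration, a typical session with the SCL approach looks roughly like this (assuming the rh-python36 collection from the link above is installed):

python --version                 # Python 2.7.5, the system default
scl enable rh-python36 bash      # starts a new shell with the collection enabled
python --version                 # now reports the SCL Python 3, but only inside this shell
exit                             # back in the parent shell, python is 2.7.5 again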
The second option is even worse. It gives you a false sense of how easy it is to just change a symbolic link, but problems start as soon as you do anything even slightly more serious with Python: pip, for example, will crash, and fixing that is a world of pain. Don't go there.
pip or pip3
Python 2 uses pip and Python 3 uses pip3, or does it? A simple rule of thumb: pip == pip3. I was in a dilemma about the two pips until I read about this “equation” on some blog, and I follow it in the solution below. Ansible solves this issue in its own way, as you will soon see.
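To make the “equation” a bit more concrete: on a host where Python 3 is installed, pip3 belongs to Python 3, and inside a Python 3 virtual environment (such as the one created in the next section) pip and pip3 point to the same Python 3 pip. A quick check, with paths matching the setup used later in this post:

pip3 --version                              # the pip that belongs to the system Python 3
/home/spark/sparkenv/bin/pip --version      # inside the virtualenv, pip ...
/home/spark/sparkenv/bin/pip3 --version     # ... and pip3 are one and the same Python 3 pip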
Preparing virtualenv
The whole idea is very simple: create a virtual environment that uses Python 3, then use either the virtual environment itself or the files installed in it (the Python 3 and pip executables are probably the most useful). Below is an Ansible snippet that installs virtualenv, creates a virtual environment and installs some pip packages into it.
- name: Pip install virtualenv
  pip:
    name: virtualenv
    executable: pip3

- name: Create virtualenv
  shell: /usr/local/bin/virtualenv /home/spark/sparkenv
  become: true
  become_user: spark

- name: Pip install in virtualenv
  shell: /home/spark/sparkenv/bin/pip install numpy pandas
  become: yes
  become_user: spark
The whole task can be found here.
The virtual environment is now created, and its Python 3 path can be used for Spark. Other Python modules can be added in the last task, just like numpy and pandas are.
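If you want to verify the result by hand on a node, the following commands (the paths match the Ansible tasks above) should confirm that the environment really runs Python 3 and has the packages installed:

/home/spark/sparkenv/bin/python --version   # should report a Python 3 version
/home/spark/sparkenv/bin/pip list           # numpy and pandas should appear in the list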
Spark and Python
It is recommended to tell Spark where the Python executable is. This is done in spark-env.sh, which resides in $SPARK_HOME/conf. Below is an example telling Spark to use the Python executable from the folder where the virtual environment sparkenv was created.
PYSPARK_PYTHON=/home/spark/sparkenv/bin/python
The whole spark-env.sh template file (Jinja2) can be found here.
With that in place, running pyspark is going to use Python 3.
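A quick way to double-check is to start the pyspark shell and ask it which interpreter it is running on; it should point to the virtual environment:

pyspark
>>> import sys
>>> print(sys.executable)      # should print /home/spark/sparkenv/bin/python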
Here is the link to my repository with code that automates provisioning of a Spark cluster in AWS using Terraform and Consul. The repository uses the solution described above.