
This lab is devoted to the basic principles of reproducible and isolated development. You will see how to ensure that a project – which may require the installation of many dependencies – can be set up without modifying anything system-wide on your machine.

Do not forget that the Before class reading is mandatory and there is a quiz that you are supposed to complete before coming to the labs.

Before class reading

During the previous lab, we showed that the preferred way of installing applications (and libraries and data files) on Linux is via the package manager. It installs the application for all users, it allows system-wide upgrades, and it generally keeps your system in a much cleaner state.

However, system-wide installation may not always be suitable. A typical example is project-specific dependencies. These are often not installed system-wide, mainly for the following reasons:

  • You need different versions of dependencies for different projects.
  • You do not want to remember to uninstall them when you stop working on the project.
  • You want to control when you upgrade them: an upgrade of the OS should not affect your project.
  • The versions you need are different from those available through the package manager.
  • Or they may not be packaged at all.

For the above reasons, it is much better to create a project-specific installation that is better isolated from the system. Note that installing the dependency per-user (i.e., somewhere into $HOME) may not provide the isolation you wish to achieve.

Such an approach is supported by most reasonable programming languages and can usually be found under names such as virtual environment, local repository, sandbox or similar (note that the concepts do not map 1:1 across languages and tools, but the general idea remains the same).

With a virtual environment, your dependencies are usually installed into a specific directory inside your project, kept outside version control. The compiler/interpreter is then told to use this location.

The directory-local installation then keeps your system clean. It also allows working on multiple projects with incompatible dependencies, because they are completely isolated.

The installation directory is rarely committed to your Git repository. Instead, you commit a configuration file that specifies how to prepare the environment. Each developer can then recreate the environment without polluting the main repository with distribution-specific or even OS-dependent files. Yet the configuration file ensures that all developers will be working in the same environment (i.e., same versions of all the dependencies).

It also means that new members of software teams can easily set up their environment using the provided configuration file.

Dependency installation

Inside the virtual environment, you usually do not use generic package managers (such as DNF). Instead, dependencies are installed using language-specific package managers.

These are usually cross-platform and use their own software repository. Such a repository hosts only libraries for that particular language. Again, there can be multiple such repositories, and it is up to the developers how they configure their project.

Technically, language-specific package managers can also install the packages system-wide, competing with distribution-specific package managers. It is up to the administrator to handle this reasonably. This usually involves defining a clear boundary between areas maintained by the distribution-specific manager and those maintained by the language-specific ones.

In our scenario, the language-specific managers would install only into the virtual environment directory without ever touching the system itself.

Installation directories

On a typical Linux system, there are multiple places where software can be installed:

  • /usr – system packages handled by the distribution’s package manager
  • /usr/local – software installed locally by the administrator; language-specific managers usually install system-wide packages there
  • /opt/$PACKAGE – large packages installed outside the distribution’s package manager often live in their own sub-directory inside /opt.
  • $HOME (usually /home/$USER/) – language-specific managers run by non-root users can install packages locally to their home directory (to language-specific sub-directories).
  • $HOME/.local is a favourite place for local installations; it generally mirrors /usr/local but for a single user only (executables are then placed inside $HOME/.local/bin) – see the example after this list
  • per-project virtual environments
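
As a quick illustration of the per-user option, running pip with the --user flag (outside any virtual environment) installs into $HOME/.local; cowsay is just an arbitrary example package here:

pip install --user cowsay
ls ~/.local/bin    # the cowsay executable should appear here

We will argue below that per-project virtual environments are usually a better choice than even this kind of per-user installation.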

Python Package Index (PyPI)

The rest of the text will focus mostly on Python tools supporting the above-mentioned principles. Similar tools are available for other languages, but we believe that demonstrating them on Python is sufficient to understand the principles in practice.

Python has a repository called the Python Package Index (PyPI) where anyone can publish their Python programs and/or libraries.

The repository can be used through a web browser, but also through a command-line client called pip.

pip behaves rather similarly to DNF. You can use it to install, upgrade, or uninstall Python modules.

When run with superuser privileges, it is able to install packages system-wide. Do not use it like that unless you know what you are doing and you understand the consequences.

Issues of trust

In your distribution’s upstream package repository, all packages typically have to be reviewed by someone from the distribution’s security team. This is sadly not true for PyPI and similar repositories. That said, you as a developer must be more cautious when installing from such sources.

Not all packages do what they claim to. Some are just innocently buggy, but some are outright malicious. Re-using other people’s code is generally a good practice, but you should give a thought to the trustworthiness of the author. After all, the code will be executed under your account either when you run your program or as a part of the installation process.

In particular, criminals like to publish malicious packages whose names differ from a well-known package by a single typo. This is called typosquatting. You can read more for example in this blogpost; searching the web will yield more results.

On the other hand, many PyPI packages are also available as packages for your distribution (feel free to try dnf search python3- on your Fedora box). Hence, they were probably reviewed by distribution maintainers and are probably safe to use. For packages not available in your distribution natively, always look for the tell-tale signs that distinguish a normal project from a malicious one: popularity of the source code repository, user activity, reactions to bug reports, documentation quality, and so on.

Recall that modern software is rarely built from scratch. Do not be afraid to explore what is available. Check it. And use it :-).

Typical workflow practically

While the actual tools differ across programming languages, the general steps for developing a project in some kind of sandbox are the same.

  1. The developer clones the project (e.g., from a Git repository).
  2. The sandbox (virtual environment) is initialized. Usually this means that a new directory with a fresh language environment is created.
  3. The virtual environment must be activated. Often the virtual environment needs to modify $PATH (or rather some language-specific variant of such path that is used to search for libraries or modules), so the developer must source (or .) some activation script that modifies the path.
  4. Then the developer can install dependencies of the project. They are usually stored in a file that can be passed to the package manager (of the given programming language).
  5. Only now can the developer actually work on the project. The project is fully isolated: removing the virtual environment directory removes all traces of the installed packages. (A concrete sketch of these steps for Python follows this list.)
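
For Python – the language we will use for the rest of this lab – the above steps might look roughly like this (the repository URL is purely illustrative; each command is explained in detail later in this text):

git clone git@gitlab.example.com:user/project.git   # step 1
cd project
python -m venv my-venv                              # step 2: initialize the sandbox
source my-venv/bin/activate                         # step 3: activate it
pip install -r requirements.txt                     # step 4: install the dependencies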

Everyday work then often involves only step 3 (some kind of activation) and step 5 (actual development).

Note that activation of the virtual environment typically removes access to libraries installed globally. That is, inside the virtual environment, the developer starts with a fresh and clean environment with a bare compiler. That is actually a very sane decision as it ensures that system-wide installation does not affect the project-specific environment.

In other words, it improves the reproducibility of the whole setup. It also means that the developer needs to specify every dependency in the configuration file, even dependencies that are usually present everywhere.

Before class quiz

The quiz file is available in the 11 folder of this GitLab project.

Copy the correct language version into your project as 11/before.md (i.e., you will need to rename the file).

The questions and answers are part of that file; fill in the answers between the **[A1]** and **[/A1]** markers.

The before-11 pipeline on GitLab will test that your answers are in the correct format. It does not check for actual correctness (for obvious reasons).

Submit your before-class quiz before the start of lab 11.

Virtual environment for Python (a.k.a. virtualenv or venv)

To try installing Python packages safely, we will first set up a virtual environment for our project. Fortunately, Python has built-in support for creating virtual environments.

We will demonstrate this on the following example:

#!/usr/bin/env python3

import sys
import dateparser


def main():
    input_date = ' '.join(sys.argv[1:])
    if input_date == '':
        input_date = 'now'

    date = dateparser.parse(input_date)
    if not date:
        print(f"Invalid date specification (`{input_date}').", file=sys.stderr)
        sys.exit(1)

    print(date.strftime('%Y-%m-%dT%H:%M:%S'))


if __name__ == '__main__':
    main()

Save this snippet into timestamp2iso.py and set the executable bit. Note that dateparser.parse() is able to parse various time specifications into the native Python date format. The time specification can even be text such as three days ago.

Make sure you understand the whole program before continuing.

Try running the timestamp2iso.py program.

Unless you have already installed the python3-dateparser package system-wide, it should fail with ModuleNotFoundError: No module named 'dateparser'. The chances are that you do not have that module installed.

If you have installed python3-dateparser, uninstall it now and try again (just for this demo). But double-check that you would not remove some other program that requires it.

We could now install python3-dateparser with DNF, but we have already described why installing project dependencies system-wide is a bad idea. We could also install it with pip globally, but that is not the best course of action either.

Instead, we will create a new virtual environment for it.

python -m venv my-venv

The above command creates a new directory my-venv that contains a bare installation of Python. Feel free to investigate the contents of this directory.
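
For instance, on a typical Linux system the contents will look something like this (details vary with your Python version):

$ ls my-venv
bin  include  lib  lib64  pyvenv.cfg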

We now need to activate the environment.

source my-venv/bin/activate

Your prompt should have changed: it is prefixed by (my-venv) now.

Running timestamp2iso.py will still terminate with ModuleNotFoundError.

We will now install the dependency:

pip install dateparser

This will take some time as Python will also download transitive dependencies of this library (and their dependencies etc.). Once the installation finishes, run timestamp2iso.py again.

This time, it should work.

./timestamp2iso.py three days ago

Once we are finished with the development, we can deactivate the environment by calling deactivate (this time, without sourcing anything).

Running timestamp2iso.py outside the environment shall again terminate with ModuleNotFoundError.

How does it work?

A Python virtual environment uses two tricks in its implementation.

First, the activate script prepends the my-venv/bin directory to $PATH. That means that calling python will prefer the application from the virtualenv’s directory (e.g., my-venv/bin/python).

Try this yourself: print $PATH before and after you activate a virtualenv.
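
The effect might look like this (paths shortened for brevity; yours will differ):

$ echo $PATH
/usr/local/bin:/usr/bin:/bin
$ source my-venv/bin/activate
(my-venv) $ echo $PATH
/home/user/project/my-venv/bin:/usr/local/bin:/usr/bin:/bin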

This also explains why we should always specify /usr/bin/env python in the shebang instead of /usr/bin/python.

You can also view the activate script and see how this is implemented. Note that deactivate is actually a function.

Why is the activate script not executable? Hint.

The second trick is that Python searches for modules (i.e., for files implementing an imported module) relative to the path of the python binary. Hence, when python is inside my-venv/bin, Python will look for the modules inside my-venv/lib. That is the location where your locally installed files will be placed.

You can check this by executing the following one-liner that prints Python search directories (again, before and after activation):

python -c 'import sys; print(sys.path)'

This behaviour is actually not hard-wired in the Python interpreter. When Python starts up, it automatically imports a module called site. This module contains site-specific setup: it adjusts sys.path to include all directories where your distribution installs Python modules. It also detects virtual environments by looking for the pyvenv.cfg file in the grandparent directory of the python binary. In our case, this configuration file contains include-system-site-packages=false, which tells the site module to skip distribution’s module directories. You can see that the principle is very simple and the interpreter itself needs to know nothing about virtual environments.
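
For illustration, the pyvenv.cfg of our my-venv could look something like this (the exact values depend on your system and Python version):

home = /usr/bin
include-system-site-packages = false
version = 3.10.4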

Installing Python-specific packages with pip

pip vs. python -m pip?

Generally, it is recommended to use python -m pip rather than raw pip. The reasons behind these additional 10 keystrokes are well described in Why you should use python -m pip. However, to make the following text more readable, we will use the shorter pip variant.

We have already seen one usage of pip in practice, but pip can do much more. A nice walkthrough of all pip capabilities can be found in Using Python’s pip to Manage Your Projects’ Dependencies.

Here we provide a brief summary of the most important concepts and commands.

By default, pip install searches the PyPI package registry to install the package specified on the command line. We would not be far from the truth by saying that all packages in this registry are just archived directories containing Python source code organized in a prescribed way.

If you would like to change this default package registry, you can use the --index-url argument.
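
For example, installing from a different (here completely made-up) registry could look like this:

pip install --index-url https://pypi.example.com/simple/ pkgname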

As you are already familiar with GitLab, you could be interested in GitLab PyPI Package Registry Support.

In a later section, we will learn how to turn a directory with code into a proper Python package. Assuming that we have already done so, we can install that package directly (without archiving/packing) by running pip install /path/to/python_package.

For example, imagine a situation where you are interested in a third-party open-source package. This package is available in a remote Git repository (typically on GitHub or GitLab), but it is NOT packed and published in PyPI. You can simply clone the repository and run pip install .. However, thanks to pip VCS Support, you can avoid the cloning phase and install the package directly with:

pip install git+https://git.example.com/MyProject

To upgrade a specific package, run pip install --upgrade [packages].

Finally, to remove a package, run pip uninstall [packages].
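
Tying this back to our earlier demo, the corresponding commands might look like this:

pip install --upgrade dateparser
pip uninstall dateparser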

Dependency versioning

We have already mentioned Semantic Versioning 2.0.0. Python uses more or less compatible versioning, which is described in PEP 440 – Version Identification and Dependency Specification.

When you install dependencies from package registry, you can specify this version.

pkgname          # latest version
pkgname == 4.2   # specific version
pkgname >= 4.2   # minimal version
pkgname ~= 4.2   # equivalent to >= 4.2, == 4.*

In fact, a version specifier consists of a series of version clauses separated by commas. Therefore, you can write:

pkgname >= 1.0, != 1.3.4.*, < 2.0

Freezing dependencies

Sometimes it is helpful to save a list of all currently installed packages (including transitive dependencies). For example, you have recently noticed a new bug in your project and would like to keep a record of the precise versions of the currently installed dependencies, so that your co-worker can reproduce it.

In order to do that, it is possible to use pip freeze and create a list that sets specific versions, ensuring the same environment for every developer.

It is recommended to store these in a requirements.txt file.

# Generating the requirements file
pip freeze > requirements.txt

# Installing packages from it
pip install -r requirements.txt
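
The generated requirements.txt pins every installed package (including transitive dependencies) to an exact version, one per line. For our demo project it might contain lines similar to these (the version numbers are illustrative):

dateparser==1.1.1
python-dateutil==2.8.2
pytz==2022.1
regex==2022.3.15
tzlocal==4.2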

Packaging Python Projects

Let’s say that you came up with a super cool algorithm and you want to enrich the world by sharing it. The official Python documentation offers a step-by-step tutorial on how to achieve that.

In the following text, we are going to use setuptools for building Python projects. Historically, this was the only option for building a Python package. Recently, Python developers decided to open the gates to alternatives, so you may also build a Python package with Poetry, flit, or others. The description of these tools is out of the scope of this course.

Python Package Directory Structure

The very first step, before you can publish a package, is to transform your code into a proper Python package. We need two files called pyproject.toml and setup.cfg. These files contain information about the project, a list of dependencies, and also information for project installation.

Not long ago, it was usual to have a setup.py script rather than setup.cfg and pyproject.toml. Therefore, in many repositories/tutorials you can still find usage of it. The content is more or less 1:1, but there are certain cases in which you are forced to use setup.py. Fortunately, this is not applicable to our use case, so we have decided to describe the modern variant with static configuration files.
As is written in the setuptools Quickstart, since version 61.0.0, setuptools offers experimental support for having only a pyproject.toml. This approach is also used by Poetry, but in the following text we will stay with the stable combination of setup.cfg and pyproject.toml.

In timestamp2iso you can find a Python package with the same functionality as our previous timestamp2iso.py script.

Please study carefully the directory structure as well as the content of setup.cfg.
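
To give you an idea of what to look for, a minimal setup.cfg/pyproject.toml pair might look roughly like the following sketch (the entry point and dependency list here are illustrative; the authoritative version is the one in the repository):

# pyproject.toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

# setup.cfg
[metadata]
name = matfyz-nswi177-timestamp2iso
version = 0.0.1

[options]
packages = find:
install_requires =
    dateparser

[options.entry_points]
console_scripts =
    timestamp2iso = timestamp2iso.main:main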

One may notice that the necessary dependencies are duplicated in setup.cfg and in requirements.txt. This is actually not a mistake. In setup.cfg you should use the most relaxed version constraints possible, whereas in requirements.txt we need to pin all dependencies to precise versions. requirements.txt also contains the transitive dependencies, which should NOT be present in setup.cfg.

For more details see install_requires vs requirements file.

Try to install this package via VCS support with the following command:

pip install git+http://gitlab.mff.cuni.cz/teaching/nswi177/2022/common/timestamp2iso.git

You perhaps noticed that setup.cfg contains the section [options.entry_points]. This section specifies the actual scripts of your project. Note that after running the above command, you can execute the timestamp2iso command directly. pip created a wrapper script for you and added it to the sandbox’s $PATH.

timestamp2iso three days ago

Now uninstall the package with:

pip uninstall matfyz-nswi177-timestamp2iso

Clone the repository to your local machine and change directory into it. Now run:

pip install -e .

pip install -e produces an editable installation for easy debugging. Instead of copying your code to the virtual environment, it installs only a symlink-like thing (actually, a timestamp2iso.egg-link file which has a similar effect on Python’s mechanism for finding modules) referring to the directory with your source files.

Add some nice prefix just before the ISO print statement and run timestamp2iso three days ago again. The change takes effect immediately, without reinstalling the package.

Building Python Package

Now that we have the proper directory structure, we are only two steps away from publishing the package to a package registry.

Now we prepare distribution packages for our code. First, we install the build package by invoking pip install build. Then we can run

python -m build

Two files are created in the dist subdirectory:

  • matfyz-nswi177-timestamp2iso-0.0.1.tar.gz – a source code archive

  • matfyz_nswi177_timestamp2iso-0.0.1-py3-none-any.whl – a wheel file, which is the built package (py3 is the Python version required, none and any tell that this is a platform-independent package).

Note that a wheel file is nothing more than a simple Zip archive.

$ file dist/matfyz_nswi177_timestamp2iso-0.0.1-py3-none-any.whl
dist/matfyz_nswi177_timestamp2iso-0.0.1-py3-none-any.whl: Zip archive data, at least v2.0 to extract, compression method=deflate

$ unzip -l dist/matfyz_nswi177_timestamp2iso-0.0.1-py3-none-any.whl
Archive:  dist/matfyz_nswi177_timestamp2iso-0.0.1-py3-none-any.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
       57  04-23-2022 17:26   timestamp2iso/__init__.py
      597  04-23-2022 17:33   timestamp2iso/main.py
      402  04-23-2022 17:31   timestamp2iso/timestamp2iso.py
     1067  04-23-2022 19:36   matfyz_nswi177_timestamp2iso-0.0.1.dist-info/LICENSE
      917  04-23-2022 19:36   matfyz_nswi177_timestamp2iso-0.0.1.dist-info/METADATA
       92  04-23-2022 19:36   matfyz_nswi177_timestamp2iso-0.0.1.dist-info/WHEEL
       58  04-23-2022 19:36   matfyz_nswi177_timestamp2iso-0.0.1.dist-info/entry_points.txt
       14  04-23-2022 19:36   matfyz_nswi177_timestamp2iso-0.0.1.dist-info/top_level.txt
      849  04-23-2022 19:36   matfyz_nswi177_timestamp2iso-0.0.1.dist-info/RECORD
---------                     -------
     4053                     9 files

You may wonder why there are two archives with very similar content. The answer can be found in What Are Python Wheels and Why Should You Care?.

You can now switch to a different virtualenv and install the package using pip install package.whl.
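
For example (assuming my-venv is still active from before):

deactivate
python -m venv other-venv
source other-venv/bin/activate
pip install dist/matfyz_nswi177_timestamp2iso-0.0.1-py3-none-any.whl
timestamp2iso three days ago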

Publishing Python Package

If you think that the package could be useful to other people, you can publish it in the Python Package Index. This is usually accomplished using the twine tool. The precise steps are described in Uploading the distribution archives.

Creating distribution packages (e.g. for DNF)

While creating the project files may seem to complicate things a lot, it actually saves time in the long run.

Virtually any Python developer would be now able to install your program and have a clear starting point when investigating other details.

Note that if you have installed some program via DNF system-wide and that program was written in Python, somewhere inside it there was a setup.cfg (or setup.py) that looked very similar to the one you have just seen. Only instead of installing the script into your virtual environment, it was installed globally.

There is really no other magic behind it.

For example, Ranger is written in Python, and this script describes its installation (it is a script for creating packages for DNF). Note that %py3_install is a macro that actually calls setup.py install.

Higher-level tools

We can think of pip and virtualenv as low-level tools. However, there are also tools that combine both of them and bring more comfort to package management. In Python, there are at least two popular choices, namely Poetry and Pipenv.

Internally, these tools use pip and venv, so you are still able to have independent working spaces as well as the possibility to install a specific package from the Python Package Index (PyPI).

The complete introduction of these tools is out of the scope for this course. Generally, they follow the same principles, but they add some extra functions that are nice to have. Briefly, the major differences are:

  • They can freeze specific versions of dependencies, so that the project builds the same on all machines (using a lock file such as poetry.lock).
  • Packages can be removed together with their dependencies.
  • It is easier to initialize a new project.

Other languages

Other languages have their own tools with similar functions: for example, npm or Yarn for JavaScript, Cargo for Rust, Bundler for Ruby, and Maven or Gradle for Java.

Exercise

Set up the program from the examples repository (11/last_commit) as a proper Python project.

Graded tasks (deadline: May 12)

11/tapsum2json (100 points)

Write a program that produces a summary of TAP results in JSON format.

TAP – or Test Anything Protocol – is a universal format for test results. It is used by BATS and the GitLab pipeline, too.

1..4
ok 1 One
ok 2 Two
ok 3 Three
not ok 4 Four
#
# -- Report --
# filename:77:26: note: Something is wrong here.
# --
#

Your program will accept a list of arguments – filenames – and read them using an appropriate consumer. Each of the files will be a standalone TAP result (i.e., what BATS produces with -t). Nonexistent files will be skipped and recorded as having zero tests.

The program will then print a summary of the tests in the following format.

{
  "summary": [
    {
      "filename": "filename1.tap",
      "total": 12,
      "passed": 8,
      "skipped": 3,
      "failed": 1
    },
    {
      ...
    }
  ]
}

You have to use a library for reading TAP files: tap.py is certainly a good option, but feel free to find a better alternative.

Your solution must contain a pyproject.toml, setup.cfg and requirements.txt with a list of library dependencies that can be passed to pip install. Your solution must also be installable via setup.cfg and create a tapsum2json executable in the $PATH. This is mandatory as we will test your solution like this (see the tests for details).

Save your implementation into 11/tapsum2json subdirectory.

If you want to run the automated tests on your machine, you need to have the json_reformat utility from the yajl DNF package (sudo dnf install yajl). The tests reformat the JSON output to allow easy visual inspection of the result. We do not require you to format JSON by yourself, although passing indent=True to json.dump certainly helps debugging.

Learning outcomes

Conceptual knowledge

Conceptual knowledge is about understanding the meaning of the given terms and putting them into context. Therefore, you should be able to …

  • explain what are dependencies (in the sense of requirements)

  • explain why installing project dependencies system-wide may not work for multiple projects

  • explain how sandboxing works (from a high-level point of view)

  • explain the pros and cons of specifying only top-level dependencies vs. transitive ones too, and of specifying exact versions vs. minimal requirements

Practical skills

Practical skills are usually about using given programs to solve various tasks. Therefore, you should be able to …

  • create a new virtual environment (for Python)

  • activate and deactivate an existing virtual environment

  • run/test a Python project using virtualenv (with setup.cfg and pyproject.toml)

  • install a Python project that uses setup.cfg and pyproject.toml

  • install new project dependencies

  • update list of dependencies

  • configure project for installation (optional)