Introduction

This document is a guide to getting started with High Performance Computing (HPC) clusters, and more particularly Uninett Sigma2's clusters. It aims to teach the basic commands for navigating in the terminal of an HPC cluster, to help you run a job script, and to show how to manage files, such as copying files from your local directory to the HPC cluster and using files that are stored remotely.

This document is the result of Strategic Funding from the Norwegian Institute for Nature Research (NINA). Please note that the document is not static and is constantly subject to improvement based on feedback. You can contribute to improving the document by opening issues on the GitHub repository of this project.

Authors

The document has been written by Benjamin Cretois with the help of Francesco Frassinelli.

What is HPC and why use it?

High performance computing (HPC) generally refers to processing complex calculations at high speeds across multiple servers in parallel. Those groups of servers are known as clusters and are composed of hundreds or even thousands of compute servers that have been connected through a network.

With the increased use of technologies like the Internet of Things (IoT), artificial intelligence (AI), and machine learning (ML), organizations are producing huge quantities of data, and they need to be able to process and use that data quickly, often in real time. To power the analysis of such large datasets it is often more practical, or even mandatory, to make use of supercomputing. Supercomputing enables fast processing of data and training of complex algorithms such as neural networks for image and sound recognition.

In Chapter 1 we explore the basics of HPC and we will learn more in depth about UNINETT Sigma2, a Norwegian infrastructure providing large scale computational resources.

Basic vocabulary

Cluster

An HPC cluster is a collection of many separate servers (computers), called nodes, which are connected via a fast interconnect.

Node

HPC clusters are composed of:

  • a headnode, where users log in
  • a specialized data transfer node
  • regular compute nodes (where the majority of computations are run)
  • GPU nodes (on these nodes computations can be run both on CPU cores and on a Graphical Processing Unit)

All cluster nodes have the same components as a laptop or desktop: CPU cores, memory and disk space. The difference between a personal computer and a cluster node lies in the quantity, quality and power of those components.

Synthesis

graph TB;

G[User] --> |ssh| A

subgraph HPC

subgraph Cluster one
A[headnode 1] --> B[Worker node 1]
A --> C[Worker node 2]
end

subgraph Cluster two
D[headnode 2] --> E[Worker node 1]
D --> F[Worker node 2]
end

H[Data] --> B
H --> C
H --> E
H --> F

end

Figure 1: Graph representing the structure of an HPC. The user first has to connect to the headnode of one of the HPC clusters through ssh (or another protocol). The user can then submit a script (i.e. a job script) that is interpreted by the headnode. The headnode distributes the tasks across the worker nodes. The worker nodes execute the script and, if needed, fetch the data necessary to complete the task.

What is Uninett Sigma2?

Sigma2 is a non-profit company that provides services for high-performance computing and data storage to individuals and groups involved in research and education at all Norwegian universities and colleges, and other publicly funded organizations and projects (such as NINA). Their activities are financed by the Research Council of Norway (RCN) and the Sigma2 consortium partners, which are the universities in Oslo, Bergen, Trondheim and Tromsø. This collaboration goes by the name NRIS – Norwegian research infrastructure services.

Sigma2 owns four High Performance Computing (HPC) servers that have different configurations: Betzy, Saga, Fram and LUMI. Generally, if a project requires access to one or more Graphics Processing Units (GPUs) it is reasonable to apply for access to Saga.

To apply for access to one of Sigma2's HPC clusters please refer to this page. You will need to fill out a form describing your project, your experience with HPC and your computational needs (the amount of CPU / GPU resources, memory and storage your project will need).

For help regarding the application process contact Benjamin Cretois, Kjetil Grun or Francesco Frassinelli.

Getting started with Sigma2

After a successful application for an account on Sigma2 you will be given a username and will be able to log in to the HPC cluster's terminal.

To access the HPC server you applied for (in our case saga), you can log in using the ssh command in Windows PowerShell or in a Linux terminal:


$ ssh username@saga.sigma2.no

The first time you log in, you will be asked to set your password, which will have to be used for any subsequent connection to the HPC server.

Navigating in Sigma2's HPC server

Basic bash commands

Communication between you and the HPC server is usually done through a text interface called a shell (here, bash); no Graphical User Interface (GUI) is provided.

Here is what the bash prompt looks like:

[bencretois@login-5.SAGA ~]$ 

Communicating with bash requires learning a particular scripting language, bash. Below we provide a list of selected commands that will allow you to navigate the HPC server:

Change directory

Command:

  • cd + path to the directory

Output:

[bencretois@login-5.SAGA ~]$ cd deepexperiments/
[bencretois@login-5.SAGA ~/deepexperiments]$

Go to the parent directory

Command:

  • cd .. -> go to the parent directory

Output:

[bencretois@login-5.SAGA ~/deepexperiments]$ cd ..
[bencretois@login-5.SAGA ~]$

List the content of a directory

Command:

  • ls + name of a directory - Note that ls by default lists the files of your current directory

Output:

[bencretois@login-5.SAGA ~/deepexperiments]$ ls

bash_cheatsheet.md   Dockerfile             list_ignore.txt  poetry.lock     runs            sync.sh
bayesianfy.ipynb     docker_run_jupyter.sh  models           pyproject.toml  scripts         utils
deepexperiments.sif  jobs

Get the path of your current directory

Command:

  • pwd

Output:

[bencretois@login-5.SAGA ~]$ pwd
/cluster/home/bencretois

Create a new folder

Command:

  • mkdir + name of the folder you want to create

Output:

[bencretois@login-5.SAGA ~]$ mkdir new_folder
[bencretois@login-5.SAGA ~]$ ls
deepexperiments new_folder

Learning more bash commands

Bash commands are very well documented on the Internet and if you wish to learn more you can begin here.

Bash commands specific to Sigma2's HPC

There are also some useful commands specific to your Sigma2 account:

List your projects

Command:

projects -> list your projects

Output:

[bencretois@login-5.SAGA ~]$ projects
nn5019k

Look at used space and allocated quota for your projects

Command:

dusage -> look at used space and allocated quota. Note that space used is what you are currently using and quota is the limit.

Output:

[bencretois@login-5.SAGA ~]$ dusage

dusage v0.1.4
                          path    backup    space used     quota    files      quota
------------------------------  --------  ------------  --------  -------  ---------
                      /cluster        no       5.6 GiB         -   38 819          -
      /cluster/home/bencretois       yes       4.6 GiB  20.0 GiB    1 311    100 000
/cluster/work/users/bencretois        no       0.0 KiB         -        0          -
     /cluster/projects/nn5019k       yes     938.2 MiB   1.0 TiB   37 508  1 000 000

Job script basics

Running a job on the cluster involves creating a shell script called a job script. The job script is a plain-text file containing any number of commands, including your main computational task.

Anatomy of a job script

A job script consists of a couple of parts, in this order:

  • The first line, which is typically #!/bin/bash (the Slurm script does not have to be written in Bash, see below)
  • Parameters to the queue system (specified using the tag #SBATCH)
  • Commands to set up the execution environment
  • The actual commands you want to be run

Note that lines starting with a # are ignored as comments, except the shebang (i.e. #!/bin/bash) and lines that start with #SBATCH, which are not executed but contain special instructions to the queue system. There can be as many #SBATCH lines as you want. Moreover, the #SBATCH lines must precede any commands in the script.

SBATCH parameters

Which parameters are allowed or required depends on the job type and cluster, but two parameters must be present in (almost) any job:

  • --account: specifies the project the job will run in. Required by all jobs.
  • --time: specifies how long a job should be allowed to run. If it has not finished within that time, it will be cancelled.

Other parameters that you will use on HPCs such as SAGA, Betzy or Fram include:

  • --ntasks: specifies the number of tasks to run on a node
  • --cpus-per-task: allocate a specific number of CPUs to each task
  • --mem-per-cpu: allocate a specific amount of memory per CPU and per task
  • --partition: The nodes on a cluster are divided into sets, called partitions. Jobs are run in partitions that are specific to certain needs. For instance, if you want nodes with GPUs you will have to specify --partition=accel.

And more rarely (if you have a very expensive task to run):

  • --nodes: number of nodes required
  • --ntasks-per-node: number of processes or tasks to run on a single node

Module load

The module system is a concept available on most supercomputers, simplifying the use of different software (versions) in a precise and controlled manner. In most cases, an HPC cluster has far more software installed than the average user will ever use, and it would not be efficient to have it all loaded by default.

Note that with the use of containers you do not need the module system, as all the different software should be built into your image. See Chapter 2.

To get a list of available packages you can use:

module avail

To load a module into your environment to start using an application you can use:

module load package

For instance, if you want to load Pytorch v. 1.4.0 with python 3.7.4 use:

module load PyTorch/1.4.0-fosscuda-2019b-Python-3.7.4

Example 1, basic headers:

#SBATCH --account=nnXXX
#SBATCH --job-name=Run_ML_model
#SBATCH --nodes=2
#SBATCH --time=1:0:0
#SBATCH --mem-per-cpu=4G
#SBATCH --ntasks=8 --cpus-per-task=10 --ntasks-per-node=4

This job will get 2 nodes and run 4 processes (tasks) on each of them, each process getting 10 CPUs with 4 GB of memory per CPU. The wall time is 1 hour, so the job will be able to run for a maximum of 1 hour.

Example 2, train a cat / dog classifier:

#!/bin/bash

#SBATCH --account=nnXX --job-name=cat_dog_model
#SBATCH --partition=accel --gpus=1
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=4G

cd /cluster/projects/nnXX/

# Load the modules
module load Anaconda3/2020.11

# Activate my environment
conda activate myenv

# Run the script
python main_scripts/train_model.py \
            --data_path data/kaggle_cats_dogs/train \
            --save_path data/saved_models/model.pt \
            --save_es data/saved_models/model_es.pt \
            --batch_size 128 \
            --lr 0.05 \
            --num_epoch 10

This job will run on a single node located in the accel partition, as we ask for a GPU, and it runs a single process (training the deep learning model). The cd command moves us to the project folder (under which our data are stored in data). We load the Anaconda module so we can activate our virtual environment using conda activate. Once the virtual environment is activated we can finally run the main script for training the model.

Example 3, a generic script:


#!/bin/bash

# Job name:
#SBATCH --job-name=YourJobname
#
# Project:
#SBATCH --account=nnXXXXk
#
# Wall time limit:
#SBATCH --time=DD-HH:MM:SS
#
# Other parameters:
#SBATCH ...

## Set up job environment:
set -o errexit  # Exit the script on any error
set -o nounset  # Treat any unset variables as an error

module --quiet purge  # Reset the modules to the system default
module load SomeProgram/SomeVersion
module list

## Do some work:
YourCommands
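
Once a job script like the one above has been written, it is submitted to the queue system with sbatch (the same command used in the case study later in this document). A minimal sketch of the submit / monitor cycle, where the job script name and job id are only illustrative:

# Submit the job script to the queue system (hypothetical filename)
$ sbatch myjob.sh
Submitted batch job 1234567

# List your queued and running jobs
$ squeue -u $USER

# Cancel the job if needed
$ scancel 1234567

# By default, anything the job prints ends up in slurm-<jobid>.out
$ cat slurm-1234567.out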



Container technology

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. 1

Containers are a solution to the problem of how to get software to run reliably when moved from one computing environment to another. For example, containers can be used to export the training of an ML model from your local machine to an HPC server without the strain of installing all the dependencies necessary to run the training. Think of a container as a shipping container for software — it holds important content like files and programs so that an application can be delivered efficiently from producer to consumer.

To create and use containers, a container platform is necessary. Docker is the most popular, but others exist, such as Podman. However, Docker and similar platforms have some limitations that make them difficult to use on HPC clusters, so a platform better suited to HPC, such as Singularity, is necessary.

Since this document focuses primarily on using HPC, we will go through Singularity in more detail.

Container vs virtual environments

The key differences between virtual environments and containers are:

  • A virtualenv only encapsulates Python dependencies. A container (such as a docker or singularity container) encapsulates an entire OS.
  • With a Python virtualenv, you can easily switch between Python versions and dependencies, but you're stuck with your host OS.
  • With a Docker image, you can swap out the entire OS - install and run Python on Ubuntu, Debian, Alpine, even Windows Server Core.

Docker

Docker, a subset of the Moby project, is a software framework for building, running, and managing containers on servers and the cloud.

Singularity

Singularity was created to run complex applications on HPC clusters in a simple, portable, and reproducible way. First developed at Lawrence Berkeley National Laboratory, it quickly became popular at other HPC sites, academic sites, and beyond. Singularity is an open-source project, with a friendly community of developers and users. The user base continues to expand, with Singularity now used across industry and academia in many areas of work.

Images

Image versus container

A container is a virtualized runtime environment used in application development. It is used to create, run and deploy applications that are isolated from the underlying hardware. A container can use one machine, share its kernel and virtualize the OS to run more isolated processes. As a result, containers are lightweight.

An image is like a snapshot in other types of VM environments. It is a record of a Docker container at a specific point in time. Docker images are also immutable. While they can't be changed, they can be duplicated, shared or deleted. The feature is useful for testing new software or configurations because whatever happens, the image remains unchanged.

Containers need a runnable image to exist. Containers are dependent on images, because they are used to construct runtime environments and are needed to run an application. 2


graph LR;

A[Image] --> B[Container A]
A[Image] --> C[Container B]
A[Image] --> D[Container C]

Figure X: The image is a single file with all the dependencies and configurations to run a program. Containers are instances of an image.

Image registry

The Registry is a stateless, highly scalable server-side application that stores and lets you distribute Docker images. Registries can be private; for instance, some institutes or companies have internal registries to share images amongst teams. It is also possible to share images on public registries such as Docker Hub, the GitLab registry or the GitHub registry.


1

Definition from docker.com

2

Definition from techtarget.com

Singularity

Singularity is a container platform. It allows you to create and run containers that package up pieces of software in a way that is portable and reproducible. You can build a container using Singularity on your laptop, and then run it on many of the largest HPC clusters in the world, local university or company clusters, a single server, in the cloud, or on a workstation down the hall.

Singularity Image Format file

Singularity uses Singularity Image Format (.sif) files to run containers. sif files can be built through two processes:

  • Downloading pre-built images
  • Building images from scratch

Building images from scratch requires root access, and since we do not have root access at NINA we need to rely on downloading pre-built images to build our .sif files.

Downloading pre-built images as .sif files

It is possible to download images from public image registries using the singularity pull command. For instance, you can pull the latest official Python image from Docker Hub on saga using the command:

singularity pull docker://python

The docker:// URI is used to reference Docker images served from a registry. In this case pull does not just download an image file: Docker images are stored in layers, so pull combines those layers into a single usable Singularity file.

We can make sure that the image has been pulled:

[bencretois@login-1.SAGA ~]$ ls
python_latest.sif
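
If you need a specific version rather than the latest image, a tag can be appended after the image name in the docker:// URI. For instance (the version tag here is only an illustration):

singularity pull docker://python:3.8-slim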

Downloading your own custom image

In most cases you will want to use an image that you built yourself, so that the dependencies required to run your custom software are already specified and installed. Since we do not have root access at NINA we follow this workflow:

  1. Specify a Dockerfile
  2. Build the docker image
  3. Push the docker image in a registry
  4. Pull the image as a .sif file from a HPC cluster

Below we describe and provide an example for each step.

1. Specify a Dockerfile

We first write a Dockerfile containing everything we need to run our software. In this case, we want to train a cat and dog picture classifier and we need to write the Dockerfile accordingly.

FROM python:3.8

ARG DEBIAN_FRONTEND=noninteractive

RUN pip3 install poetry 

WORKDIR /app
COPY pyproject.toml poetry.lock ./

RUN poetry config virtualenvs.create false
RUN poetry install --no-root

COPY . ./

ENV PYTHONPATH "${PYTHONPATH}:/app/"

In this Dockerfile we first use the official image for Python 3.8.

Then we install poetry (the python package manager we use) using pip. We recommend poetry over anaconda as we ran into some problems running anaconda with Docker.

We set the working directory of the container to /app.

We copy both pyproject.toml and poetry.lock into the container so that poetry knows which packages to install.

We install the packages necessary to run our machine learning experiment.

And finally we specify the PYTHONPATH so that scripts in certain folders can import the scripts stored in other folders.

2. Build the docker image

Building the docker image implies that docker is installed on your system. At NINA it is possible to install Docker Desktop to use Docker on the remote server. Please contact Datahjelp, who can assist you in setting up either Docker Desktop or the VDI.

Once you have access to docker you can build your custom image using the command:

docker build -t ml_image -f Dockerfile .

-t stands for "tag", which is the name you want to give to the image.

-f stands for "file" and takes the Dockerfile as input.

3. Push the docker image in a registry

It is possible to push your custom image directly to the GitLab or GitHub registry.

Pushing the image on GitLab registry

Pushing the image to the GitLab registry requires less manual configuration, and we give an example of how to do it below.

Provided that you have a GitLab account and a GitLab project (in our example the project is called ml_image) for your specific task, the image should be renamed as:

registry.gitlab.com/nina-data/ml_image:latest

We rename the image to provide a URL to the registry where the image should be stored. Once the image has been pushed, you can find it under your GitLab project -> Container Registry.
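
As a minimal sketch of these steps (assuming the image was built locally with the tag ml_image and that you have access to the nina-data group), logging in, renaming and pushing could look like this:

# Log in to the GitLab registry
docker login registry.gitlab.com

# Rename (tag) the local image with the registry URL
docker tag ml_image registry.gitlab.com/nina-data/ml_image:latest

# Push the renamed image to the registry
docker push registry.gitlab.com/nina-data/ml_image:latest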

Pushing the image on GitHub registry

In your project repository create a folder .github/workflows and create a file publish_image.yml containing the following code:

name: Create and publish a Docker image

on:
  push:
    branches:
      - main

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push-image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Log in to the Container registry
        uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@98669ae865ea3cffbcbaa878cf57c20bbf1c6c38
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

      - name: Build and push Docker image
        uses: docker/build-push-action@ad44023a93711e3deb337508980b4b5e9bcdc5dc
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

Once the folder and file have been created, simply push the .github folder to the project's GitHub repository and GitHub Actions will take care of building and hosting the image.

Note that you will find the image in Packages (right sidebar of GitHub) and that you will need to make your image public before being able to pull it on Sigma2.

4. Pull the image as a .sif file from a HPC cluster

To pull an image that is stored on the GitLab, Docker Hub or GitHub registry simply run:

singularity pull docker://registry/name_of_your_image

For instance, to pull the image from this repository (which is hosted on GitHub):

 singularity pull docker://ghcr.io/ninanor/91126800_ml_and_associated_tech:main

Interact with images

Now that the .sif file has been pulled, it is possible to interact with it in multiple ways.

Shell

The shell command allows you to spawn a new shell within your container and interact with it as though it were a small virtual machine.

[bencretois@login-1.SAGA ~]$ singularity shell 91126800_ml_and_associated_tech_main.sif
Singularity>

The change in prompt indicates that you have entered the container. Note that once inside a Singularity container, you are the same user as you are on the host system; unlike with Docker, you do not gain root access inside the container. We can verify this with whoami:

Singularity> whoami
bencretois

Executing commands

The exec command allows you to execute a custom command within a container by specifying the image file. For instance, if we want to train a machine learning model using the .sif image we could write:

singularity exec \
                91126800_ml_and_associated_tech_main.sif \
                python main_scripts/train_model.py

Specifying bind paths

If the data we want to process or use to train a machine learning model is stored in a different folder (for instance our .sif file is in /cluster/projects/nn8055k but the data is in /cluster/projects/nn8054k), we need to expose /cluster/projects/nn8054k, or in other words make it available to the container. The --bind flag fills that purpose.

We would run the container as follows:

singularity exec \
                --bind /cluster/projects/nn8054k \
                91126800_ml_and_associated_tech_main.sif \
                python main_scripts/train_model.py
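
It is also possible to mount the folder under a different path inside the container by giving --bind a source:destination pair (this is the form used in the case study later in this document). For instance, to make the project folder appear as /Data inside the container:

singularity exec \
                --bind /cluster/projects/nn8054k:/Data \
                91126800_ml_and_associated_tech_main.sif \
                python main_scripts/train_model.py

Inside the container, the data is then available under /Data instead of the full cluster path.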

Exposing GPUs

When training or using a machine learning model it is usually preferable to use one or more GPUs to accelerate the process. Since the container is an "isolated" environment, we need to specify that we want to expose GPUs to it. This can simply be done by adding the --nv flag. For example:

singularity exec \
                --nv \
                91126800_ml_and_associated_tech_main.sif \
                python main_scripts/train_model.py

File management on HPC clusters

In addition to training a machine learning model, we usually want to process our data on an HPC cluster to benefit from its computational resources. For example, if we want to train a complex machine learning model, we first want the HPC cluster to process our data into tensors that can be used as inputs for the model. Using the data requires being able to access it.

There are two ways of making the data available to the HPC cluster:

  1. Copying the data over to the HPC cluster
  2. Using a filesystem to make remote data available to the HPC cluster

Copying the data over to the HPC cluster

Both methods have their advantages and drawbacks. While copying data over to the HPC cluster is conceptually simpler, as it doesn't require the use of any specific library, the method is limited by the size of the dataset. By default, if you apply for an account on one of Sigma2's HPC clusters you will be allocated 1 TB of storage, which can be extended up to 10 TB. Moreover, copying multiple TB of data can be a long process.

Using a filesystem to make remote data available to the HPC cluster

On the other hand, using a filesystem abstraction allows you to use data stored on a remote server (e.g. cloud storage, private servers ...) from the HPC cluster, removing the need for large storage on the HPC cluster itself. Filesystems nevertheless require slight changes to your code.

Copying files over to an HPC cluster

It is possible to copy / transfer files using simple bash commands: scp is used to simply copy data over, while rsync is used to synchronise folders.

scp: Copying files

The first command is scp, which allows you to copy files from your machine to the HPC machines. For instance, to copy the script train_model.py from the template folder over to saga into the project folder nn8055k, we would write (using your relevant username and HPC):

$ scp train_model.py bencretois@saga.sigma2.no:/cluster/projects/nn8055k/

With scp it is also possible to copy a folder over to the HPC machine; you will need to add the -r flag for this. For instance, if I want to copy the entire template folder over to saga I can write:

$ scp -r template bencretois@saga.sigma2.no:/cluster/projects/nn8055k

rsync: Synchronizing a local repository with a remote repository

Instead of copying all files from your local folder to your remote folder, you can synchronize the two folders with rsync. Synchronizing has the advantage of being more flexible than scp and includes some optimisations that make the transfer of files faster. Moreover, rsync has a plethora of command line options, allowing the user to fine-tune its behaviour. It supports complex filter rules, runs in batch mode, daemon mode, etc.

$ rsync -e ssh -avz ./local_repo user@server:/remote_repo

-a is the archive option, i.e. it syncs directories recursively while keeping permissions, symbolic links, ownership, and group settings.

-v is the verbose option and prints the progress and status of the rsync command.

-z compresses files during the transfer, which speeds up the sync.

-e is used to specify the remote shell to use, ssh in our case.

It is also possible to use the --exclude option to exclude some files from synchronisation:

$ rsync -e ssh -avz --exclude "file.txt" ./local_repo user@server:/remote_repo

If there are several files that we do not want to send to the remote repository, we can instead generate a .txt file containing a list of files to exclude.

$ rsync -e ssh -avz --exclude-from="list_ignore.txt" ./local_repo user@server:/remote_repo

With list_ignore.txt looking like:

folder1
file1.txt
folder2

Filesystem

A filesystem (or file system) is the way in which files are named and where they are placed logically for storage and retrieval. Without a file system, stored information wouldn't be isolated into individual files and would be difficult to identify and retrieve. As data capacities increase, the organization and accessibility of individual files are becoming even more important in data storage. 1.


graph TB;

A[Project folder] --> B[Camera trap pictures]
A --> C[Code]
A --> D[Report.doc]
B --> E[Pictures of deer]
B --> F[Pictures of birds]
C --> G[myscript.py]

Pyfilesystem: a filesystem abstraction for Python

A filesystem abstraction allows you to work with files and directories in archives, memory, the cloud etc. as easily as your local drive. It makes your code agnostic to where the data is stored.

Pyfilesystem is a filesystem abstraction for Python. This means that in your python code you can fully replace the use of os with Pyfilesystem.
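
As a minimal sketch (assuming the pyfilesystem2 package is installed, which provides the fs module), listing the contents of a directory looks like this, and the exact same call works for archive, memory or remote filesystems:

import fs

# Open the current directory as a filesystem object
local_fs = fs.open_fs(".")

# Equivalent to os.listdir("."), but the same call also works
# for mem://, zip:// or ssh:// filesystems
print(local_fs.listdir("/"))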

Suppose the data is stored on NIRD in the project folder /folder/my_data but I want to process the data on SAGA to benefit from optimal computational resources. To access the data on NIRD I would make the following change to my code:

First, a connection to NIRD must be established. The connection to NIRD is done using the ssh protocol.

import fs  # pyfilesystem2; opening ssh:// URLs also requires the fs.sshfs plugin

connection_string = "ssh://bencretois:PASSWORD@nird.sigma2.no"


def doConnection(connection_string):
    # Open the remote filesystem described by the connection string
    myfs = fs.open_fs(connection_string)
    return myfs


my_filesystem = doConnection(connection_string)

Now that the connection is established we can list the files in /folder/my_data:

def walk_audio(filesystem, input_path):
    # Walk the remote directory tree, keeping only audio files
    walker = filesystem.walk(input_path, filter=['*.wav', '*.mp3'])
    for path, dirs, flist in walker:
        for f in flist:
            yield fs.path.combine(path, f.name)


all_files = walk_audio(my_filesystem, "/folder/my_data")

Now that the generator all_files has been created, it is possible to use the remote files as if they were local.
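
As a small sketch of what this could look like (reading the raw bytes only; in practice the bytes would be handed to an audio library), each path yielded by walk_audio can be opened directly through the filesystem object:

# Read the first remote file as if it were local
for audio_path in all_files:
    with my_filesystem.openbin(audio_path) as f:
        data = f.read()  # raw bytes of the .wav / .mp3 file
        print(audio_path, len(data), "bytes")
    break  # only the first file, as an illustration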


1 Definition on Techtarget

Case study: training a cat / dog classifier

In this section we will demonstrate how to train a cat and dog classifier using supercomputers and in particular SIGMA2.

Data

The case study is entirely reproducible and you can run it yourself provided you have an account on SIGMA2.

The data to run this case study can be found on Kaggle, a subsidiary of Google that allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. The training dataset is composed of 25,000 images of dogs and cats and weighs approximately 500 MB.

Code

All the scripts are found in the GitHub repository of this book, under case_study. The folder contains three subfolders:

  • bash_scripts: contains all the bash scripts which are not to be run on HPC.
  • hpc_scripts: contains all the bash scripts to be run on HPC.
  • model_scripts: All the python scripts necessary to train the model.

The folder also contains a Dockerfile which will be used for creating the Docker container.

Setting up the environment

Virtual environments versus containers

Two choices are offered to set up the environment: virtual environments and containers. We will demonstrate how we can train the cat / dog classifier using both methods.

Conda setup

WORK IN PROGRESS

Container setup

On the other hand, it might be a good idea to set up a container. Because we do not have root access on SIGMA2's HPC clusters, we have to build a Singularity container from a Docker container, which is specified using a Dockerfile. Let's analyse the Dockerfile of this specific case study line by line:

  • First we install python version 3.8 in our container so we can use up to date python libraries
FROM python:3.8
  • Avoid the system asking questions / dialogs during the apt-get install
ARG DEBIAN_FRONTEND=noninteractive
  • Update apt-get to have up to date packages
RUN \
    apt-get update && \
    rm -rf /var/lib/apt/lists/*
  • Install the package manager poetry. Note that you can install your preferred package manager such as conda in this step.
RUN pip3 install poetry 
  • Set the working directory of the container
WORKDIR /app
  • The next three lines are specific to poetry. Basically, we copy both pyproject.toml (the file listing the packages we use for our analysis) and poetry.lock (the file containing all the dependencies). Then we disable the creation of a virtual environment so that python in our container uses all our packages without opening a virtual environment. Finally we install our packages.
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.create false
RUN poetry install --no-root
  • Copy all the files of the folder from which we build the image
COPY . ./
  • Set the python path to the working directory. This way the main_scripts can access the scripts in the other folders (for instance the utils scripts).
ENV PYTHONPATH "${PYTHONPATH}:/app/"

2 - Creating the docker image

Now that the Dockerfile is defined we can create our image (i.e. the "environment" in which we will run the training script). Open a terminal, move to the folder where your Dockerfile is located and write the following command (change case_study_1 to the name of your folder):

docker build -t case_study_1:latest -f Dockerfile .

The command should output the following:

Sending build context to Docker daemon  244.7kB
Step 1/10 : FROM python:3.8
 ---> 271c1bcd4489
Step 2/10 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> 0965e91032c6
Step 3/10 : RUN     apt-get update &&     rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 02fa21122354
Step 4/10 : RUN pip3 install poetry
 ---> Using cache
 ---> 33bbd2c53863
Step 5/10 : WORKDIR /app
 ---> Using cache
 ---> 7720e687da9c
Step 6/10 : COPY pyproject.toml poetry.lock ./
 ---> Using cache
 ---> 345244b7ba43
Step 7/10 : RUN poetry config virtualenvs.create false
 ---> Using cache
 ---> 47271847855f
Step 8/10 : RUN poetry install --no-root
 ---> Using cache
 ---> 67e487ef8ae6
Step 9/10 : COPY . ./
 ---> c7656f1447fa
Step 10/10 : ENV PYTHONPATH "${PYTHONPATH}:/app/"
 ---> Running in 9f82e9b70fc5
Removing intermediate container 9f82e9b70fc5
 ---> b08815df7070
Successfully built b08815df7070
Successfully tagged case_study_1:latest

Indicating that the image has been successfully created.

Training model locally

First of all, we should make sure that our scripts work well on the VDI before spending computational resources on SIGMA2. To develop and improve our scripts we can leverage the power of Docker by using the created image in two ways.

First, we can start a Jupyter instance inside the docker container so you can develop in a more interactive environment while having all the python libraries you need. For this you can use the script docker_start_jupyter in ml-sats/bash_utils.

The script

#!/bin/bash

cd ~/Code/deepexperiments

docker run \
    -p 8889:8889 \
    --rm -it \
    -v $PWD:/app \
    -v $HOME/Data:/Data \
    case_study_1:latest \
    poetry run jupyter lab \
    --port=8889 --no-browser --ip=0.0.0.0 --allow-root

You can also use the docker image to run the training script on any computer using the script case_study_1/bash_scripts/train_model.sh. Note that you need to change the folders that are exposed (for the meaning of exposed folders refer to the document XXX).

#!/bin/bash

# -e: stop on error
# -u: raise an error if a variable is undefined
# -o pipefail: trigger an error when a command in the pipe fails
set -euo pipefail

cd $HOME/Code/case_study_1

DATA_PATH=/Data/train
OUT_DIR=/Data/

docker run --rm -v $HOME/Data:/Data -v $PWD:/app case_study_1:latest \
    python -u /app/main_scripts/train_model.py \
                --data_path $DATA_PATH \
                --save_path $OUT_DIR/model.pt \
                --save_es $OUT_DIR/model.pt \
                --batch_size 128 \
                --lr 0.001 \
                --num_epoch 10

We are finally ready to train the model. Since this is a test and the main model will be trained on SIGMA2, we run the model only for a few epochs to be sure the code does not contain any bugs. Running the script (./bash_scripts/train_model.sh) should output the following:

benjamin.cretois@nixml086424q01:~/Code/ml-sats/case_study_1$ ./bash_scripts/train_model.sh 
./bash_scripts/train_model.sh: line 3: cd: /home/benjamin.cretois/Code/case_study_1: No such file or directory
/usr/local/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Epoch: 0, Training loss: 0.6700030667766644, Validation loss: 0.6518245041370392
Validation loss decreased (inf --> 0.651825).  Saving model ...
Epoch: 1, Training loss: 0.6335798464003642, Validation loss: 0.6232574313879014
Validation loss decreased (0.651825 --> 0.623257).  Saving model ...
Epoch: 2, Training loss: 0.5948401207377196, Validation loss: 0.5763850644230842
Validation loss decreased (0.623257 --> 0.576385).  Saving model ...
Epoch: 3, Training loss: 0.553090987691454, Validation loss: 0.5323234401643276
Validation loss decreased (0.576385 --> 0.532323).  Saving model ...
Epoch: 4, Training loss: 0.5246088358627004, Validation loss: 0.5051904194056988
Validation loss decreased (0.532323 --> 0.505190).  Saving model ...
Epoch: 5, Training loss: 0.48994877194143405, Validation loss: 0.4672641947865486
Validation loss decreased (0.505190 --> 0.467264).  Saving model ...
Epoch: 6, Training loss: 0.4647162993242786, Validation loss: 0.45798871740698816
Validation loss decreased (0.467264 --> 0.457989).  Saving model ...
Epoch: 7, Training loss: 0.4354040877074952, Validation loss: 0.42637150436639787
Validation loss decreased (0.457989 --> 0.426372).  Saving model ...
Epoch: 8, Training loss: 0.4093162176335693, Validation loss: 0.41202530562877654
Validation loss decreased (0.426372 --> 0.412025).  Saving model ...
Epoch: 9, Training loss: 0.3919389988206754, Validation loss: 0.39200695902109145
Validation loss decreased (0.412025 --> 0.392007).  Saving model ...
Finished Training

Training model on SIGMA2

4 - Pushing the image to the Gitlab repository

On SIGMA2 we cannot use our docker image immediately as it has been created on our local computer. As explained here, SIGMA2 uses Singularity, which is a different piece of software for handling images.

First we need to push our docker image to GitLab (we use GitLab instead of GitHub as it is easier to handle docker images on GitLab). This has to be done in 3 steps:

  • First, be sure you are logged in to the GitLab registry:
docker login registry.gitlab.com
  • Then we need to rename the docker image with regard to the GitLab repository where it will be stored. For instance, the image we built, case_study_1:latest, will be stored in registry.gitlab.com/nina-data/ml-sats/, thus we need to rename the image as registry.gitlab.com/nina-data/ml-sats/case_study_1:latest:
docker tag case_study_1:latest registry.gitlab.com/nina-data/ml-sats/case_study_1:latest
  • Then we can push the image to the GitLab registry:
docker push registry.gitlab.com/nina-data/ml-sats/case_study_1:latest

5 - Training the model on SIGMA2

We first need to pull the image we stored on the GitLab repository into our folder on SIGMA2. Singularity uses .sif files, so when we pull the image we relabel it as a .sif file:

singularity pull --name case_study_1.sif docker://registry.gitlab.com/nina-data/ml-sats/case_study_1

Now that case_study_1.sif has been created in our folder we can train the model on SIGMA2 using the script case_study_1/hpc_scripts/train_model.sh. The script contains a few lines worth noticing:

  • First the shebang, to tell the interpreter that this is a bash script
#!/bin/bash
  • The SIGMA2-specific lines. Here we tell the HPC server that to run our script we need 1 GPU for a maximum of 24 hours. We also ask for 4 GB of memory per CPU.
#SBATCH --account=nn5019k --job-name=cat_dog_model
#SBATCH --partition=accel --gpus=1
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=4G
  • We change our current directory to the directory where our scripts are located
cd $HOME/ml-sats/case_study_1
  • Since we have 24 hours of GPU time, we run our script with three different learning rates. In this specific case the script will be run 3 times consecutively: once the run with a learning rate of 0.01 is finished, the script will be run with a learning rate of 0.001, and so on. We will learn about parallelizing on SIGMA2 in case_study_2.
for LR in 0.01 0.001 0.0001
do
  • Finally we set up the container for running the script train_model.py and specify all the relevant parameters. With singularity, by default only the working directory is exposed (in our case $HOME/ml-sats), so we need to manually expose the folder where the data is stored and where we want the model to be saved with the --bind option. Then we ask the container to run the script train_model.py with python. The -u option is useful to display the printed text in the .out file.
singularity exec --bind /cluster/projects/nn5019k:/Data \
    --nv case_study_1.sif \
    python -u main_scripts/train_model.py \
                --data_path /Data/kaggle_cats_dogs/train \
                --save_path /Data/saved_models/model.pt \
                --save_es /Data/saved_models/model_es.pt \
                --batch_size 128 \
                --lr $LR \
                --num_epoch 100
done
  • Now we can start the model training by submitting our script as a job with the command:
sbatch train_model.sh
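
After submission, the job can be monitored from the login node. Since the job script does not set --output, Slurm writes the printed text to its default slurm-<jobid>.out file; the job id below is only an example:

# Check that the job is queued or running
squeue -u $USER

# Follow the training output produced by python -u
tail -f slurm-1234567.out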

6 (optional) - Importing the trained model locally

If we want to use our trained model on our own computer we can import the model using the command scp:

scp bencretois@saga.sigma2.no:/cluster/projects/nn5019k/saved_models/model.pt $HOME/Code/ml-sats/case_study_1

Chapter 5: Advanced topics

Ray tune

Acknowledgments

This document would not have been possible without the input of Francesco Frassinelli, Kjetil Grun and Stig Clausen.

List of acronyms

| Acronym | Term | Description |
| --- | --- | --- |
| GPU | Graphics Processing Unit | A specialized processor originally designed to accelerate graphics rendering. Used extensively in machine learning to train complex algorithms. |
| GUI | Graphical User Interface | A form of user interface that allows users to interact with electronic devices through graphical icons and audio indicators such as primary notation. |
| ML | Machine Learning | The use and development of computer systems that are able to learn and adapt without following explicit instructions. |
| DL | Deep Learning | Part of the family of machine learning methods based on artificial neural networks with representation learning. |