Introduction
This document is a guide to getting started with High Performance Computing clusters (HPCC), and more particularly Uninett Sigma2's HPCC. This guide aims to teach the basic commands for navigating the terminal of an HPCC, to help you run a job script, and to cover managing files, such as copying files from your local directory to the HPCC and using files that are stored remotely.
This document is the result of a Strategic Funding grant from the Norwegian Institute for Nature Research (NINA). Please note that the document is not static and is constantly subject to improvement based on feedback. You can contribute to improving the document by opening issues on the GitHub repository of this project.
Authors
The document has been written by Benjamin Cretois with the help of Francesco Frassinelli.
What is HPC and why use it?
High performance computing (HPC) generally refers to processing complex calculations at high speeds across multiple servers in parallel. Those groups of servers are known as clusters and are composed of hundreds or even thousands of compute servers that have been connected through a network.
With the increased use of technologies like the Internet of Things (IoT), artificial intelligence (AI), and machine learning (ML), organizations are producing huge quantities of data, and they need to be able to process and use that data more quickly, in real time. To power the analysis of such large datasets it is often more practical, or even mandatory, to use supercomputing. Supercomputing enables fast processing of data and training of complex algorithms such as neural networks for image and sound recognition.
In Chapter 1 we explore the basics of HPC and learn more about UNINETT Sigma2, a Norwegian infrastructure providing large-scale computational resources.
Basic vocabulary
Cluster
An HPC cluster is a collection of many separate servers (computers), called nodes, which are connected via a fast interconnect.
Node
HPC clusters are composed of:
- a headnode, where users log in
- a specialized data transfer node
- regular compute nodes (where the majority of computations are run)
- GPU nodes (on these nodes computations can be run both on CPU cores and on a Graphical Processing Unit)
All cluster nodes have the same components as a laptop or desktop: CPU cores, memory and disk space. The difference between a personal computer and a cluster node lies in the quantity, quality and power of the components.
Synthesis
graph TB;
G[User] --> |ssh| A
subgraph HPC
subgraph Cluster one
A[headnode 1] --> B[Worker node 1]
A --> C[Worker node 2]
end
subgraph Cluster two
D[headnode 2] --> E[Worker node 1]
D --> F[Worker node 2]
end
H[Data] --> B
H --> C
H --> E
H --> F
end
Figure 1: Graph representing the structure of an HPC. The user first has to connect to the headnode of one of the HPC clusters through ssh (or another protocol). The user can then submit a script (i.e. a job script) that is interpreted by the headnode. The headnode distributes the tasks across the worker nodes. The worker nodes execute the script and, if needed, fetch the data necessary to complete the task.
What is Uninett Sigma2?
Sigma2 is a non-profit company that provides services for high-performance computing and data storage to individuals and groups involved in research and education at all Norwegian universities and colleges, and other publicly funded organizations and projects (such as NINA). Their activities are financed by the Research Council of Norway (RCN) and the Sigma2 consortium partners, which are the universities in Oslo, Bergen, Trondheim and Tromsø. This collaboration goes by the name NRIS – Norwegian research infrastructure services.
Sigma2 owns four High Performance Computing (HPC) systems that have different configurations: Betzy, Saga, Fram and LUMI. Generally, if a project requires access to one or more Graphics Processing Units (GPUs) it is reasonable to apply for access to Saga.
To apply for access to one of Sigma2 HPC please refer to this page. You will need to fill out a form describing your project, your experience with HPC and your computational needs (amount of CPU / GPU memory and storage your project will need).
For help regarding the application process contact Benjamin Cretois, Kjetil Grun or Francesco Frassinelli.
Getting started with Sigma2
After a successful application for an account on Sigma2 you will be given a username and will be able to log in to the HPC cluster.
To access the HPC server you applied for (in our case saga), you can log in using the ssh command in Windows PowerShell or in a Linux terminal:
$ ssh username@saga.sigma2.no
The first time you log in, you will be asked to set your password which will have to be used at any subsequent connection to the HPC server.
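If you connect often, you can optionally add a host entry to your local ~/.ssh/config file so that typing ssh saga is enough (the alias saga used here is an arbitrary name, not something required by Sigma2):

```
Host saga
    HostName saga.sigma2.no
    User username
```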
Navigating in Sigma2's HPC server
Basic bash commands
Communication between you and the HPC server is usually done through a command-line interface called a shell (in our case Bash); no Graphical User Interface (GUI) is provided.
Here is what the prompt looks like:
[bencretois@login-5.SAGA ~]$
Communicating through the shell requires learning a bit of Bash scripting. Below we provide a list of selected commands that will allow you to navigate the HPC server:
Change directory
Command:
cd + path to the directory
Output:
[bencretois@login-5.SAGA ~]$ cd deepexperiments/
[bencretois@login-5.SAGA ~/deepexperiments]$
Go to the parent directory
Command:
cd ..
-> go up one directory level
Output:
[bencretois@login-5.SAGA ~/deepexperiments]$ cd ..
[bencretois@login-5.SAGA ~]$
List the content of a directory
Command:
ls
+ name of a directory - Note that ls
by default lists the files of your current directory
Output:
[bencretois@login-5.SAGA ~/deepexperiments]$ ls
bash_cheatsheet.md Dockerfile list_ignore.txt poetry.lock runs sync.sh
bayesianfy.ipynb docker_run_jupyter.sh models pyproject.toml scripts utils
deepexperiments.sif jobs
Get the path of your current directory
Command:
pwd
Output:
[bencretois@login-5.SAGA ~]$ pwd
/cluster/home/bencretois
Create a new folder
Command:
mkdir
+ name of the folder you want to create
Output:
[bencretois@login-5.SAGA ~]$ mkdir new_folder
[bencretois@login-5.SAGA ~]$ ls
deepexperiments new_folder
Learning more bash commands
Bash commands are very well documented on the internet, and if you wish to learn more you can begin here.
Bash commands specific to Sigma2's HPC
There are also some useful commands specific to your Sigma2 account:
List your projects
Command:
projects
-> list your projects
Output:
[bencretois@login-5.SAGA ~]$ projects
nn5019k
Look at used space and allocated quota for your projects
Command:
dusage
-> shows disk usage and quotas for your projects. Note that space used is what you are currently using and quota is the limit.
Output:
[bencretois@login-5.SAGA ~]$ dusage
dusage v0.1.4
path backup space used quota files quota
------------------------------ -------- ------------ -------- ------- ---------
/cluster no 5.6 GiB - 38 819 -
/cluster/home/bencretois yes 4.6 GiB 20.0 GiB 1 311 100 000
/cluster/work/users/bencretois no 0.0 KiB - 0 -
/cluster/projects/nn5019k yes 938.2 MiB 1.0 TiB 37 508 1 000 000
Job script basics
Running a job on the cluster involves creating a shell script called a job script. The job script is a plain-text file containing any number of commands, including your main computational task.
Anatomy of a job script
A job script consists of a couple of parts, in this order:
- The first line, which is typically #!/bin/bash (the Slurm script does not have to be written in Bash, see below)
- Parameters to the queue system (specified using the tag #SBATCH)
- Commands to set up the execution environment
- The actual commands you want to be run
Note that lines starting with a # are ignored as comments, except the lines that start with #SBATCH and the shebang (i.e. #!/bin/bash), which are not executed but contain special instructions to the queue system. There can be as many #SBATCH lines as you want. Moreover, the #SBATCH lines must precede any commands in the script.
SBATCH parameters
Which parameters are allowed or required depends on the job type and cluster, but two parameters must be present in (almost) any job:
- --account: specifies the project the job will run in. Required by all jobs.
- --time: specifies how long a job should be allowed to run. If it has not finished within that time, it will be cancelled.
Other parameters that you will use on HPCs such as SAGA, Betzy or Fram include:
- --ntasks: specifies the number of tasks to run on a node
- --cpus-per-task: allocate a specific number of CPUs to each task
- --mem-per-cpu: allocate a specific amount of memory per CPU and per task
- --partition: the nodes of a cluster are divided into sets, called partitions, dedicated to certain needs. For instance, if you want nodes with a GPU you will have to specify --partition=accel.
And more rarely (if you have a very expensive task to run):
- --nodes: number of nodes required
- --ntasks-per-node: number of processes or tasks to run on a single node
Module load
The module system is a concept available on most supercomputers, simplifying the use of different software (versions) in a precise and controlled manner. In most cases, an HPC cluster has far more software installed than the average user will ever use, and it would not be efficient to have it all loaded by default.
Note that if you use containers, you do not need the module system, as all the different pieces of software should be built into your image. See Chapter 2.
To get a list of available packages you can use:
module avail
To load a module into your environment to start using an application you can use:
module load package
For instance, if you want to load PyTorch v. 1.4.0 with Python 3.7.4, use:
module load PyTorch/1.4.0-fosscuda-2019b-Python-3.7.4
Example 1, basic headers:
#SBATCH --account=nnXXX
#SBATCH --job-name=Run_ML_model
#SBATCH --nodes=2
#SBATCH --time=1:0:0
#SBATCH --mem-per-cpu=4G
#SBATCH --ntasks=8 --cpus-per-task=10 --ntasks-per-node=4
This job will get 2 nodes and run 4 processes (tasks) on each of them; each process gets 10 CPUs with 4 GB of memory per CPU. The wall-time is 1 hour, so each task will be able to compute for a maximum of 1 hour.
Example 2, train a cat / dog classifier:
#!/bin/bash
#SBATCH --account=nnXX --job-name=cat_dog_model
#SBATCH --partition=accel --gpus=1
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=4G
cd /cluster/projects/nnXX/
# Load the modules
module load Anaconda3/2020.11
# Activate my environment
conda activate myenv
# Run the script
python main_scripts/train_model.py \
--data_path data/kaggle_cats_dogs/train \
--save_path data/saved_models/model.pt \
--save_es data/saved_models/model_es.pt \
--batch_size 128 \
--lr 0.05 \
--num_epoch 10
This job will run on a single node located in the accel partition, as we ask for a GPU, and run a single process (training the deep learning model). The cd command moves us to the project folder (under which our data are stored in data). We load the Anaconda module so we can activate our virtual environment using conda activate. Once the virtual environment is activated, we can finally run the main script for training the model.
Example 3, a generic script:
#!/bin/bash
# Job name:
#SBATCH --job-name=YourJobname
#
# Project:
#SBATCH --account=nnXXXXk
#
# Wall time limit:
#SBATCH --time=DD-HH:MM:SS
#
# Other parameters:
#SBATCH ...
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load SomeProgram/SomeVersion
module list
## Do some work:
YourCommands
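Once a job script like the one above is written, it can be submitted and monitored with Slurm's standard commands, run from the login node (run_job.sh is a hypothetical file name used for illustration):

```shell
sbatch run_job.sh   # submit the job script; Slurm prints the job ID
squeue -u $USER     # list your pending and running jobs
scancel 123456      # cancel a job, using the ID printed by sbatch
```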
Container technology
A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. 1
Containers are a solution to the problem of how to get software to run reliably when moved from one computing environment to another. For example, containers can be used to export the training of an ML model from your local machine to an HPC server without the strain of installing all the dependencies necessary to run the training. Think of a container as a shipping container for software: it holds important content like files and programs so that an application can be delivered efficiently from producer to consumer.
To create and use containers, a container platform is necessary. Docker is the most popular, but others exist, such as Podman. However, Docker and similar platforms have some limitations which make them difficult to use on HPC clusters, where platforms designed for shared systems, such as Singularity, are necessary.
Since this document focuses primarily on using HPC, we will go through Singularity in more detail.
Container vs virtual environments
The key differences between virtual environments and containers are:
- A virtualenv only encapsulates Python dependencies. A container (such as a docker or singularity container) encapsulates an entire OS.
- With a Python virtualenv, you can easily switch between Python versions and dependencies, but you're stuck with your host OS.
- With a Docker image, you can swap out the entire OS - install and run Python on Ubuntu, Debian, Alpine, even Windows Server Core.
Docker
Docker, a subset of the Moby project, is a software framework for building, running, and managing containers on servers and the cloud.
Singularity
Singularity was created to run complex applications on HPC clusters in a simple, portable, and reproducible way. First developed at Lawrence Berkeley National Laboratory, it quickly became popular at other HPC sites, academic sites, and beyond. Singularity is an open-source project, with a friendly community of developers and users. The user base continues to expand, with Singularity now used across industry and academia in many areas of work.
Images
Image versus container
A container is a virtualized runtime environment used in application development. It is used to create, run and deploy applications that are isolated from the underlying hardware. A container can use one machine, share its kernel and virtualize the OS to run more isolated processes. As a result, containers are lightweight.
An image is like a snapshot in other types of VM environments. It is a record of a Docker container at a specific point in time. Docker images are also immutable. While they can't be changed, they can be duplicated, shared or deleted. The feature is useful for testing new software or configurations because whatever happens, the image remains unchanged.
Containers need a runnable image to exist. Containers are dependent on images, because they are used to construct runtime environments and are needed to run an application. 2
graph LR;
A[Image] --> B[Container A]
A[Image] --> C[Container B]
A[Image] --> D[Container C]
Figure X: The image is a single file with all the dependencies and configurations to run a program. Containers are instances of an image.
Image registry
The Registry is a stateless, highly scalable server side application that stores and lets you distribute Docker images. Registries can be private, for instance some institutes or companies have internal registries to share images amongst teams but it is also possible to share images on public registries such as Docker Hub, GitLab registry or the GitHub registry.
Definition from docker.com
Definition from techtarget.com
Singularity
Singularity is a container platform. It allows you to create and run containers that package up pieces of software in a way that is portable and reproducible. You can build a container using Singularity on your laptop, and then run it on many of the largest HPC clusters in the world, local university or company clusters, a single server, in the cloud, or on a workstation down the hall.
Singularity Image Format file
Singularity uses Singularity Image Format (.sif) files to run containers. .sif files can be obtained through two processes:
- Downloading pre-built images
- Building images from scratch
Building images from scratch requires root access, and since we do not have root access at NINA, we need to rely on downloading pre-built images to obtain our .sif files.
Downloading pre-built images as .sif files
It is possible to download images from public image registries with Singularity using singularity pull. For instance, you can pull the latest official Python image from Docker Hub on saga using the command:
singularity pull docker://python
The docker:// URI is used to reference Docker images served from a registry. In this case pull does not just download an image file: Docker images are stored in layers, so pull combines those layers into a usable Singularity file.
We can check that the image has been pulled:
[bencretois@login-1.SAGA ~]$ ls
python_latest.sif
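As a quick sanity check, you can run a command inside the pulled image (this assumes Singularity is available on the node you are logged in to):

```shell
singularity exec python_latest.sif python3 --version
```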
Downloading your own custom image
In most cases you will want to use an image that you built yourself, so that the dependencies required to run your custom software are already specified and installed. Since we do not have root access at NINA, we follow this workflow:
- Specify a Dockerfile
- Build the Docker image
- Push the Docker image to a registry
- Pull the image as a .sif file from an HPC cluster
Below we describe and provide an example for each step.
1. Specify a Dockerfile
We first write a Dockerfile containing everything we need to run our software. In this case, we want to train a cat and dog picture classifier and we need to write the Dockerfile accordingly.
FROM python:3.8
ARG DEBIAN_FRONTEND=noninteractive
RUN pip3 install poetry
WORKDIR /app
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.create false
RUN poetry install --no-root
COPY . ./
ENV PYTHONPATH "${PYTHONPATH}:/app/"
In this Dockerfile we:
- first use the official image for Python 3.8.
- install poetry (our Python package manager) using pip. We recommend poetry over anaconda, as we ran into some problems running anaconda with Docker.
- set the working directory of the container to /app.
- copy both pyproject.toml and poetry.lock into the container so that poetry knows which packages to install.
- install the packages necessary to run our machine learning experiment.
- finally, set the PYTHONPATH so that scripts in certain folders can read the scripts stored in other folders.
2. Build the docker image
Building the Docker image implies that docker is installed on your system. At NINA it is possible to install Docker Desktop to use Docker on the remote server. Please contact Datahjelp, who can assist you in setting up either Docker Desktop or the VDI.
Once you have access to docker, you can build your custom image using the command:
docker build -t ml_image -f Dockerfile .
-t stands for "tag", which is the name you want to give to the image.
-f stands for "file" and takes the Dockerfile as input.
3. Push the docker image to a registry
It is possible to push your custom image directly to the GitLab or GitHub registry.
Pushing the image on GitLab registry
Pushing the image to the GitLab registry requires less manual configuration; we give an example of how to do it below.
Provided that you have a GitLab account and a GitLab project for your specific task (in our example the project is called ml_image), the image should be renamed as:
registry.gitlab.com/nina-data/ml_image:latest
We rename the image to provide a URL to the registry where the image should be stored. Once the image has been pushed, you can check it under your GitLab project -> Container registry.
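As a sketch, renaming and uploading the image can be done with docker tag and docker push; you will first need to authenticate with docker login (the registry path below matches the example above):

```shell
# Authenticate against the GitLab registry
docker login registry.gitlab.com

# Tag the local image with the registry URL, then upload it
docker tag ml_image registry.gitlab.com/nina-data/ml_image:latest
docker push registry.gitlab.com/nina-data/ml_image:latest
```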
Pushing the image on GitHub registry
In your project repository create a folder .github/workflows
and create a file publish_image.yml
containing the following code:
name: Create and publish a Docker image
on:
  push:
    branches: main
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  build-and-push-image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Log in to the Container registry
        uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@98669ae865ea3cffbcbaa878cf57c20bbf1c6c38
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
      - name: Build and push Docker image
        uses: docker/build-push-action@ad44023a93711e3deb337508980b4b5e9bcdc5dc
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
Once the folder and file have been created, simply push the .github folder to the project's GitHub repository and GitHub Actions will take care of building and hosting the image.
Note that you will find the image under Packages (right sidebar of GitHub) and that you will need to make the image public before you can pull it on Sigma2.
4. Pull the image as a .sif file from an HPC cluster
To pull the image that is stored on the GitLab, Docker Hub or GitHub registry, simply run:
singularity pull docker://registry/name_of_your_image
For instance, to pull the image from this repository (which is hosted on GitHub):
singularity pull docker://ghcr.io/ninanor/91126800_ml_and_associated_tech:main
Interact with images
Now that the .sif file has been pulled, it is possible to interact with it in multiple ways.
Shell
The shell command allows you to spawn a new shell within your container and interact with it as though it were a small virtual machine.
[bencretois@login-1.SAGA ~]$ singularity shell 91126800_ml_and_associated_tech_main.sif
Singularity>
The change in prompt indicates that you have entered the container. Note that once inside a Singularity container, you are the same user as you are on the host system; unlike with Docker, you do not get root access, so you cannot install system packages inside a running container.
Singularity> whoami
bencretois
Executing commands
The exec command allows you to execute a custom command within a container by specifying the image file. For instance, to train a machine learning model using the .sif image, we could write:
singularity exec \
91126800_ml_and_associated_tech_main.sif \
python main_scripts/train_model.py
Specifying bind paths
If the data we want to process or use to train a machine learning model is stored in a different folder (for instance our .sif file is in /cluster/projects/nn8055k but the data is in /cluster/projects/nn8054k), we need to expose /cluster/projects/nn8054k, or in other words, make it available to the container. The flag --bind fulfils that purpose.
We would run the container as follows:
singularity exec \
--bind /cluster/projects/nn8054k \
91126800_ml_and_associated_tech_main.sif \
python main_scripts/train_model.py
Exposing GPUs
When training or using a machine learning model, it is usually preferable to use one or more GPUs to accelerate processing. Since the container is an "isolated" environment, we need to specify that we want to expose GPUs to it. This is simply done by adding the flag --nv. For example:
singularity exec \
--nv \
91126800_ml_and_associated_tech_main.sif \
python main_scripts/train_model.py
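Putting the pieces together, a job script that runs the container on a GPU node could look like the following sketch (the account number, bind path, image name and script name are reused from the earlier examples and should be adapted to your project):

```
#!/bin/bash
#SBATCH --account=nnXXXXk
#SBATCH --job-name=train_in_container
#SBATCH --partition=accel --gpus=1
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=4G

singularity exec \
    --nv \
    --bind /cluster/projects/nn8054k \
    91126800_ml_and_associated_tech_main.sif \
    python main_scripts/train_model.py
```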
File management on HPC clusters
In addition to training a machine learning model, we usually want to process our data on an HPC cluster to benefit from its computational resources. For example, before training a complex machine learning model, we want the HPC cluster to first process our data into tensors that can be used as inputs for the model. Using the data requires being able to access it.
There are two ways of making the data available to the HPC cluster:
- Copying the data over to the HPC cluster
- Using a filesystem to make remote data available to the HPC cluster
Copying the data over to the HPC cluster
Both methods have their advantages and drawbacks. Copying data over to the HPC cluster is conceptually simpler, as it doesn't require any specific library, but the method is limited by the size of the dataset. By default, if you apply for an account on one of Sigma2's HPC clusters you will be allocated 1 TB of storage, which can be extended up to 10 TB. Moreover, copying multiple TB of data can be a long process.
Using a filesystem to make remote data available to the HPC cluster
On the other hand, using a filesystem abstraction allows you to use data stored on a remote server (e.g. cloud storage, private servers, ...) from the HPC cluster, removing the need for large storage capacity on the HPC cluster itself. It nevertheless requires slight changes to your code.
Copying files over to an HPC cluster
It is possible to copy / transfer files using simple bash commands: scp is used to simply copy data over, while rsync is used to synchronise folders.
scp: Copying files
The first command is scp, which allows you to copy files from your machine to the HPC machines. For instance, if you want to copy the script train_model.py from the template folder over to saga in the project folder nn8055k, you would write (using your relevant username and HPC):
$ scp train_model.py bencretois@saga.sigma2.no:/cluster/projects/nn8055k/
With scp it is also possible to copy a folder over to the HPC machine; you will need to add the flag -r for this. For instance, if I want to copy the entire template folder over to saga, I can write:
$ scp -r template bencretois@saga.sigma2.no:/cluster/projects/nn8055k
rsync: Synchronizing a local repository with a remote repository
Instead of copying all files from your local to the remote folder, you can synchronize the two folders with rsync. Synchronizing has the advantage of being more flexible than scp and includes some optimisations that make the transfer of files faster. Moreover, rsync has a plethora of command line options, allowing the user to fine-tune its behavior. It supports complex filter rules, runs in batch mode, daemon mode, etc.
$ rsync -e ssh -avz ./local_repo user@server:/remote_repo
-a is the archive option: it syncs directories recursively while keeping permissions, symbolic links, ownership, and group settings.
-v is the verbose option and prints the progress and status of the rsync command.
-z compresses files during the transfer, speeding up the sync.
-e is used to specify the remote shell to use, ssh in our case.
It is also possible to use the option --exclude to exclude some files from the synchronisation:
$ rsync -e ssh -avz --exclude "file.txt" ./local_repo user@server:/remote_repo
However, in some cases there are many files that we do not want to send to the remote repository. We can then generate a .txt file containing a list of files to exclude and pass it with --exclude-from:
$ rsync -e ssh -avz --exclude-from="list_ignore.txt" ./local_repo user@server:/remote_repo
With list_ignore.txt looking like:
folder1
file1.txt
folder2
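When experimenting with filter rules, the -n (--dry-run) flag is useful: rsync prints what it would transfer without actually copying anything:

```shell
$ rsync -e ssh -avzn --exclude-from="list_ignore.txt" ./local_repo user@server:/remote_repo
```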
Filesystem
A filesystem (or file system) is the way in which files are named and where they are placed logically for storage and retrieval. Without a filesystem, stored information wouldn't be isolated into individual files and would be difficult to identify and retrieve. As data capacities increase, the organization and accessibility of individual files are becoming even more important in data storage. 1.
graph TB;
A[Project folder] --> B[Camera trap pictures]
A --> C[Code]
A --> D[Report.doc]
B --> E[Pictures of deer]
B --> F[Pictures of birds]
C --> G[myscript.py]
Pyfilesystem: a filesystem abstraction for Python
A filesystem abstraction allows you to work with files and directories in archives, memory, the cloud etc. as easily as your local drive. It makes your code agnostic to where the data is stored.
Pyfilesystem is a filesystem abstraction for Python. This means that in your Python code you can fully replace the use of os with Pyfilesystem.
Suppose the data is stored on NIRD in the project folder /folder/my_data, but I want to process the data on SAGA to benefit from optimal computational resources. To access the data on NIRD, I would make the following changes to my code.
First, a connection to NIRD must be established. The connection to NIRD is done using the ssh protocol.
import fs  # Pyfilesystem; the ssh:// opener additionally requires the fs.sshfs plugin

connection_string = "ssh://bencretois:PASSWORD@nird.sigma2.no"

def doConnection(connection_string):
    # Open the remote filesystem described by the connection string
    myfs = fs.open_fs(connection_string)
    return myfs

my_filesystem = doConnection(connection_string)
Now that the connection is established we can list the files in /folder/my_data
:
import fs.path

def walk_audio(filesystem, input_path):
    # Recursively list all audio files in the directory with filesystem.walk
    walker = filesystem.walk(input_path, filter=['*.wav', '*.mp3'])
    for path, dirs, flist in walker:
        for f in flist:
            yield fs.path.combine(path, f.name)
all_files = walk_audio(my_filesystem, "/folder/my_data")
Now that the generator all_files has been created, it is possible to iterate over the remote files as if they were stored locally.
1 Definition on Techtarget
Case study: training a cat / dog classifier
In this section we will demonstrate how to train a cat and dog classifier using supercomputers and in particular SIGMA2.
Data
The case study is entirely reproducible and you can run it yourself provided you have an account on SIGMA2.
The data to run this case study can be found on Kaggle, a subsidiary of Google that allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. The training dataset is composed of 25,000 images of dogs and cats and weighs approximately 500 MB.
Code
All the scripts are found in the GitHub repository of this book, under case_study. The folder contains three subfolders:
- bash_scripts: contains all the bash scripts which are not to be run on HPC.
- hpc_scripts: contains all the bash scripts to be run on HPC.
- model_scripts: all the Python scripts necessary to train the model.
The folder also contains a Dockerfile which will be used for creating the Docker container.
Setting up the environment
Virtual environments versus containers
Two choices are offered to set up the environment: virtual environments and containers. We will demonstrate how we can train the cat / dog classifier using both methods.
Conda setup
WORK IN PROGRESS
Container setup
On the other hand, it might be a good idea to set up a container. Because we do not have root access on SIGMA2's HPC, we have to build a Singularity container by first building a Docker container, which is specified using a Dockerfile. Let's analyse the Dockerfile of this specific case study line by line:
- First we install python version 3.8 in our container so we can use up to date python libraries
FROM python:3.8
- Avoid the system asking questions / showing dialogs during apt-get install
ARG DEBIAN_FRONTEND=noninteractive
- Update apt-get to have up-to-date packages
RUN \
apt-get update && \
rm -rf /var/lib/apt/lists/*
- Install the package manager poetry. Note that you can install your preferred package manager, such as conda, in this step.
RUN pip3 install poetry
- Set the working directory of the container
WORKDIR /app
- The next three lines are specific to poetry. Basically, we copy both pyproject.toml (the file listing the packages we use for our analysis) and poetry.lock (the file pinning all the dependencies) into the container. Then we disable the creation of a virtual environment so that python in our container uses all our packages directly. Finally, we install our packages.
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.create false
RUN poetry install --no-root
- Copy all the files of the folder where we open the container
COPY . ./
- Set the Python path to the working directory. This way the main_scripts can access the scripts in the other folders (for instance the utils scripts).
ENV PYTHONPATH "${PYTHONPATH}:/app/"
2 - Creating the docker image
With the Dockerfile defined, we can now create our image (i.e. the "environment" in which we will run the training script). Open a terminal, move to the folder where your Dockerfile is located and write the following command (change case_study_1 to the name of your folder):
docker build -t case_study_1:latest -f Dockerfile .
The command should output the following:
Sending build context to Docker daemon 244.7kB
Step 1/10 : FROM python:3.8
---> 271c1bcd4489
Step 2/10 : ARG DEBIAN_FRONTEND=noninteractive
---> Using cache
---> 0965e91032c6
Step 3/10 : RUN apt-get update && rm -rf /var/lib/apt/lists/*
---> Using cache
---> 02fa21122354
Step 4/10 : RUN pip3 install poetry
---> Using cache
---> 33bbd2c53863
Step 5/10 : WORKDIR /app
---> Using cache
---> 7720e687da9c
Step 6/10 : COPY pyproject.toml poetry.lock ./
---> Using cache
---> 345244b7ba43
Step 7/10 : RUN poetry config virtualenvs.create false
---> Using cache
---> 47271847855f
Step 8/10 : RUN poetry install --no-root
---> Using cache
---> 67e487ef8ae6
Step 9/10 : COPY . ./
---> c7656f1447fa
Step 10/10 : ENV PYTHONPATH "${PYTHONPATH}:/app/"
---> Running in 9f82e9b70fc5
Removing intermediate container 9f82e9b70fc5
---> b08815df7070
Successfully built b08815df7070
Successfully tagged case_study_1:latest
Indicating that the image has been successfully created.
Training model locally
First of all, we should make sure that our scripts work well on the VDI before spending computational resources on SIGMA2. To develop and improve our scripts we can leverage the power of Docker by using the created image in two ways.
First, we can start a Jupyter instance inside the Docker container so you can develop in a more interactive environment while having all the Python libraries you need. For this you can use the script docker_start_jupyter in ml-sats/bash_utils.
The script
#!/bin/bash
cd ~/Code/deepexperiments
docker run \
-p 8889:8889 \
--rm -it \
-v $PWD:/app \
-v $HOME/Data:/Data \
case_study_1:latest \
poetry run jupyter lab \
--port=8889 --no-browser --ip=0.0.0.0 --allow-root
You can also use the Docker image to run the training script on any computer using the script case_study_1/bash_scripts/train_model.sh. Note that you need to change the folders that are exposed (for the meaning of an exposed folder refer to the document XXX).
#!/bin/bash
# -e: stop on error
# -u: raise an error if a variable is undefined
# -o pipefail: trigger an error when a command in the pipe fails
set -euo pipefail
cd $HOME/Code/case_study_1
DATA_PATH=/Data/train
OUT_DIR=/Data/
docker run --rm -v $HOME/Data:/Data -v $PWD:/app case_study_1:latest \
python -u /app/main_scripts/train_model.py \
--data_path $DATA_PATH \
--save_path $OUT_DIR/model.pt \
--save_es $OUT_DIR/model.pt \
--batch_size 128 \
--lr 0.001 \
--num_epoch 10
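The flags passed to train_model.py above map onto an argument parser inside the script. As a hypothetical reconstruction (based only on the flags used in the bash script, the actual interface may differ), it could look like this:

```python
import argparse

def build_parser():
    # Illustrative sketch of train_model.py's command-line interface;
    # names and defaults are inferred from the bash script above.
    p = argparse.ArgumentParser(description="Train the cat/dog classifier")
    p.add_argument("--data_path", required=True, help="folder with the training images")
    p.add_argument("--save_path", required=True, help="where to save the final model")
    p.add_argument("--save_es", required=True, help="where to save the early-stopping checkpoint")
    p.add_argument("--batch_size", type=int, default=128)
    p.add_argument("--lr", type=float, default=0.001)
    p.add_argument("--num_epoch", type=int, default=10)
    return p

# Parse the same arguments the bash script passes
args = build_parser().parse_args(
    ["--data_path", "/Data/train", "--save_path", "/Data/model.pt",
     "--save_es", "/Data/model_es.pt", "--lr", "0.001"]
)
```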
We are finally ready to train the model. Since this is a test and the main model will be trained on SIGMA2, we run the model only for a few epochs to be sure the code does not contain any bugs. Running the script (./bash_scripts/train_model.sh) should output the following:
benjamin.cretois@nixml086424q01:~/Code/ml-sats/case_study_1$ ./bash_scripts/train_model.sh
./bash_scripts/train_model.sh: line 3: cd: /home/benjamin.cretois/Code/case_study_1: No such file or directory
/usr/local/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Epoch: 0, Training loss: 0.6700030667766644, Validation loss: 0.6518245041370392
Validation loss decreased (inf --> 0.651825). Saving model ...
Epoch: 1, Training loss: 0.6335798464003642, Validation loss: 0.6232574313879014
Validation loss decreased (0.651825 --> 0.623257). Saving model ...
Epoch: 2, Training loss: 0.5948401207377196, Validation loss: 0.5763850644230842
Validation loss decreased (0.623257 --> 0.576385). Saving model ...
Epoch: 3, Training loss: 0.553090987691454, Validation loss: 0.5323234401643276
Validation loss decreased (0.576385 --> 0.532323). Saving model ...
Epoch: 4, Training loss: 0.5246088358627004, Validation loss: 0.5051904194056988
Validation loss decreased (0.532323 --> 0.505190). Saving model ...
Epoch: 5, Training loss: 0.48994877194143405, Validation loss: 0.4672641947865486
Validation loss decreased (0.505190 --> 0.467264). Saving model ...
Epoch: 6, Training loss: 0.4647162993242786, Validation loss: 0.45798871740698816
Validation loss decreased (0.467264 --> 0.457989). Saving model ...
Epoch: 7, Training loss: 0.4354040877074952, Validation loss: 0.42637150436639787
Validation loss decreased (0.457989 --> 0.426372). Saving model ...
Epoch: 8, Training loss: 0.4093162176335693, Validation loss: 0.41202530562877654
Validation loss decreased (0.426372 --> 0.412025). Saving model ...
Epoch: 9, Training loss: 0.3919389988206754, Validation loss: 0.39200695902109145
Validation loss decreased (0.412025 --> 0.392007). Saving model ...
Finished Training
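The "Validation loss decreased ... Saving model" messages above follow a standard checkpoint-on-improvement pattern: the model is saved only when the validation loss beats the best value seen so far. A minimal sketch of that logic (illustrative names, not taken from train_model.py):

```python
import math

class Checkpointer:
    """Save the model whenever the validation loss improves (early-stopping style)."""

    def __init__(self):
        self.best = math.inf   # best validation loss seen so far
        self.saves = 0         # how many times we checkpointed

    def step(self, val_loss):
        # Checkpoint only when the validation loss improves
        if val_loss < self.best:
            self.best = val_loss
            self.saves += 1    # here you would call torch.save(model.state_dict(), path)
            return True
        return False

ck = Checkpointer()
# Simulated per-epoch validation losses: the third epoch does not improve
results = [ck.step(v) for v in [0.65, 0.62, 0.63, 0.57]]
# results == [True, True, False, True]
```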
Training model on SIGMA2
4 - Pushing the image to the Gitlab repository
On SIGMA2 we cannot use our Docker image immediately as it has been created on our local computer. As explained here, SIGMA2 uses singularity, which is another software for handling images.
First we need to push our Docker image to GitLab (we use GitLab instead of GitHub as it is easier to handle Docker images on GitLab). This has to be done in 3 steps:
- First, be sure you are logged in to the GitLab registry:
docker login registry.gitlab.com
- Then we need to rename the Docker image according to the GitLab repository where it will be stored. For instance, the image we built, case_study_1:latest, will be stored in registry.gitlab.com/nina-data/ml-sats/, thus we need to rename the image as registry.gitlab.com/nina-data/ml-sats/case_study_1:latest:
docker tag case_study_1:latest registry.gitlab.com/nina-data/ml-sats/case_study_1:latest
- Then we can push the image to the GitLab repository:
docker push registry.gitlab.com/nina-data/ml-sats/case_study_1:latest
5 - Training the model on SIGMA2
We first need to pull the image we stored on the GitLab repository into our folder on SIGMA2. singularity uses .sif files, so when we pull the image we need to relabel it as a .sif file:
singularity pull --name case_study_1.sif docker://registry.gitlab.com/nina-data/ml-sats/case_study_1
Now that case_study_1.sif has been created in our folder, we can train the model on SIGMA2 using the script case_study_1/hpc_scripts/train_model.sh. The script contains a few lines worth noticing:
- First the shebang, to tell the interpreter that this is a bash script:
#!/bin/bash
- The SIGMA2-specific lines. Here we tell the HPC server that running our script requires 1 GPU for a maximum of 24 hours; we also ask for 4 GB of memory per CPU:
#SBATCH --account=nn5019k --job-name=cat_dog_model
#SBATCH --partition=accel --gpus=1
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=4G
- We change our current directory to the directory where our scripts are located
cd $HOME/ml-sats/case_study_1
- Since we have 24 hours of GPU time, we run our script with three different learning rates. In this specific case the script will be run three times consecutively: once the run with a learning rate of 0.01 is finished, the script is run again with a learning rate of 0.001, and then 0.0001. We will learn about parallelizing on SIGMA2 in case_study_2.
for LR in 0.01 0.001 0.0001
do
- Finally we set up the singularity container for running the script train_model.py and specify all the relevant parameters. With singularity, by default only the working directory is exposed (in our case $HOME/ml-sats), so we need to manually expose the folder where the data is stored and where we want the model to be saved with the option --bind. The --nv option makes the host GPU available inside the container. Then with python we ask the container to use python to run the script train_model.py. The option -u is useful to display the printed text in the .out file:
singularity exec --bind /cluster/projects/nn5019k:/Data \
--nv case_study_1.sif \
python -u main_scripts/train_model.py \
--data_path /Data/kaggle_cats_dogs/train \
--save_path /Data/saved_models/model.pt \
--save_es /Data/saved_models/model_es.pt \
--batch_size 128 \
--lr $LR \
--num_epoch 100
done
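Assembled, the full hpc_scripts/train_model.sh job script is therefore exactly the pieces discussed above:

```bash
#!/bin/bash

#SBATCH --account=nn5019k --job-name=cat_dog_model
#SBATCH --partition=accel --gpus=1
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=4G

# Move to the folder containing our scripts
cd $HOME/ml-sats/case_study_1

# Run the training three times, once per learning rate
for LR in 0.01 0.001 0.0001
do
    singularity exec --bind /cluster/projects/nn5019k:/Data \
        --nv case_study_1.sif \
        python -u main_scripts/train_model.py \
        --data_path /Data/kaggle_cats_dogs/train \
        --save_path /Data/saved_models/model.pt \
        --save_es /Data/saved_models/model_es.pt \
        --batch_size 128 \
        --lr $LR \
        --num_epoch 100
done
```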
- Now we can start the model training by submitting our script as a job with the command:
sbatch train_model.sh
6 (optional) - Importing the trained model locally
If we want to use our trained model on our own computer, we can import the model using the command scp:
scp bencretois@sage.sigma2.no:/cluster/projects/nn5019k/saved_models/model.pt $HOME/Code/ml-sats/case_study_1
Chapter 5: Advanced topics
Ray tune
Acknowledgments
This document would not have been possible without the inputs of Francesco Frassinelli, Kjetil Grun and Stig Clausen.
List of acronyms
| Acronym | Term | Description |
|---|---|---|
| GPU | Graphics Processing Unit | A specialized processor originally designed to accelerate graphics rendering. Used extensively in machine learning to train complex algorithms. |
| GUI | Graphical User Interface | A form of user interface that allows users to interact with electronic devices through graphical icons and audio indicators such as primary notation. |
| ML | Machine Learning | The use and development of computer systems that are able to learn and adapt without following explicit instructions. |
| DL | Deep Learning | Part of the family of machine learning methods based on artificial neural networks with representation learning. |