HPC#
This chapter covers introductory concepts in High Performance Computing (HPC), including:
components of an HPC cluster
distributed computing
resource managers, particularly Slurm
the EasyBuild and module systems
batch processing small tasks
applications
HPC can be considered as a combination of powerful hardware systems, driven by the need for large memory, compute power, storage and fast networking, together with software tools capable of fully exploiting this hardware through specialised libraries and parallel programming models. HPC is also used to refer to parallel computing: dividing work into smaller pieces and executing those pieces simultaneously. This is done using supercomputers and computer clusters to solve advanced computational problems.
HPC can benefit researchers who are modelling problems that involve large amounts of repetitive calculation on data that will not fit on a single computer, and that need to be solved in a timely manner or explored for potential solutions. HPC complements traditional scientific and engineering methods by performing numerical calculations on real-world problems that are typically too expensive to tackle with conventional computing. These include large simulations that would otherwise require costly physical prototypes or experiments, such as a passenger jet, a Formula 1 car, or wind tunnel tests. Some models are too slow to converge to a solution on conventional hardware, such as those used in climate modelling, galactic evolution, or earthquake prediction. Other problems come from dangerous or controversial areas, such as nuclear weapons testing or environmental impact assessments.
Today, HPC-capable architectures have become accessible to everyone with access to affordable modern hardware, ranging from small, inexpensive clusters built from commodity hardware to the fastest supercomputers. HPC is now used around the world by individuals and large groups alike, in research labs and in the public and private sectors. In recent years there has been a growing demand for computational power in nearly every branch of science and engineering.
One of the primary drivers of this demand is the advance of Artificial Intelligence (AI). As AI applications expand across fields including healthcare, finance, and autonomous systems, the need for HPC has become increasingly critical. HPC provides the computational power necessary for processing huge amounts of data and executing complex algorithms. This has resulted in advances in hardware, such as GPUs and TPUs designed specifically to accelerate AI workloads. In addition, new software tools, such as the PyTorch and TensorFlow frameworks, were developed alongside these hardware innovations, enabling more efficient utilisation of computational resources.
HPC Cluster Components#
An HPC cluster is typically a tightly integrated network of computers with largely homogeneous software and hardware environments. Each computer is typically known as a ‘node’, although more generally a node is an abstract addressable resource in the network.
Nodes can perform specific functions. For example, ‘login’ nodes typically host a software environment that allows a user to get shell access to the overall HPC system via the internet, but are not intended for intensive computation. ‘Compute’ nodes are focused on compute-intensive applications and may or may not have direct internet access. Support, management or control nodes run the networking, storage and other system tooling.
Compute nodes might be further broken into different specialities: some may have large numbers of CPUs to support standard computing workloads, while others might provide a mix of CPUs and GPUs to support workloads that can leverage GPU acceleration.
Cluster nodes are typically connected by high bandwidth network interconnects due to the requirement for performant data interchange between them during problem solution.
Distributed Computing#
Most HPC applications need to take advantage of many processors at the same time, that is, parallel computing. Two primary paradigms in this area are ‘message passing’ between distinct addressable processes, which in HPC is often implemented following the Message Passing Interface (MPI) standard, and shared-resource (usually shared-memory) programming within a node, which is usually facilitated by OpenMP multithreading.
MPI-aware programs are typically built against an implementation’s API headers and libraries, often via a compiler wrapper. They are then launched with the equivalent mpiexec or mpirun commands.
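As a minimal sketch, assuming an MPI implementation such as Open MPI is available and a source file hello_mpi.c exists (both the file name and process count are illustrative), building and launching might look like:
mpicc hello_mpi.c -o hello_mpi    # the compiler wrapper adds the MPI headers and libraries
mpirun -np 4 ./hello_mpi          # launch four MPI processes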
On HPC systems there is usually a tight integration between the (Distributed) Resource Manager (DRM), which has awareness of all system resources, such as CPUs, and the implementation of MPI, which allows applications to address the resources in a simple way.
MPI implementations can be integrated with DRMs via a Process Management Interface (PMI).
Recently there has been some standardization of these interfaces:
PMIx (Process Management Interface - Exascale) is a standard for libraries that support distributed and parallel computing systems. You may see it in lower level documentation for applications such as Slurm and Open MPI.
PRRTE is the reference runtime environment for PMIx. Some tools like mpirun may just be thin vendored wrappers around it. PRRTE is the successor of ORTE, which you might still see in some applications.
This means that when using a resource manager like Slurm, provided the cluster is suitably configured, Slurm itself can be used to launch MPI tasks with a suitable runtime environment, for example via the srun command.
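As a hedged sketch on a suitably configured Slurm cluster (the node and task counts and the executable name are illustrative):
salloc --nodes=2 --ntasks-per-node=4    # request an interactive allocation
srun ./hello_mpi                        # Slurm launches the eight MPI tasks across the allocation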
Further Reading#
Resource Managers and Slurm#
HPC systems have multiple users attempting to use their resources (CPUs, GPUs) at the same time. It is thus necessary to have some system for allocating these resources in a sensible way.
Slurm is a particular implementation of such a Resource Manager or Job Scheduler. It is configured with knowledge of the available compute resources and is then used to manage the submission of user ‘jobs’ to run on them.
Slurm has specific terminology related to performing work on a cluster. A node is a particular compute resource. A partition is a logical grouping of nodes (which can overlap). A job is an allocation of resources to a particular user for a particular amount of time. Job steps are sets of tasks within a job, which can be run in parallel.
The srun command is used to launch job steps in Slurm, and optionally allocates the required resources.
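A minimal batch script sketch illustrating these terms (the partition name and program below are placeholders rather than anything guaranteed to exist on your cluster) might look like:
#!/bin/bash
#SBATCH --job-name=example       # the job: an allocation of resources for a limited time
#SBATCH --partition=compute      # partition: a logical grouping of nodes
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=00:10:00
srun ./my_program                # a job step launched on the allocated resources
The script would be submitted with sbatch, after which the job waits in the queue until the requested resources become available.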
Modules and EasyBuild#
HPC systems often provide applications that will be useful to a large fraction of their users. This both saves users from installing duplicate software and reflects the fact that much HPC software needs to be built to target a specific system for optimal performance.
The modules application, which has several different implementations, allows application runtime environments to be loaded into the system paths, meaning that applications can be launched by end users after loading the corresponding module. This approach, distinct from having all applications available at all times, allows for easier management of dependencies and of multiple versions of software at the same time.
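For example, a typical session might look like the following (the module name is illustrative and will vary between systems):
module avail            # list the modules available on the system
module load OpenMPI     # add the application's paths to the current environment
module list             # show the currently loaded modules
module unload OpenMPI   # remove it again when no longer needed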
The EasyBuild tool is often used to build custom modules. It specialises in making portable builds that do not need to go into standard system-wide installation locations, and it comes with a large number of recipes for commonly used scientific applications.
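As a hedged sketch of typical EasyBuild usage (the easyconfig name below is purely illustrative):
module load EasyBuild                      # EasyBuild is often itself provided as a module
eb --search GROMACS                        # search the bundled recipes (easyconfigs)
eb GROMACS-2023.3-foss-2023a.eb --robot    # build the application and its dependencies
The resulting installation is then typically exposed as a new module for users to load.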
Batch Processing Small Tasks#
While it is traditional in HPC for a single application (MPI or OpenMP based) to utilise all available resources across one or more nodes during a computation, there are still many cases where it is desirable to launch many smaller tasks within a single resource allocation. This is particularly the case in parameter sweeps or data-driven applications, where a high degree of resource utilisation is possible with a suitably designed workflow.
This section reviews some methods for running small tasks in the context of a single Slurm resource allocation, starting with simple tools like xargs, available on all machines, through to more powerful workflow managers.
xargs#
xargs is a command line utility that builds and executes command lines from standard input. It is part of the POSIX specification, meaning it is available by default on the majority of Unix-like systems.
Although not specifically developed for running tasks in parallel, it is capable of doing so via the -P option. Take this simple script, my_echo.sh, for example:
#! /bin/bash
MINWAIT=1
MAXWAIT=5
sleep $((MINWAIT+RANDOM % (MAXWAIT-MINWAIT)))
echo First arg is: $1, second arg is: $2
which will sleep for between one and four seconds and then echo the first two command line arguments it is passed. We can run this script in parallel, passing each instance a different input argument. If we collect our arguments in a file my_input.txt, such as:
task_1
task_2
task_3
task_4
task_5
task_6
we can run six instances of this script, with a maximum of three at a time, with:
xargs -n 1 -P 3 ./my_echo.sh first_arg < my_input.txt
to give:
First arg is: first_arg, second arg is: task_1
First arg is: first_arg, second arg is: task_3
First arg is: first_arg, second arg is: task_2
First arg is: first_arg, second arg is: task_4
First arg is: first_arg, second arg is: task_5
First arg is: first_arg, second arg is: task_6
Here the -n 1 constraint makes sure we only pass one argument from the input file to each script invocation; without it, all arguments would be passed to a single invocation at once. Setting -P to 0 will run as many processes as possible at a time.
xargs is a simple and widely available tool, but it can be easy to make mistakes when formatting the file of input arguments. In addition, you need to manage the stdout and stderr streams from within your program if you rely on the output for further processing, and make sure that any produced files have unique names (which can be derived from an input argument).
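For example, one way to capture each task's output separately is to have xargs substitute the input argument into both the command and an output file name (the log file names here are illustrative):
xargs -P 3 -I {} sh -c './my_echo.sh first_arg {} > out_{}.log 2>&1' < my_input.txt
Here -I {} consumes one input line at a time and replaces {} wherever it appears, so each task writes to its own log file such as out_task_1.log.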
The next tool discussed, GNU Parallel, is more feature-rich and forgiving to use than xargs, with the downside that it is not available by default on most Linux systems (it is, however, available via modules on many HPC systems).
GNU Parallel#
GNU Parallel is a free and open source utility for running commands in parallel. It has a reasonably similar syntax to xargs but provides many more useful features. Continuing the previous example, we can run it in parallel with:
parallel -j 3 ./my_echo.sh first_arg {} < my_input.txt
Note that -j is used to specify the maximum number of processes, while omitting it will use all available cores. Another difference is the use of the placeholder {} for the argument coming from the input file.
Note: the first time you run parallel you will see a citation banner, which can be silenced for future runs by running parallel --citation once. Please do read it and consider supporting the author by citing their work if it is useful to you.
@software{tange_2024_12789352,
author = {Tange, Ole},
title = {GNU Parallel 20240722 ('Assange')},
month = Jul,
year = 2024,
note = {{GNU Parallel is a general parallelizer to run
multiple serial command line programs in parallel
without changing them.}},
publisher = {Zenodo},
doi = {10.5281/zenodo.12789352},
url = {https://doi.org/10.5281/zenodo.12789352}
}
Parallel has some extra utilities to help distinguish tasks and which ‘slot’ (core) they are running on. Extending the my_echo.sh example to:
#! /bin/bash
MINWAIT=1
MAXWAIT=5
sleep $((MINWAIT+RANDOM % (MAXWAIT-MINWAIT)))
echo First arg is: $1, second arg is: $2. Task is: $3, core: is $4
and running:
parallel -j 3 ./my_echo.sh first_arg {} {#} {%} < my_input.txt
where {#} and {%} are special placeholders that pass the task number and the slot (core) number as input arguments, will produce:
First arg is: first_arg, second arg is: task_1. Task is: 1, core: is 1
First arg is: first_arg, second arg is: task_3. Task is: 3, core: is 3
First arg is: first_arg, second arg is: task_4. Task is: 4, core: is 1
First arg is: first_arg, second arg is: task_2. Task is: 2, core: is 2
First arg is: first_arg, second arg is: task_6. Task is: 6, core: is 1
First arg is: first_arg, second arg is: task_5. Task is: 5, core: is 3
These extra arguments, particularly the task number, can be useful for naming output files to make sure that they are not overwritten by other tasks.
The parallel tool has many other flags and options; however, the above should be sufficient for many simple use cases. You can simply create a file with the input parameters for your experiment (for example using a script), which can be paths to input or config files for solvers or parameter values for parameter sweeps, and pass that file along with your launcher script to parallel.
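For example, the task number placeholder can give every task its own output file, or the --results option can let parallel organise an output directory itself (the file and directory names are illustrative):
parallel -j 3 './my_echo.sh first_arg {} > output_{#}.txt' < my_input.txt
parallel -j 3 --results outdir ./my_echo.sh first_arg {} < my_input.txt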
For further reading see the links in the Further Reading section below.
Taskfarm#
Taskfarm is a Python script developed by ICHEC for running tasks in parallel. It has some similarities to GNU Parallel, but fewer features, and it is used only on ICHEC HPC systems. GNU Parallel may be more suitable for general batching needs; however, Taskfarm is supported by ICHEC if any issues arise and is documented for use on ICHEC systems.
As a simple example of its use, and to allow comparison with the previously introduced tools, you can load the taskfarm module on a supported HPC system:
module load taskfarm
and then run:
taskfarm my_tasks.txt
where my_tasks.txt looks as follows:
./my_echo.sh first_arg task_1 %TASKFARM_TASKNUM% 0
./my_echo.sh first_arg task_2 %TASKFARM_TASKNUM% 0
./my_echo.sh first_arg task_3 %TASKFARM_TASKNUM% 0
./my_echo.sh first_arg task_4 %TASKFARM_TASKNUM% 0
./my_echo.sh first_arg task_5 %TASKFARM_TASKNUM% 0
./my_echo.sh first_arg task_6 %TASKFARM_TASKNUM% 0
where %TASKFARM_TASKNUM% is a substitution that taskfarm will perform, inserting the task number. The final argument is 0 just for the purposes of this example, since taskfarm does not provide the core number.
This format has similarities with GNU Parallel and ultimately can achieve the same thing but is more verbose if the same launch script and default arguments are being used for all tasks.
Further Reading#
NERSC Docs on Workflow tools: https://docs.nersc.gov/jobs/workflow/
UL HPC Tutorial on GNU Parallel: https://ulhpc-tutorials.readthedocs.io/en/latest/sequential/gnu-parallel/
DOE Workflow Training Material: CrossFacilityWorkflows/DOE-HPC-workflow-training
Sulis HPC Workflow materials: https://sulis-hpc.github.io/advanced/ensemble/jobarrays.html
CeCi HPC Workflow management: https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/WorkflowManagement.html
HPC Wiki: https://hpc-wiki.info/hpc/Multiple_Program_Runs_in_one_Slurm_Job
Lumi notes on task affinity: https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/distribution-binding/#__tabbed_1_4
LRZ notes on farming with GNU parallel: https://doku.lrz.de/job-farming-with-gnu-parallel-10746427.html
Globus HPC data flow management: https://www.globus.org/what-we-do
ICHEC HPC Systems#
Since 2005, ICHEC has hosted a number of HPC systems, which are listed in the ICHEC HPC Museum:
Walton (IBM Ethernet cluster, 2005 - 2008)
Hamilton (Bull Shared Memory, 2005 - 2008)
Lanczos (IBM Blue Gene/L, 2008 - 2010)
Schrödinger (IBM Blue Gene/P, 2008 - 2011)
Stokes (SGI ICE, 2008 - 2013)
Stoney (Bull NovaScale cluster, 2009 - 2014)
Fionn (SGI ICE-X, 2014 - 2018)
Kay (Intel Xeon Gold, 2018 - 2024)
The Kay supercomputer reached its five-year end of life in November 2023. Due to a delay in the national funding of the replacement system, CASPIr, an interim platform was required to provide the compute cycles for the National HPC Service; hence the Interim National Service currently runs on the Meluxina system but continues to be supported by ICHEC.
Meluxina is Luxembourg’s supercomputer hosted by LuxProvide. It is powered by AMD processors. It supports a variety of applications, including scientific simulations, data analysis, machine learning, and AI. For access and system details see here.
ICHEC Training Cluster#
Sciprog is a 2-core machine with Intel® Xeon® Platinum 8259CL CPUs @ 2.50GHz, hosted on Amazon EC2 (AWS). As of January 2025 it runs Rocky Linux 9.5 (Blue Onyx). ICHEC uses sciprog for practicals in the “Scientific Programming” and “HPC and Parallel Programming” modules.
Every semester, before the module starts, we ask the systems team to refresh the accounts and create a new password. User accounts are named sp#, where # represents a number between 1 and 100 (or up to 150, depending on the number of students). We also provide students with instructions on how to SSH into sciprog and share the login password. The hostname is sciprog.training.ichec.ie. Students have to change the password at their first login.
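For example, a student with the (hypothetical) account sp42 would connect, and then change the password with passwd if not prompted to do so automatically:
ssh sp42@sciprog.training.ichec.ie
passwd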
GCC compilers are available for compiling OpenMP programs on sciprog.
$ gcc --version
gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-2)
GCC version 11.5.0 fully supports OpenMP 5.0. Include omp.h in C codes and use omp_lib in Fortran codes. Compile C codes with gcc and Fortran codes with gfortran, linking with the -fopenmp flag.
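A minimal sketch of compiling and running an OpenMP program on sciprog (the source file names are hypothetical):
gcc -fopenmp omp_hello.c -o omp_hello_c          # C source includes omp.h
gfortran -fopenmp omp_hello.f90 -o omp_hello_f   # Fortran source uses omp_lib
OMP_NUM_THREADS=2 ./omp_hello_c                  # run with two threads (sciprog has 2 cores)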
We use the SimGrid simulator for compiling and running MPI codes on sciprog. SimGrid simulates a virtual platform for parallel and distributed applications, allowing us to test programs without needing the actual hardware.
$ /opt/simgrid/bin/smpicc --version
SimGrid version 3.35
Both platform.xml and hostfile are used to define the simulation environment for SimGrid. They help set up the topology and configuration of the simulated distributed system.
platform.xml is an XML file that describes the structure of the virtual machines (hosts), their computing resources (CPU, memory), and the network (latencies, bandwidth) in the simulation. The example file below sets up a virtual environment with two hosts (host0 and host1) connected by a network link (link0). Both hosts have equal processing power of 1Gf (one gigaflop per second), and they are connected by a network link with a bandwidth of 125 MBps and a latency of 100 microseconds.
The <config> element is used to set global properties. Here, it defines the host speed as 1Gf.
<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4.1">
  <config>
    <prop id="smpi/host-speed" value="1Gf"/>
  </config>
  <zone id="AS0" routing="Full">
    <host id="host0" speed="1Gf"/>
    <host id="host1" speed="1Gf"/>
    <link id="link0" bandwidth="125MBps" latency="100us"/>
    <route src="host0" dst="host1"><link_ctn id="link0"/></route>
  </zone>
</platform>
You can find network topology examples here.
hostfile is a plain text file that lists the available hosts in the simulation, specifying which machines are used for the execution of the MPI program.
For example:
host0
host1
SimGrid version 3.35 supports the MPI-1 and MPI-2 standards. It does not include the more advanced features introduced in MPI-3, such as non-blocking collectives or improved one-sided communication.
Include mpi.h in C codes and use mpi in Fortran codes. Compile C codes with /opt/simgrid/bin/smpicc and Fortran codes with /opt/simgrid/bin/smpif90.
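For example, a C MPI source file (the file name is hypothetical) could be compiled with:
/opt/simgrid/bin/smpicc -O2 mpi_hello.c -o mpi_hello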
It is convenient to set aliases for the MPI compilers and launcher in the .bashrc file as follows:
alias mpicc='/opt/simgrid/bin/smpicc'
alias mpif90='/opt/simgrid/bin/smpif90'
alias mpirun='/opt/simgrid/bin/smpirun'
To run an MPI application:
mpirun -platform platform.xml -hostfile hostfile.txt -np 2 ./a.out
Further Reading#
For ICHEC courses and graduate modules related to HPC, please check the training page.