Performance#

Tools#

This section lists performance engineering tools. It covers the tools provided by VI-HPS or taught in the EuroCC trainings, tools used in ICHEC projects, and Python-specific performance tools. Many other options are available for HPC applications.

Tools can be categorised by purpose:

Performance analysis tools provide information about the runtime behavior of an application, performance bottlenecks and optimisation potentials.
Debugging tools investigate an application for possible errors to help fix the problems.
Correctness checking tools detect errors in the usage of programming models, such as data races in OpenMP.

Tools can also be categorised by the parallel programming models or platforms they support, by compiler support, and by the way they are built and run on HPC clusters. The VI-HPS tools guide lists the properties of the performance tools developed by their partner institutions and specifies the focus categories for each.

VI-HPS Performance Tools#

VI-HPS is a collaborative research initiative focused on developing advanced programming tools and techniques for HPC. The main tools introduced in the 46th VI-HPS Tuning Workshop are:

Score-P/Scalasca/CUBE: This is an integrated performance analysis toolkit developed by JSC (Jülich Supercomputing Centre). Score-P collects performance data in profiles and execution traces, Scalasca analyses and identifies performance issues, and CUBE provides visualisation of certain metrics and their distribution on functions and resources. Scalasca and Score-P write analysis reports in CUBE format and these reports can also be viewed by some other tools such as Vampir and TAU.
Extrae/Paraver/Dimemas: These are performance analysis tools developed by BSC (Barcelona Supercomputing Centre). Extrae captures detailed execution traces. The information collected by Extrae is stored in three different files. Paraver reads the data in these files and provides powerful visualisation and analysis capabilities to help identify performance bottlenecks and optimise parallel code. Dimemas is a simulator used to predict the application’s behavior on different systems.
MUST/Archer/OTF-CPT: The correctness analysis tools MUST and Archer and the performance analysis tool OTF-CPT were developed by the IT Center of the RWTH Aachen University. Archer is a dynamic data race detector for OpenMP programs. MUST is a runtime error detection tool for MPI applications. OTF-CPT is a lightweight performance analysis tool that reports a summary of POP metrics for MPI+OpenMP applications.
MAQAO: The MAQAO toolkit was developed by the performance tools team at UVSQ (LI-PaRAD Laboratory). It analyses performance at the binary level, with a focus on core performance. It comes with six main modules:

LProf (Lightweight Profiler) as a sampling-based profiler that provides a list of hot spots (loops and/or functions) collected during program execution;
CQA (Code Quality Analyzer) as a static analyser assessing the quality of the executable and producing a set of reports describing potential issues, estimations of the gain if fixed, and hints on how to achieve this;
ONE View as a module that invokes other modules and aggregates their results to produce reports in HTML or XLS format;
VProf as a value profiler relying on instrumentation through binary rewriting;
DECAN (DECremental ANalyzer) as a module using differential analysis on innermost loops to locate performance issues;
ASSIST as a prototype code restructuring tool implementing advanced Profile Guided Optimisation techniques.

CARM: It is a performance analysis tool that focuses on roofline modelling, developed by INESC-ID, Instituto Superior Técnico, Universidade de Lisboa. Roofline modelling provides visualisation of the performance limits of an application based on its memory bandwidth and peak floating-point performance. It also provides recommendations for optimising the application’s performance. This paper explains the roofline modelling.
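To make the roofline idea concrete, the following is a minimal sketch of the basic roofline bound, in which attainable performance is the smaller of the compute roof and the memory roof (bandwidth multiplied by arithmetic intensity). The function name and the machine figures are hypothetical values chosen for illustration. This is not CARM's own implementation: it uses a single memory roof only.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Basic roofline bound: min(compute roof, memory roof).

    arithmetic_intensity is in FLOP per byte moved from memory.
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK = 1000.0
BANDWIDTH = 100.0

for ai in (0.25, 1.0, 10.0, 40.0):
    bound = attainable_gflops(PEAK, BANDWIDTH, ai)
    # Below the ridge point (PEAK / BANDWIDTH) the memory roof is the limit.
    regime = "memory-bound" if bound < PEAK else "compute-bound"
    print(f"AI = {ai:5.2f} FLOP/byte -> at most {bound:7.1f} GFLOP/s ({regime})")
```

A kernel whose measured performance sits well below its bound at a given arithmetic intensity is a candidate for optimisation. Which roof it sits under suggests whether to work on memory traffic or on compute.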

These tools are freely available for download and use in applications. Please see the VI-HPS tools guide for supported languages and platforms.

Community Tools#

Note: Many of the profiling tools and approaches in this section support ‘Unix’ systems. However, recent macOS releases have made it challenging to use any profiling tool other than Apple's Instruments application, which has several limitations. This section therefore focuses mostly on profiling on Linux systems and on profiling remote codes, which is likely to be necessary when working on a Mac.

gprof, the ‘GNU profiler’, is a widely used profiler for C and C++ applications on Unix systems. Compiling with the -pg flag instruments the program for gprof; adding debugging symbols (-g) is also useful for line-by-line profiling:
```
gcc hello.c -o hello -g -pg
```
Running the instrumented executable (here ./hello) creates a file gmon.out in the working directory, which can be post-processed with the gprof tool:
```
gprof hello gmon.out > profile_output
```
You can then take a look at the output:
```
nano profile_output
```
LIKWID: LIKWID is a lightweight command-line performance tool suite for the GNU/Linux operating system. The likwid-topology tool provides a detailed view of the thread and cache topology of a single node. This is very handy because it is important to understand the hardware architecture before analysing the performance of parallel applications. likwid-perfctr is used to measure hardware performance counters for the application. likwid-pin is used to control the affinity of tasks to specific CPU cores or NUMA nodes. The ICHEC handbook has a section on LIKWID, created in 2021, which explains the LIKWID tools in detail. For more tools and more up-to-date information, please check the official website.
TAU: TAU is a profiling and tracing tool. It gathers performance data through instrumentation, which can be inserted automatically by an instrumentor tool based on the Program Database Toolkit (PDT), dynamically using DyninstAPI, at runtime in the Java Virtual Machine, or manually using the instrumentation API. Collected data can be visualised in the ParaProf profiler. Event traces can be displayed with the Jumpshot trace visualisation tool.
PAPI: PAPI is a library that provides an interface to hardware performance counters found across the system, such as in CPUs, GPUs, memory, interconnects, the I/O system, and energy and power components. Hardware counters, such as CPU cycles, cache misses, and branch mispredictions, provide valuable information about code sections that can be improved. While it can be used as a standalone tool, many performance analysis tools, such as Scalasca, Score-P and TAU, leverage it as a backend to collect performance data.
DARSHAN: DARSHAN is a lightweight I/O profiler for frameworks such as MPI-IO, HDF5, PNetCDF, and standard POSIX calls. It collects performance data through I/O instrumentation at link-time (for static and dynamic executables) or at runtime and produces a log of information that can be viewed by DARSHAN commands.
Valgrind: Valgrind is an instrumentation framework for building debugging and profiling tools. In the area of profiling it can be used with its callgrind tool as follows:
```
valgrind --tool=callgrind program args
```
You can visualise the results with the kcachegrind tool. On Mac you can install a Qt version with brew install qcachegrind. Some of the well-known Valgrind tools:
- Memcheck detects memory management problems primarily for C and C++ programs such as memory leak detection, invalid memory access, uninitialized reads, mismatched allocation/deallocation.
- Cachegrind is a cache profiler for analysing cache performance and identifying cache misses.
- Callgrind is a profiler that provides detailed information about function call times and call graphs. A useful blog for profiling C++ code with callgrind.
- Helgrind is a thread debugger which finds data races in multithreaded programs.
pprof: It is a tool for visualisation and analysis of profiling data. It reads a collection of profiling samples in profile.proto format and generates reports to visualise and help analyse the data. It can generate both text and graphical reports.
gperftools: It is a collection of a high-performance multi-threaded malloc() implementation, plus some pretty nifty performance analysis tools. Note: For a known Mac issue, see gperftools/gperftools#1292 and https://gperftools.github.io/gperftools/cpuprofile.html.
OProfile: It includes a statistical profiler for Linux systems, capable of profiling all running code at low overhead.
dtrace: It can be used to get a global overview of a running system, such as the amount of memory, CPU time, filesystem and network resources used by the active processes. It can also provide much more fine-grained information, such as a log of the arguments with which a specific function is being called, or a list of the processes accessing a specific file. It requires root privileges on Mac. xctrace is Mac-only and is likely to replace dtrace on that platform.

Commercial#

Linaro Forge DDT/MAP/PR: Linaro Forge provides easy-to-use and efficient tools for parallel debugging, profiling and performance reports. Linaro Forge DDT is an advanced debugger designed to simplify troubleshooting and code optimisation. Linaro Forge MAP is an easy-to-use profiling tool that can show metrics such as processor, memory, communication and I/O usage. Linaro Forge PR analyses specific performance aspects of the application and produces a simple one-page HTML report highlighting issues and hints on how to improve them.
VAMPIR: VAMPIR is an interactive graphical tool for trace visualisation and analysis. It reads trace files in OTF2 format, which can be produced by Score-P. The tool provides a comprehensive suite of features with special focus on highly parallel applications, from message passing to multithreaded and accelerator-based paradigms.
TotalView: TotalView is a powerful tool used for debugging and analysing both serial and parallel applications. It provides graphical, command-line and script-based environments for debugging.

Vendor Tools#

Intel® Toolkits: Intel oneAPI Base Toolkit is a unified programming environment for x86-based processors that provides a single, cross-architecture programming model for Intel CPUs, GPUs, FPGAs, and other accelerators. It is not primarily a performance analysis tool but it includes several components that can be used for performance analysis. Some of them:
- Intel® VTune™ Profiler: It supports profiling for CPU, GPU, FPGA, threading, memory, cache, storage and power. Profiling data can be viewed in architecture diagrams, as a histogram, or on a timeline. The types of analysis include hotspot analysis, microarchitecture, parallelism, platform analysis, power usage and low-level counters. Please see this page for more details. This tool was used in several ICHEC National and Academic Flagship projects during the Kay era.
- Intel® Advisor: It helps to identify issues related to threading and vectorisation. It provides a Memory Access Patterns (MAP) report, data dependency analysis and survey hotspots analysis, and illustrates the roofline performance model. Like many other Intel performance tools, it offers both a command-line interface and a graphical user interface. Please see this page for more details.
- Intel® System Debugger: Intel debugger IDB, Intel Distribution for GDB, is a simple tool for quick debugging of parallel applications. Intel® System Debugger enables deep analysis of system hardware at chip level. Please see this page for more details.
NVIDIA Performance Analysis Tools: NVIDIA Developer Tools include two main performance analysis tools: NVIDIA Nsight Systems and NVIDIA Nsight Compute.
- NVIDIA Nsight Systems provides a system-wide visualisation of an application’s performance. Data is collected from the command-line interface and can then be copied to any system and analysed later with the Nsight Systems visualisation tools suitable for the host architecture. It visualises system workload metrics on a timeline and offers tools to identify, understand, and resolve performance bottlenecks. See this page for a tutorial series on how to use the profiler and optimise communication, memory allocation etc. for different platforms.
- NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. It provides targeted metric sections for various performance aspects and presents them in tables and charts. These metrics detail the workload on the GPU, including the number of instructions executed and the distribution of work across different SMs; they also describe the memory access patterns of the application and provide insights into the GPU scheduler’s behavior, including task scheduling, warp scheduling, and occupancy management. The tool highlights the locations of performance bottlenecks within the source code and offers practical solutions. In addition, its baseline feature allows different versions of the same kernel to be compared within the tool. ICHEC performance engineering projects benefit greatly from Nsight Compute for kernel optimisation.

Python Profiling Tools#

Benchmarking#

pyperf: For accurate and repeatable benchmarking, the pyperf module (a third-party package installable with pip) can be used. Basic timing methods can be unreliable due to factors like system noise, background processes, or CPU frequency scaling. pyperf mitigates these issues by providing more precise timing and reducing external interference. To prepare your system for benchmarking, run:
```
python -m pyperf system tune
```
To restore the default CPU settings afterward, use: python -m pyperf system reset. You can use pyperf directly from the command line, or import it into a script. In scripts, you can create a Runner object to configure and execute your benchmarks programmatically.
time: There are several ways to measure execution time in Python, depending on your needs. One is the Linux time command, which measures how long a Python script takes to run. Simply run:
```
time python <script.py>
```
This is a quick and simple way to get timing information. It reports real (total wall-clock time from start to finish), user (CPU time spent in user mode) and sys (CPU time spent in kernel mode).

For basic timing in code, you can use time.time():
```
import time

start = time.time()
# your code
end = time.time()
print("Elapsed:", end - start)
```
This returns the wall-clock time in seconds since the epoch. However, it’s not recommended for precise comparisons since it can be affected by system clock changes. Other useful functions in the time module:
- time.process_time(): CPU time, excludes sleep.
- time.perf_counter(): High-resolution timer, includes sleep.
- time.monotonic(): Always moves forward, unaffected by clock adjustments.
Use time.perf_counter() for microbenchmarking. It returns the value of a performance counter with the highest available resolution for measuring short durations. It includes time elapsed during sleep and is system-wide, so it is well suited to timing small code blocks.
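The difference between the wall-clock and CPU-time clocks can be seen in a small sketch like the one below. The helper function busy_then_sleep and the chosen sleep duration are hypothetical and only serve to separate the two measurements.

```python
import time


def busy_then_sleep(sleep_seconds=0.1):
    """Do a little CPU work, then sleep, so the two clocks diverge."""
    total = sum(i * i for i in range(200_000))  # CPU work
    time.sleep(sleep_seconds)                   # idle: uses no CPU time
    return total


wall_start = time.perf_counter()
cpu_start = time.process_time()

busy_then_sleep()

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"perf_counter (wall clock, includes sleep): {wall_elapsed:.4f} s")
print(f"process_time (CPU time, excludes sleep):   {cpu_elapsed:.4f} s")
```

A large gap between the two values usually indicates time spent waiting (sleep, I/O, locks) rather than computing.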

timeit is another module that comes with the Python standard library. Like time, it can be used from the terminal (python -m timeit) as well as in a script with timeit.timeit(). It uses time.perf_counter() by default. It runs the code many times to reduce noise: timeit.timeit() returns the total time for all runs, while the command-line interface reports the best time per loop. It temporarily disables garbage collection to avoid measurement noise.
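As a minimal sketch of in-script use, the example below times a small function with timeit.timeit() and timeit.repeat(). The function join_numbers and the run counts are hypothetical choices for illustration.

```python
import timeit


def join_numbers(n=100):
    """Small workload to benchmark (hypothetical example)."""
    return "-".join(str(i) for i in range(n))


# timeit.timeit returns the TOTAL time for `number` executions,
# so divide by `number` to get a mean time per call.
total = timeit.timeit(join_numbers, number=1000)
print(f"total for 1000 runs: {total:.4f} s, mean per call: {total / 1000:.2e} s")

# timeit.repeat repeats the whole measurement several times.
# The minimum is commonly taken as the least noisy estimate.
runs = timeit.repeat(join_numbers, number=1000, repeat=5)
best_per_call = min(runs) / 1000
print(f"best of {len(runs)} repeats: {best_per_call:.2e} s per call")
```

Passing a callable avoids having to build statement and setup strings. For one-line expressions the command-line form python -m timeit is often more convenient.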

Profiling#

Python comes with two built-in tools for profiling code: profile and cProfile. Both profilers are different implementations of the same profiling interface. profile is the original pure-Python module but has significant overhead compared to cProfile. cProfile, on the other hand, is a C extension with less overhead and is therefore preferred for programs with a longer runtime. For memory usage profiling, Python has the built-in tracemalloc module.

cProfile: cProfile is easy to use because the program code does not need to be modified in advance. You can run it directly from the command line. By default it writes the results to standard output; the optional -o flag writes them to a file instead.
```
python -m cProfile [-o <output.prof>] <script.py>
```
It outputs total number of function calls, primitive calls and total time at the top. The column headings are ncalls: number of times the function is called; tottime: total time spent in the given function (it doesn’t include sub-function calls); percall: tottime/ncalls; cumtime: cumulative time spent in this and all subfunctions; percall: cumtime/ncalls; filename:lineno(function): where this function is defined. The output is ordered by cumulative time.

Once the profile is saved to a file, you can view and analyse it using the pstats module. When used from command line, it launches an interactive shell where you can explore the profiling data.
```
python -m pstats <output.prof>
output.prof% sort cumtime
output.prof% sort time
output.prof% stats 10
output.prof% quit
```
Or use it in the script.
```
import pstats

stats = pstats.Stats("output.prof")
stats.strip_dirs()
stats.sort_stats("cumtime").print_stats(10)
stats.sort_stats("time").print_stats(10)
stats.print_stats(10)
```
cProfile can also be added to the code. See this handbook page for an example.
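As a minimal sketch of profiling from within the code, the example below wraps a region of interest with cProfile.Profile and prints a short report through pstats. The functions slow_square_sum and workload are hypothetical stand-ins for real application code.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    """Hypothetical hot function."""
    return sum(i * i for i in range(n))


def workload():
    """Hypothetical region of interest."""
    return [slow_square_sum(20_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()      # start collecting data
workload()
profiler.disable()     # stop collecting data

# Format the five entries with the largest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumtime").print_stats(5)
report = buffer.getvalue()
print(report)
```

Profiling only the region of interest keeps start-up and import costs out of the report. Calling profiler.dump_stats() with a file name would save the data for later analysis with pstats.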
tracemalloc: It helps trace memory allocations, identify where memory is being used, and detect potential leaks by comparing snapshots. To trace most allocations, start tracemalloc as early as possible by setting the PYTHONTRACEMALLOC environment variable to 1, or by using the -X tracemalloc command line option. More detailed tracebacks can be obtained by storing more frames per allocation, for example with PYTHONTRACEMALLOC=25. The following example can be used to display the top 10 memory-consuming lines.
```
import tracemalloc

tracemalloc.start()

# run your code

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

for stat in top_stats[:10]:
    print(stat)
```
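Snapshot comparison, mentioned above as a way to detect potential leaks, can be sketched as follows. The growing list named retained is a hypothetical stand-in for memory that an application keeps holding on to.

```python
import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()

# Hypothetical "leak": about 1 MiB of data that stays referenced.
retained = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Differences are grouped by source line and sorted so that the
# largest changes in allocated size come first.
diff = after.compare_to(before, "lineno")
for stat in diff[:5]:
    print(stat)
```

Lines whose allocated size keeps growing between snapshots taken at comparable points in the program are the usual candidates for a leak.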

Other profiling tools that you can install with pip:

line_profiler: It is a module for doing line-by-line profiling of functions. It returns a line-by-line breakdown of where time is spent. To use line_profiler from the command line, the functions to be profiled need to be explicitly marked in the script with the @line_profiler.profile decorator. Set the environment variable LINE_PROFILE=1 to enable. Profiling is performed with kernprof.
```
kernprof -lvr <script.py>
```
The -l flag enables line-by-line profiling of the decorated functions, the -v flag shows the result on standard output, and the -r flag produces rich-formatted output.
py-spy: It is a fast, sampling based profiler written in Rust. It runs without any code modification.
yep: It is a utility that profiles C/C++ functions made inside python C extensions. It uses the gperftools profiler underneath and depends on pprof for visualisation.
memory_profiler: It tracks memory usage line by line and is simple to use with the @profile decorator. It has not been maintained since November 2022.
guppy3: It inspects memory usage in CPython based applications. It contains subpackages: etc, gsl, heapy and sets.
Fil: It is an open source memory profiler designed for data processing applications.
austin: It is a frame-level sampling profiler for CPython written in pure C. It can also track memory usage and has special support for multi-process applications.

Visualising performance reports#

There are several tools that can be used to view the resulting output.

SnakeViz: It is a browser-based visualiser for cProfile output. It displays an interactive call graph and sunburst view.
```
snakeviz <output.prof>
```
It shows hierarchical function calls in a horizontal or circular layout. It’s fully interactive. You can click any segment to zoom in, focus on specific paths, and explore which parts of your code dominate execution time.
viztracer: It is a tracing tool. It captures function call timings, arguments and return values. It generates interactive HTML reports and timelines. The following will create a file named result.json, which opens as an interactive report in the browser.
```
viztracer <script.py>
```
```
vizviewer <result.json>
```
gprof2dot: It can be used to generate a visual call graph image. The first command converts profiler output into a Graphviz DOT format. Then you can use the dot command to generate an image. (Note that Graphviz must be installed separately.)
```
gprof2dot -f pstats <output.prof> -o <callgraph.dot>
```
```
dot -Tpng <callgraph.dot> -o <output.png>
```
pyCallGraph: It creates call graph visualisations. It groups application modules based on execution time, function calls, memory usage. It is not maintained anymore.
Tuna: It is an alternative to SnakeViz for visualising cProfile output. It handles runtime and import profiles, has minimal dependencies, and uses d3 and bootstrap.
```
tuna <output.prof>
```
Speedscope: It is a fast, interactive viewer for flamegraphs. Flamegraphs show function call hierarchies and hot paths in a visual tree format.

AI Workflows#

The tools listed above that support Python, such as

Score-P or Extrae for profiling/tracing
Vampir or Paraver for visualisation
LIKWID or PAPI for measuring hardware/software counters
Linaro Forge tools

or the Python profiling tools can be used for Python-based AI workflows. In particular, the Python profiler cProfile is a good starting point for gaining insight into performance bottlenecks. For AI applications that use machine learning frameworks, it is similarly best to start with the framework's built-in profiler. PyTorch is one such framework, used in the ICHEC SEODA project. See the handbook page for PyTorch profiling, which provides instructions on how to collect performance data on both CPU and GPU. Data can then be viewed with TensorBoard or other third-party trace viewers. When targeting GPU-specific performance analysis, the NVIDIA Performance Analysis Tools offer a comprehensive suite of features for optimising GPU-accelerated HPC AI workflows.