Performance#
Tools#
This section lists performance engineering tools. It covers the tools provided by VI-HPS or taught in the EuroCC trainings, as well as tools used in ICHEC projects. Many other options are available for HPC applications.
Tools can be categorised by purpose:
Performance analysis tools provide information about the runtime behavior of an application, its performance bottlenecks and its optimisation potential.
Debugging tools investigate an application for possible errors to help fix them.
Correctness checking tools detect errors in the use of programming models, such as data races in OpenMP.
Tools can also be categorised by the parallel programming models or platforms they support, by compiler support, and by the way they are built and run on HPC clusters. The VI-HPS tools guide lists the properties of the performance tools developed by their partner institutions and specifies the focus categories for each. These tools mostly support traditional HPC programming models. See this topic for information about performance engineering tools for HPC AI Workflows.
VI-HPS Performance Tools#
VI-HPS is a collaborative research initiative focused on developing advanced programming tools and techniques for HPC. The main tools introduced in the 46th VI-HPS Tuning Workshop are:
Score-P/Scalasca/CUBE
: This is an integrated performance analysis toolkit developed by JSC (Jülich Supercomputing Centre). Score-P collects performance data in profiles and execution traces, Scalasca analyses the data and identifies performance issues, and CUBE visualises selected metrics and their distribution across functions and resources. Scalasca and Score-P write analysis reports in CUBE format, and these reports can also be viewed by some other tools such as Vampir and TAU.
Extrae/Paraver/Dimemas
: These are performance analysis tools developed by BSC (Barcelona Supercomputing Centre). Extrae captures detailed execution traces and stores the collected information in three different files. Paraver reads these files and provides powerful visualisation and analysis capabilities to help identify performance bottlenecks and optimise parallel code. Dimemas is a simulator used to predict the application’s behavior on different systems.
MUST/Archer/OTF-CPT
: The correctness analysis tools MUST and Archer and the performance analysis tool OTF-CPT were developed by the IT Center of RWTH Aachen University. Archer is a dynamic data race detector for OpenMP programs. MUST is a runtime error detection tool for MPI applications. OTF-CPT is a lightweight performance analysis tool that reports a summary of POP metrics for MPI+OpenMP applications.
MAQAO
: The MAQAO toolkit was developed by the performance tools team at UVSQ (LI-PaRAD Laboratory). It performs performance analysis at the binary level, with a focus on core performance. It comes with six main modules:
LProf (Lightweight Profiler) as a sampling-based profiler that provides a list of hot spots (loops and/or functions) collected during program execution;
CQA (Code Quality Analyzer) as a static analyser assessing the quality of the executable and producing a set of reports describing potential issues, estimations of the gain if fixed, and hints on how to achieve this;
ONE View as a module that invokes other modules and aggregates their results to produce reports in HTML or XLS format;
VProf as a value profiler relying on instrumentation through binary rewriting;
DECAN (DECremental ANalyzer) as a module using differential analysis on innermost loops to locate performance issues;
ASSIST as a prototype code restructuring tool implementing advanced Profile Guided Optimisation techniques.
CARM
: CARM is a performance analysis tool that focuses on roofline modelling, developed by INESC-ID, Instituto Superior Técnico, Universidade de Lisboa. Roofline modelling visualises the performance limits of an application based on the memory bandwidth and peak floating-point performance of the system it runs on. It also provides recommendations for optimising the application’s performance. This paper explains roofline modelling.
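As a rough illustration of the roofline idea described for CARM, the sketch below computes the classic single-roof bound: attainable performance is the smaller of the peak floating-point rate and the product of arithmetic intensity and memory bandwidth. This is not CARM's own implementation, which models several memory levels; the function names and the machine numbers are hypothetical and chosen only for the example.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Classic roofline bound in GFLOP/s.

    arithmetic_intensity: floating-point operations per byte moved (FLOP/byte)
    peak_gflops:          peak floating-point performance (GFLOP/s)
    bandwidth_gbs:        sustained memory bandwidth (GB/s)
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the memory roof meets the compute roof."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine parameters (assumptions, not measurements).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# Hypothetical kernels with low and high arithmetic intensity.
kernels = [("low-intensity kernel", 0.25), ("high-intensity kernel", 50.0)]

for name, intensity in kernels:
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    if intensity < ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS):
        regime = "memory-bound"
    else:
        regime = "compute-bound"
    print(f"{name}: at most {bound:.1f} GFLOP/s ({regime})")
```

A kernel whose arithmetic intensity lies below the ridge point is limited by memory bandwidth, so optimisation effort is usually better spent on data movement than on arithmetic.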
These tools are freely available for download and use in applications. Please see the VI-HPS tools guide for supported languages and platforms.
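OTF-CPT, listed above, reports a summary of POP metrics. As a minimal sketch of what the top-level metrics express, the example below computes load balance, communication efficiency and parallel efficiency from per-process useful computation times, assuming the commonly used definitions (load balance = average / maximum useful time, communication efficiency = maximum useful time / runtime). The function name and the timings are hypothetical, and the tool itself derives these values from measurements rather than from hand-entered numbers.

```python
def pop_efficiencies(useful_times, runtime):
    """Return top-level POP-style efficiencies.

    useful_times: useful computation time of each process, in seconds
    runtime:      total wall-clock runtime of the parallel region, in seconds
    """
    avg_useful = sum(useful_times) / len(useful_times)
    max_useful = max(useful_times)

    load_balance = avg_useful / max_useful
    communication_efficiency = max_useful / runtime
    # Parallel efficiency is the product of the two factors above.
    parallel_efficiency = load_balance * communication_efficiency

    return {
        "load_balance": load_balance,
        "communication_efficiency": communication_efficiency,
        "parallel_efficiency": parallel_efficiency,
    }


# Hypothetical timings for four processes (assumptions for illustration).
metrics = pop_efficiencies([8.0, 6.0, 4.0, 6.0], runtime=16.0)
for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.3f}")
```

A low load balance points at uneven work distribution, while a low communication efficiency points at time lost in communication or synchronisation.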
Community Tools#
LIKWID:
LIKWID is a lightweight command-line performance tool suite for the GNU/Linux operating system. The likwid-topology tool provides a detailed view of the thread and cache topology of a single node. This is very handy because it is important to understand the hardware architecture before analysing the performance of parallel applications. likwid-perfctr measures hardware performance counters for the application. likwid-pin controls the affinity of tasks to specific CPU cores or NUMA nodes. The ICHEC handbook has a section about likwid, created in 2021, which explains the likwid tools in detail. For more tools and more up-to-date information, please check the official website.
TAU:
TAU is a profiling and tracing tool. It gathers performance data through an automatic instrumentor based on the Program Database Toolkit (PDT), dynamically using DyninstAPI, at runtime in the Java Virtual Machine, or manually using the instrumentation API. Collected data can be visualised in the paraprof profiler. Event traces can be displayed with the JumpShot trace visualisation tool.
PAPI:
PAPI is a library that provides an interface to hardware performance counters found across the system, in components such as CPUs, GPUs, memory, interconnects, the I/O system, and energy and power sensors. Hardware counters, such as CPU cycles, cache misses, and branch mispredictions, provide valuable information about code sections that can be improved. While it can be used as a standalone tool, many performance analysis tools, such as Scalasca, Score-P and TAU, use it as a backend to collect performance data.
DARSHAN:
DARSHAN is a lightweight I/O profiler for frameworks such as MPI-IO, HDF5, PNetCDF, and standard POSIX calls. It collects performance data through I/O instrumentation at link-time (for static and dynamic executables) or at runtime and produces a log that can be viewed with DARSHAN commands.
Valgrind:
Valgrind is an instrumentation framework for building debugging and profiling tools. Some of its well-known tools are:
Memcheck detects memory management problems, primarily in C and C++ programs, such as memory leaks, invalid memory accesses, uninitialised reads and mismatched allocation/deallocation.
Cachegrind is a cache profiler for analysing cache performance and identifying cache misses.
Callgrind is a profiler that provides detailed information about function call times and call graphs.
Helgrind is a thread debugger that finds data races in multithreaded programs.
Other Valgrind tools can be found on the official webpage.
Commercial#
Linaro Forge DDT/MAP/PR:
Linaro Forge provides easy-to-use and efficient tools for parallel debugging, profiling and performance reports. Linaro Forge DDT is an advanced debugger designed to simplify troubleshooting and code optimisation. Linaro Forge MAP is an easy-to-use profiling tool that can show processor, memory, communication and I/O metrics. Linaro Forge PR analyses specific performance aspects of the application and produces a simple one-page HTML report highlighting issues and hints on how to address them.
VAMPIR:
VAMPIR is an interactive graphical tool for trace visualisation and analysis. It reads trace files in OTF2 format, which can be produced by Score-P. The tool provides a comprehensive suite of features with a special focus on highly parallel applications, from message passing to multithreaded and accelerator-based paradigms.
TotalView:
TotalView is a powerful tool for debugging and analysing both serial and parallel applications. It provides graphical and command-line interfaces, as well as script-based environments for debugging.
Vendor Tools#
Intel® Toolkits: Intel oneAPI Base Toolkit is a unified programming environment for x86-based processors that provides a single, cross-architecture programming model for Intel CPUs, GPUs, FPGAs, and other accelerators. It is not primarily a performance analysis tool, but it includes several components that can be used for performance analysis. Some of them are:
Intel® VTune™ Profiler
: It supports profiling of CPU, GPU, FPGA, threading, memory, cache, storage and power. Profiling data can be viewed in architecture diagrams, as a histogram, or on a timeline. The available analysis types are hotspot analysis, microarchitecture, parallelism, platform analysis, power usage and low-level counters. Please see this page for more details. This tool was used in several ICHEC National and Academic Flagship projects during the Kay era.
Intel® Advisor
: It helps to identify issues related to threading and vectorisation. It provides a Memory Access Patterns (MAP) report, data dependency analysis and survey hotspots analysis, and it illustrates the roofline performance model. Like many other Intel performance tools, it offers both a command-line interface and a graphical user interface. Please see this page for more details.
Intel® System Debugger
: The Intel debugger IDB, Intel Distribution for GDB, is a simple tool for quick debugging of parallel applications. Intel® System Debugger enables deep analysis of system hardware at chip level. Please see this page for more details.
NVIDIA Performance Analysis Tools: The two main performance analysis tools in NVIDIA Developer Tools are NVIDIA Nsight Systems and NVIDIA Nsight Compute.
NVIDIA Nsight Systems provides a system-wide visualisation of an application’s performance. Data is collected from the command-line interface and can then be copied to any system and analysed later with the Nsight Systems visualisation tools suitable for the host architecture. It visualises system workload metrics on a timeline and offers tools to identify, understand, and resolve performance bottlenecks. See this page for a tutorial series on how to use the profiler and optimise communication, memory allocation etc. for different platforms.
NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. It provides targeted metric sections for various performance aspects and presents them in tables and charts. These metrics detail the workload on the GPU, including the number of instructions executed and the distribution of work across different SMs, examine the memory access patterns of the application, and provide insights into the GPU scheduler’s behavior, including task scheduling, warp scheduling, and occupancy management. It highlights the locations of performance bottlenecks within the source code and offers practical solutions. In addition, the baseline feature allows different versions of the same kernel to be compared within the tool. ICHEC performance engineering projects benefit a lot from Nsight Compute for kernel optimisation.
AI Workflows#
The tools listed above that support Python, such as
Score-P or Extrae for profiling/tracing,
Vampir or Paraver for visualisation,
LIKWID or PAPI for measuring hardware/software counters, and
Linaro Forge tools,
can be used for Python-based AI Workflows. In particular, the Python profiler cProfile could be a starting point to gain insights into performance bottlenecks. Please see the handbook page for Python profiling. Similarly, for AI applications using machine learning frameworks, it’s best to start with the framework’s built-in profiler. PyTorch is one of these frameworks and is used in the ICHEC SEODA project. See the handbook page for PyTorch profiling. It provides instructions on how to collect performance data on both CPU and GPU. Data can then be viewed with tensorboard or other third-party trace viewer tools. When targeting GPU-specific performance analysis, NVIDIA Performance Analysis Tools offer a comprehensive suite of features for optimising GPU-accelerated HPC AI Workflows.
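As a starting point of the kind suggested above, the following minimal sketch profiles a small function with cProfile and prints the most expensive calls sorted by cumulative time. The workload functions are hypothetical stand-ins for a real data-loading or preprocessing step; only the standard-library modules cProfile, pstats and io are used.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately simple hot loop (hypothetical stand-in for real work)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    """Hypothetical workload that calls the hot function repeatedly."""
    return [slow_sum_of_squares(20_000) for _ in range(20)]


def profile_workload():
    """Run the workload under cProfile and return its result and a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = workload()
    profiler.disable()

    # Collect the report in memory, sorted by cumulative time, top 5 entries.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_workload()
print(report)
```

For a whole script, the same information can be obtained without code changes by running `python -m cProfile -s cumulative script.py`, where `script.py` is a placeholder for the program being analysed.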