Performance#
Tools#
This section lists performance engineering tools.
It covers the tools provided by VI-HPS or taught in the EuroCC trainings, tools used in ICHEC projects, and Python-specific performance tools. Many other options are available for HPC applications.
Tools can be categorised by purpose:
Performance analysis tools provide information about the runtime behavior of an application, its performance bottlenecks and its optimisation potential.
Debugging tools examine an application for possible errors and help to fix them.
Correctness checking tools detect errors in the usage of programming models, such as data races in OpenMP.
Tools can also be categorised by the parallel programming models and platforms they support, by compiler support, and by the way they are built and run on HPC clusters. The VI-HPS tools guide lists the properties of the performance tools developed by its partner institutions and specifies the focus categories for each.
VI-HPS Performance Tools#
VI-HPS is a collaborative research initiative focused on developing advanced programming tools and techniques for HPC. The main tools introduced in the 46th VI-HPS Tuning Workshop are:
Score-P/Scalasca/CUBE: This is an integrated performance analysis toolkit developed by JSC (Jülich Supercomputing Centre). Score-P collects performance data in profiles and execution traces, Scalasca analyses the data and identifies performance issues, and CUBE visualises selected metrics and their distribution across functions and resources. Scalasca and Score-P write analysis reports in CUBE format, and these reports can also be viewed by other tools such as Vampir and TAU.
Extrae/Paraver/Dimemas: These are performance analysis tools developed by BSC (Barcelona Supercomputing Centre). Extrae captures detailed execution traces and stores the collected information in three different files. Paraver reads these files and provides powerful visualisation and analysis capabilities to help identify performance bottlenecks and optimise parallel code. Dimemas is a simulator used to predict the application’s behavior on different systems.
MUST/Archer/OTF-CPT: The correctness analysis tools MUST and Archer and the performance analysis tool OTF-CPT were developed by the IT Center of RWTH Aachen University. Archer is a dynamic data race detector for OpenMP programs. MUST is a runtime error detection tool for MPI applications. OTF-CPT is a lightweight performance analysis tool that reports a summary of POP metrics for MPI+OpenMP applications.
MAQAO: The MAQAO toolkit was developed by the performance tools team at UVSQ (LI-PaRAD Laboratory). It carries out performance analysis at binary level, with a focus on core performance. It comes with six main modules:
LProf (Lightweight Profiler), a sampling-based profiler that provides a list of hot spots (loops and/or functions) collected during program execution;
CQA (Code Quality Analyzer), a static analyser that assesses the quality of the executable and produces a set of reports describing potential issues, estimates of the gain if they are fixed, and hints on how to achieve this;
ONE View, a module that invokes other modules and aggregates their results to produce reports in HTML or XLS format;
VProf, a value profiler relying on instrumentation through binary rewriting;
DECAN (DECremental ANalyzer), a module using differential analysis on innermost loops to locate performance issues;
ASSIST, a prototype code restructuring tool implementing advanced Profile Guided Optimisation techniques.
CARM: It is a performance analysis tool that focuses on roofline modelling, developed by INESC-ID, Instituto Superior Técnico, Universidade de Lisboa. Roofline modelling visualises the performance limits of an application that are set by the memory bandwidth and the peak floating-point performance of the hardware. It also provides recommendations for optimising the application’s performance. This paper explains roofline modelling.
These tools are freely available for download and use in applications. Please see the VI-HPS tools guide for supported languages and platforms.
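The roofline idea mentioned for CARM can be illustrated with a short calculation. The sketch below assumes the basic single-roof form of the model, in which attainable performance is the smaller of the peak floating-point rate and the product of memory bandwidth and arithmetic intensity. The function names and the machine figures are hypothetical and chosen only for illustration; they are not taken from CARM or from any real system.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Basic roofline bound in GFLOP/s.

    arithmetic_intensity is the number of floating-point operations
    performed per byte moved to or from memory (FLOP/byte).
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the memory roof meets the compute roof."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

for intensity in (0.25, 1.0, 10.0, 40.0):
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    if intensity < ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS):
        regime = "memory-bound"
    else:
        regime = "compute-bound"
    print(f"intensity={intensity:6.2f} FLOP/byte  bound={bound:7.1f} GFLOP/s  ({regime})")
```

A kernel whose arithmetic intensity lies left of the ridge point is limited by memory traffic, so reducing data movement is likely to help more than adding vectorisation.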
Community Tools#
Note: Many of the profiling tools and approaches in this section support Unix systems. However, recent macOS releases have made it challenging to use any profiling tool other than the Instruments application, which has several limitations. This section therefore focuses mostly on profiling on Linux systems and on profiling remote codes, which is likely to be necessary when working on a Mac.
gprof, the ‘GNU profiler’, is a widely used profiler for C and C++ applications on Unix systems. Compiling with the -pg flag enables gprof instrumentation, and debugging symbols (-g) are also useful for line-by-line profiling:
gcc hello.c -o hello -g -pg
Running the instrumented program then creates a file gmon.out, which can be post-processed with the gprof tool:
gprof hello gmon.out > profile_output
You can then take a look at the output:
nano profile_output
LIKWID:
LIKWID is a lightweight command line performance tool suite for the GNU/Linux operating system. The likwid-topology tool provides a detailed view of the thread and cache topology of a single node. This is very handy because it is important to understand the hardware architecture before analysing the performance of parallel applications. likwid-perfctr measures hardware performance counters for the application. likwid-pin controls the affinity of tasks to specific CPU cores or NUMA nodes. The ICHEC handbook has a section about likwid, created in 2021, which explains the likwid tools in detail. For more tools and more up-to-date information, please check the official website.
TAU:
TAU is a profiling and tracing tool. It gathers performance data through an automatic instrumentor based on the Program Database Toolkit (PDT), dynamically using DyninstAPI, at runtime in the Java Virtual Machine, or manually using the instrumentation API. Collected data can be visualised in the paraprof profiler. Event traces can be displayed with the JumpShot trace visualisation tool.
PAPI:
PAPI is a library that provides an interface to hardware performance counters found across the system, such as in CPUs, GPUs, memory, interconnects, the I/O system, and energy and power components. Hardware counters, such as CPU cycles, cache misses and branch mispredictions, provide valuable information about code sections that can be improved. While it can be used as a standalone tool, many performance analysis tools, such as Scalasca, Score-P and TAU, use it as a backend to collect performance data.
DARSHAN:
DARSHAN is a lightweight I/O profiler for frameworks such as MPI-IO, HDF5 and PNetCDF, and for standard POSIX calls. It collects performance data through I/O instrumentation at link time (for static and dynamic executables) or at runtime and produces a log that can be viewed with DARSHAN commands.
Valgrind:
Valgrind is an instrumentation framework for building debugging and profiling tools. For profiling, it can be used with its callgrind tool as follows:
valgrind --tool=callgrind program args
You can visualise the results with the kcachegrind tool. On Mac you can install a Qt version with brew install qcachegrind. Some of the well-known Valgrind tools:
Memcheck detects memory management problems, primarily in C and C++ programs, such as memory leaks, invalid memory accesses, uninitialised reads and mismatched allocation/deallocation.
Cachegrind is a cache profiler for analysing cache performance and identifying cache misses.
Callgrind is a profiler that provides detailed information about function call times and call graphs. A useful blog for profiling C++ code with callgrind.
Helgrind is a thread debugger which finds data races in multithreaded programs.
pprof: It is a tool for visualisation and analysis of profiling data. It reads a collection of profiling samples in profile.proto format and generates reports to visualise and help analyse the data. It can generate both text and graphical reports.
gperftools: It is a collection of a high-performance multi-threaded malloc() implementation plus some useful performance analysis tools. Note: for the Mac issue, see gperftools/gperftools#1292 and https://gperftools.github.io/gperftools/cpuprofile.html.
OProfile: It includes a statistical profiler for Linux systems, capable of profiling all running code at low overhead.
dtrace: It can be used to get a global overview of a running system, such as the amount of memory, CPU time, filesystem and network resources used by the active processes. It can also provide much more fine-grained information, such as a log of the arguments with which a specific function is being called, or a list of the processes accessing a specific file. It requires root privileges on Mac.
xctrace: It is Mac only and is likely to replace dtrace there.
Commercial#
Linaro Forge DDT/MAP/PR:
Linaro Forge provides easy-to-use and efficient tools for parallel debugging, profiling and performance reports. Linaro Forge DDT is an advanced debugger designed to simplify troubleshooting and code optimisation. Linaro Forge MAP is an easy-to-use profiling tool that can show metrics such as processor, memory, communication and I/O. Linaro Forge PR provides analysis of specific performance aspects of the application and produces a simple one-page HTML report highlighting issues and hints on how to resolve them.
VAMPIR:
VAMPIR is an interactive graphical tool for trace visualisation and analysis. It reads trace files in OTF2 format, which can be produced by Score-P. The tool provides a comprehensive suite of features with a special focus on highly parallel applications, from message passing to multithreaded and accelerator-based paradigms.
TotalView:
TotalView is a powerful tool for debugging and analysing both serial and parallel applications. It provides graphical and command line interfaces as well as script-based environments for debugging.
Vendor Tools#
Intel® Toolkits: Intel oneAPI Base Toolkit is a unified programming environment for x86-based processors that provides a single, cross-architecture programming model for Intel CPUs, GPUs, FPGAs, and other accelerators. It is not primarily a performance analysis tool but it includes several components that can be used for performance analysis. Some of them:
Intel® VTune™ Profiler: It supports profiling for CPU, GPU, FPGA, threading, memory, cache, storage and power. Profiling data can be viewed in architecture diagrams, as a histogram, or on a timeline. Analysis types include hotspot analysis, microarchitecture, parallelism, platform analysis, power usage and low-level counters. Please see this page for more details. This tool was used in several ICHEC National and Academic Flagship projects during the Kay era.
Intel® Advisor: It helps to identify issues related to threading and vectorisation. It provides a Memory Access Patterns (MAP) report, data dependency analysis and survey hotspots analysis, and illustrates the roofline performance model. Like many other Intel performance tools, it offers both a command-line interface and a graphical user interface. Please see this page for more details.
Intel® System Debugger: The Intel debugger IDB, the Intel Distribution for GDB, is a simple tool for quick debugging of parallel applications. Intel® System Debugger enables deep analysis of system hardware at chip level. Please see this page for more details.
NVIDIA Performance Analysis Tools: NVIDIA Developer Tools include two main performance analysis tools: NVIDIA Nsight Systems and NVIDIA Nsight Compute.
NVIDIA Nsight Systems provides a system-wide visualisation of an application’s performance. Data is collected from the command-line interface and can then be copied to any system and analysed later with the Nsight Systems visualisation tools suitable for the host architecture. It visualises system workload metrics on a timeline and offers tools to identify, understand and resolve performance bottlenecks. See this page for a tutorial series on how to use the profiler and optimise communication, memory allocation etc. for different platforms.
NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. It provides targeted metric sections for various performance aspects and presents them in tables and charts. These metrics detail the workload on the GPU, including the number of instructions executed and the distribution of work across different SMs; examine the memory access patterns of the application; and provide insights into the GPU scheduler’s behavior, including task scheduling, warp scheduling and occupancy management. It highlights the locations of performance bottlenecks within the source code and offers practical solutions. In addition, the baseline feature of the tool makes it possible to compare different versions of the same kernel. ICHEC performance engineering projects benefit a lot from Nsight Compute for kernel optimisation.
Python Profiling Tools#
Benchmarking#
pyperf: The pyperf package, which is installed separately from the Python standard library, supports accurate and repeatable benchmarking. Basic timing methods can be unreliable due to factors like system noise, background processes, or CPU frequency scaling. pyperf mitigates these issues by providing more precise timing and reducing external interference. To prepare your system for benchmarking, run:
python -m pyperf system tune
To restore the default CPU settings afterward, use:
python -m pyperf system reset
You can use pyperf directly from the command line, or import it into a script. In scripts, you can create a Runner object to configure and execute your benchmarks programmatically.
time: There are several ways to measure execution time in Python depending on your needs. One is the Linux time command, which measures how long a Python script takes to run. Simply run:
time python <script.py>
This is a quick and simple way to get timing information. It reports real: total wall-clock time (start to finish); user: CPU time spent in user mode; sys: CPU time spent in kernel mode.
For basic timing in code, you can use time.time():
import time
start = time.time()
# your code
end = time.time()
print("Elapsed:", end - start)
time.time() returns the wall-clock time in seconds since the epoch. However, it’s not recommended for precise comparisons since it can be affected by system clock changes. Other useful functions in the time module:
time.process_time(): CPU time, excludes sleep.
time.perf_counter(): High-resolution timer, includes sleep.
time.monotonic(): Always moves forward, unaffected by clock adjustments.
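The difference between the wall-clock and CPU timers listed above is easiest to see side by side. The following is a minimal sketch: the helper measure() and the two workloads sleepy() and busy() are hypothetical names introduced only for this illustration.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) for a single call of func."""
    wall_start = time.perf_counter()   # high-resolution wall clock, includes sleep
    cpu_start = time.process_time()    # CPU time of this process, excludes sleep
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def sleepy():
    """Spends its time waiting, not computing."""
    time.sleep(0.2)


def busy():
    """Spends its time computing."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


for name, func in (("sleepy", sleepy), ("busy", busy)):
    wall, cpu = measure(func)
    print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

For sleepy() the wall-clock time should be about 0.2 s while the CPU time stays close to zero; for busy() the two values should be similar.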
Use time.perf_counter() for microbenchmarking. It returns the value of a performance counter with the highest available resolution to measure a short duration. It includes time elapsed during sleep and is system-wide, so it is well suited to measuring small code blocks.
timeit is another module that comes with the Python standard library. Similar to time, it can be used from the terminal as well as in a script with timeit.timeit(). It uses time.perf_counter() by default. It runs the code many times: timeit.timeit() returns the total time for all runs, and the command line interface reports the best time per loop over several repetitions. It temporarily disables garbage collection to avoid measurement noise.
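As a small illustration of timeit used from a script, the sketch below times a hypothetical function join_numbers(); the function and the run counts are arbitrary choices for the example.

```python
import timeit


def join_numbers(n=1000):
    """Hypothetical workload: build a comma-separated string of integers."""
    return ",".join(str(i) for i in range(n))


NUMBER = 200  # calls per measurement

# timeit.timeit() returns the total time in seconds for NUMBER calls.
total = timeit.timeit(join_numbers, number=NUMBER)
print(f"mean per call: {total / NUMBER * 1e6:.1f} us")

# timeit.repeat() performs several measurements and returns a list of totals.
# The minimum is usually the value least disturbed by other system activity.
runs = timeit.repeat(join_numbers, number=NUMBER, repeat=5)
best = min(runs) / NUMBER
print(f"best per call: {best * 1e6:.1f} us")
```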
Profiling#
Python comes with two built-in tools to profile a code: profile and cProfile.
Both profilers are different implementations of the same profiling interface. profile is the original Python module but has significant overhead compared to cProfile. cProfile, on the other hand, is a C extension with less overhead and is therefore preferred for programs with a longer runtime. For memory usage profiling, Python has the built-in tracemalloc module.
cProfile: cProfile is easy to use because the program code does not need to be modified in advance. You can run it directly from the command line. It writes the results to standard output; the optional -o flag writes them to a file instead.
python -m cProfile [-o <output.prof>] <script.py>
The output starts with the total number of function calls, the number of primitive calls and the total time. The column headings are:
ncalls: number of times the function is called;
tottime: total time spent in the given function (it doesn’t include sub-function calls);
percall: tottime/ncalls;
cumtime: cumulative time spent in this and all subfunctions;
percall: cumtime/ncalls;
filename:lineno(function): where this function is defined.
The output is ordered by cumulative time.
Once the profile is saved to a file, you can view and analyse it using the pstats module. When used from the command line, it launches an interactive shell where you can explore the profiling data.
python -m pstats <output.prof>
output.prof% sort cumtime
output.prof% sort time
output.prof% stats 10
output.prof% quit
Or use it in a script.
import pstats
stats = pstats.Stats("output.prof")
stats.strip_dirs()  # drop leading path information from file names
stats.sort_stats("cumtime").print_stats(10)  # top 10 by cumulative time
stats.sort_stats("time").print_stats(10)  # top 10 by time spent in the function itself
stats.print_stats(10)
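To make the difference between tottime and cumtime concrete, the sketch below collects a profile inside the process and prints a pstats report without writing a file. The functions inner() and outer() are hypothetical workloads introduced for this illustration.

```python
import cProfile
import io
import pstats


def inner(n):
    """Does the actual arithmetic."""
    return sum(i * i for i in range(n))


def outer():
    """Does little work itself and delegates everything to inner()."""
    return [inner(20_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()    # start collecting
outer()
profiler.disable()   # stop collecting

# Write the report into a string buffer instead of standard output.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the report, outer() is expected to show a large cumtime but a small tottime, because nearly all of its time is spent in the functions it calls.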
cProfile can also be added to the code. See this handbook page for an example.
tracemalloc: It helps trace memory allocations, identify where memory is being used, and detect potential leaks by comparing snapshots. To trace most allocations, start tracemalloc as early as possible by setting the PYTHONTRACEMALLOC environment variable to 1, or by using the -X tracemalloc command line option. Deeper stack traces can be stored by setting PYTHONTRACEMALLOC to the number of frames to keep. The following example can be used to display the top 10 memory-consuming lines.
import tracemalloc
tracemalloc.start()
# run your code
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
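Comparing snapshots, as mentioned above, can be sketched as follows. The function leaky() is a hypothetical workload that keeps its allocations alive; the allocation sizes and the frame depth of 10 are arbitrary values picked for the example.

```python
import tracemalloc


def leaky(store, n):
    """Hypothetical workload: appends n 1 KiB buffers that are never released."""
    store.extend(bytearray(1024) for _ in range(n))


tracemalloc.start(10)  # keep up to 10 frames per allocation traceback
store = []

before = tracemalloc.take_snapshot()
leaky(store, 2000)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# compare_to() returns per-line differences, largest changes first.
diff = after.compare_to(before, "lineno")
for stat in diff[:3]:
    print(stat)

# Net growth in traced memory between the two snapshots, in bytes.
growth = sum(stat.size_diff for stat in diff)
print(f"net growth: {growth / 1024:.0f} KiB")
```

Lines that keep growing across repeated snapshots are the first candidates to inspect when looking for a leak.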
Other profiling tools that you can install with pip:
line_profiler: It is a module for line-by-line profiling of functions. It returns a line-by-line breakdown of where time is spent. To use line_profiler from the command line, the functions to be profiled need to be explicitly marked in the script with the @line_profiler.profile decorator. Set the environment variable LINE_PROFILE=1 to enable it. Profiling is performed with kernprof.
kernprof -lvr <script.py>
The -l flag ensures that the function is profiled line by line, the -v flag shows the result on standard output and the -r flag produces rich-formatted output.
py-spy: It is a fast, sampling-based profiler written in Rust. It runs without any code modification.
yep: It is a utility that profiles C/C++ code inside Python C extensions. It uses the gperftools profiler underneath and depends on pprof for visualisation.
memory_profiler: It tracks memory usage line by line. It is simple to use with the @profile decorator. It has not been maintained since November 2022.
guppy3: It inspects memory usage in CPython-based applications. It contains the subpackages etc, gsl, heapy and sets.
Fil: It is an open source memory profiler designed for data processing applications.
austin: It is a frame-level sampling profiler for CPython written in pure C. It can also track memory usage and has special support for multi-process applications.
Visualising performance reports#
There are several tools that can be used to view the resulting output.
SnakeViz: It is a browser-based visualiser for cProfile output. It displays an interactive call graph and sunburst view.
snakeviz <output.prof>
It shows hierarchical function calls in a horizontal or circular layout. It’s fully interactive. You can click any segment to zoom in, focus on specific paths, and explore which parts of your code dominate execution time.
viztracer: It is a tracing tool. It captures function call timings, arguments and return values. It generates interactive HTML reports and timelines. The first command below creates a file named result.json, which the second command opens as an interactive report in the browser.
viztracer <script.py>
vizviewer <result.json>
gprof2dot: It can be used to generate a visual call graph image. The first command converts profiler output into the Graphviz DOT format. Then you can use the dot command to generate an image. (Note that Graphviz must be installed separately.)
gprof2dot -f pstats <output.prof> -o <callgraph.dot>
dot -Tpng <callgraph.dot> > <output.png>
pyCallGraph: It creates call graph visualisations. It groups application modules based on execution time, function calls and memory usage. It is not maintained anymore.
Tuna: It is an alternative to SnakeViz for visualising cProfile output. It handles runtime and import profiles, has minimal dependencies, and uses d3 and bootstrap.
tuna <output.prof>
Speedscope: It is a fast, interactive viewer for flamegraphs. Flamegraphs show function call hierarchies and hot paths in a visual tree format.
AI Workflows#
The tools listed above that support Python, such as
Score-P or Extrae for profiling/tracing
Vampir or Paraver for visualisation
LIKWID or PAPI for measuring hardware/software counters
Linaro Forge tools
as well as the Python profiling tools, can be used for Python-based AI workflows. In particular, the Python profiler cProfile can be a starting point for gaining insight into performance bottlenecks. Similarly, for AI applications that use machine learning frameworks, it’s best to start with the framework’s built-in profiler. PyTorch is one of these frameworks and is used in the ICHEC SEODA project. See the handbook page for PyTorch profiling. It provides instructions on how to collect performance data on both CPU and GPU. The data can then be viewed with TensorBoard or other third-party trace viewer tools. When targeting GPU-specific performance analysis, the NVIDIA Performance Analysis Tools offer a comprehensive suite of features for optimising GPU-accelerated HPC AI workflows.