Performance#

Tools#

This section lists performance engineering tools. It covers tools provided by VI-HPS, tools taught in EuroCC training courses, and tools used in ICHEC projects. Many other options are available for HPC applications.

Tools can be categorised by purpose:

  • Performance analysis tools provide information about the runtime behavior of an application, its performance bottlenecks and its optimisation potential.

  • Debugging tools investigate an application for possible errors and help to fix them.

  • Correctness checking tools detect errors in the usage of programming models, such as data races in OpenMP.

Tools can also be categorised by the parallel programming models or platforms they support, by compiler support, and by the way they are built and run on HPC clusters. The VI-HPS tools guide lists the properties of the performance tools developed by its partner institutions and specifies the focus categories for each. These tools mostly support traditional HPC programming models. See this topic for information about performance engineering tools for HPC AI Workflows.

VI-HPS Performance Tools#

VI-HPS is a collaborative research initiative focused on developing advanced programming tools and techniques for HPC. The main tools introduced in the 46th VI-HPS Tuning Workshop are:

  1. Score-P/Scalasca/CUBE: This is an integrated performance analysis toolkit developed by JSC (Jülich Supercomputing Centre). Score-P collects performance data in profiles and execution traces, Scalasca analyses and identifies performance issues, and CUBE provides visualisation of certain metrics and their distribution on functions and resources. Scalasca and Score-P write analysis reports in CUBE format and these reports can also be viewed by some other tools such as Vampir and TAU.

  2. Extrae/Paraver/Dimemas: These are performance analysis tools developed by BSC (Barcelona Supercomputing Centre). Extrae captures detailed execution traces and stores the collected information in three different files. Paraver reads these files and provides powerful visualisation and analysis capabilities to help identify performance bottlenecks and optimise parallel code. Dimemas is a simulator used to predict the application’s behavior on different systems.

  3. MUST/Archer/OTF-CPT: The correctness analysis tools MUST and Archer and the performance analysis tool OTF-CPT were developed by the IT Center of the RWTH Aachen University. Archer is a dynamic data race detector for OpenMP programs. MUST is a runtime error detection tool for MPI applications. OTF-CPT is a lightweight performance analysis tool that reports a summary of POP metrics for MPI+OpenMP applications.

  4. MAQAO: The MAQAO toolkit was developed by the performance tools team at UVSQ (LI-PaRAD Laboratory). It analyses performance at the binary level, with a focus on core performance. It comes with six main modules:

  • LProf (Lightweight Profiler) as a sampling-based profiler that provides a list of hot spots (loops and/or functions) collected during program execution;

  • CQA (Code Quality Analyzer) as a static analyser assessing the quality of the executable and producing a set of reports describing potential issues, estimations of the gain if fixed, and hints on how to achieve this;

  • ONE View as a module that invokes other modules and aggregates their results to produce reports in HTML or XLS format;

  • VProf as a value profiler relying on instrumentation through binary rewriting;

  • DECAN (DECremental ANalyzer) as a module using differential analysis on innermost loops to locate performance issues;

  • ASSIST as a prototype code restructuring tool implementing advanced Profile Guided Optimisation techniques.

  5. CARM: This is a performance analysis tool that focuses on roofline modelling, developed by INESC-ID, Instituto Superior Técnico, Universidade de Lisboa. Roofline modelling visualises the performance limits of an application in terms of memory bandwidth and peak floating-point performance. The tool also provides recommendations for optimising the application’s performance. This paper explains roofline modelling.
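The roofline bound mentioned for CARM can be illustrated with a few lines of Python. This is a minimal sketch of the basic roofline formula only, not of CARM itself: the function names and the machine figures (1000 GFLOP/s peak, 100 GB/s memory bandwidth) are hypothetical values chosen for illustration.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Basic roofline bound: performance is capped either by peak compute
    or by memory bandwidth multiplied by arithmetic intensity (FLOP/byte)."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine parameters, for illustration only.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

for intensity in (0.25, 1.0, 10.0, 50.0):
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    if intensity < ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS):
        regime = "memory-bound"
    else:
        regime = "compute-bound"
    print(f"AI = {intensity:6.2f} FLOP/byte -> at most {bound:7.1f} GFLOP/s ({regime})")
```

A kernel whose measured performance sits well below its bound has room for optimisation. A kernel on the sloped part of the roof benefits mainly from reducing memory traffic.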

These tools are freely available for download and use in applications. Please see the VI-HPS tools guide for supported languages and platforms.
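The POP metrics that OTF-CPT summarises can be illustrated with a small calculation. The sketch below assumes the commonly used POP definitions of load balance, communication efficiency and parallel efficiency. The function name and the per-process timings are hypothetical, and the code does not reproduce OTF-CPT's actual output.

```python
def pop_metrics(useful_times, runtime):
    """Compute basic POP efficiencies from per-process useful compute time.

    useful_times: seconds each process spent in useful computation
    runtime:      total wall-clock time of the parallel region in seconds
    """
    avg_useful = sum(useful_times) / len(useful_times)
    max_useful = max(useful_times)

    load_balance = avg_useful / max_useful           # how evenly work is spread
    communication_efficiency = max_useful / runtime  # time lost outside computation
    parallel_efficiency = load_balance * communication_efficiency

    return {
        "load_balance": load_balance,
        "communication_efficiency": communication_efficiency,
        "parallel_efficiency": parallel_efficiency,
    }


# Hypothetical timings for four MPI ranks, for illustration only.
metrics = pop_metrics(useful_times=[8.0, 6.0, 7.0, 7.0], runtime=10.0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A low load balance points to uneven work distribution. A low communication efficiency points to time spent in communication or synchronisation.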

Community Tools#

  • LIKWID: LIKWID is a lightweight command line performance tool suite for the GNU/Linux operating system. The likwid-topology tool provides a detailed view of the thread and cache topology of a single node. This is useful because the hardware architecture must be understood before the performance of parallel applications can be interpreted. likwid-perfctr is used to measure hardware performance counters for the application. likwid-pin is used to control the affinity of tasks to specific CPU cores or NUMA nodes. The ICHEC handbook has a section about LIKWID, created in 2021, which explains the LIKWID tools in detail. For more tools and more up-to-date information, please check the official website.

  • TAU: TAU is a profiling and tracing tool. It gathers performance data through an automatic instrumentor based on the Program Database Toolkit (PDT), dynamically using DyninstAPI, at runtime in the Java Virtual Machine, or manually using the instrumentation API. Collected data can be visualised in the paraprof profiler. Event traces can be displayed with the JumpShot trace visualisation tool.

  • PAPI: PAPI is a library that provides an interface to hardware performance counters found across the system, such as in CPUs, GPUs, memory, interconnects, the I/O system, and energy and power components. Hardware counters such as CPU cycles, cache misses, and branch mispredictions provide valuable information about code sections that can be improved. While it can be used as a standalone tool, many performance analysis tools, such as Scalasca, Score-P and TAU, use it as a backend to collect performance data.

  • DARSHAN: DARSHAN is a lightweight I/O profiler for frameworks such as MPI-IO, HDF5, PNetCDF, and standard POSIX calls. It collects performance data through I/O instrumentation at link-time (for static and dynamic executables) or at runtime and produces a log of information that can be viewed by DARSHAN commands.

  • Valgrind: Valgrind is an instrumentation framework for building debugging and profiling tools. Some of its well-known tools are:

    • Memcheck detects memory management problems, primarily in C and C++ programs, such as memory leaks, invalid memory accesses, uninitialized reads, and mismatched allocation/deallocation.

    • Cachegrind is a cache profiler for analysing cache performance and identifying cache misses.

    • Callgrind is a profiler that provides detailed information about function call times and call graphs.

    • Helgrind is a thread debugger which finds data races in multithreaded programs.

Other Valgrind tools can be found on the official web page.

Commercial#

  • Linaro Forge DDT/MAP/PR: Linaro Forge provides easy-to-use and efficient tools for parallel debugging, profiling and performance reports. Linaro Forge DDT is an advanced debugger designed to simplify troubleshooting and code optimisation. Linaro Forge MAP is an easy-to-use profiling tool that shows metrics for processor, memory, communication and I/O activity. Linaro Forge PR analyses specific performance aspects of the application and produces a simple one-page HTML report highlighting issues and hints on how to resolve them.

  • VAMPIR: VAMPIR is an interactive graphical tool for trace visualisation and analysis. It reads trace files in OTF2 format, which can be produced by Score-P. The tool provides a comprehensive suite of features with a special focus on highly parallel applications, from message passing to multithreaded and accelerator-based paradigms.

  • TotalView: TotalView is a powerful tool for debugging and analysing both serial and parallel applications. It provides graphical and command line interfaces as well as script-based environments for debugging.

Vendor Tools#

  • Intel® Toolkits: The Intel oneAPI Base Toolkit is a unified programming environment for x86-based processors that provides a single, cross-architecture programming model for Intel CPUs, GPUs, FPGAs, and other accelerators. It is not primarily a performance analysis tool, but it includes several components that can be used for performance analysis. Some of them are:

    • Intel® VTune™ Profiler: It supports profiling of CPU, GPU, FPGA, threading, memory, cache, storage and power. Profiling data can be viewed in architecture diagrams, as a histogram, or on a timeline. Analysis types include hotspot analysis, microarchitecture, parallelism, platform analysis, power usage and low-level counters. Please see this page for more details. This tool was used in several ICHEC National and Academic Flagship projects during the Kay era.

    • Intel® Advisor: It helps to identify issues related to threading and vectorisation. It provides a Memory Access Patterns (MAP) report, data dependency analysis and survey hotspots analysis, and illustrates the roofline performance model. Like many other Intel performance tools, it offers both a command-line interface and a graphical user interface. Please see this page for more details.

    • Intel® System Debugger: The Intel debugger IDB (Intel Distribution for GDB) is a simple tool for quick debugging of parallel applications. Intel® System Debugger enables deep analysis of system hardware at chip level. Please see this page for more details.

  • NVIDIA Performance Analysis Tools: The two main performance analysis tools in NVIDIA Developer Tools are NVIDIA Nsight Systems and NVIDIA Nsight Compute.

    • NVIDIA Nsight Systems provides a system-wide visualisation of an application’s performance. Data is collected from the command-line interface and can then be copied to any system and analysed later with the Nsight Systems visualisation tools suitable for the host architecture. It visualises system workload metrics on a timeline and offers tools to identify, understand, and resolve performance bottlenecks. See this page for a tutorial series on how to use the profiler and optimise communication, memory allocation etc. for different platforms.

    • NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. It provides targeted metric sections for various performance aspects and presents them in tables and charts. These metrics detail the workload on the GPU, including the number of instructions executed and the distribution of work across different SMs. They also describe the memory access patterns of the application and give insight into the GPU scheduler’s behavior, including task scheduling, warp scheduling, and occupancy management. The tool highlights the locations of performance bottlenecks within the source code and offers practical solutions. In addition, its baseline feature allows different versions of the same kernel to be compared within the tool. ICHEC performance engineering projects rely heavily on Nsight Compute for kernel optimisation.

AI Workflows#

The tools listed above that support Python, such as

  • Score-P or Extrae for profiling/tracing

  • Vampir or Paraver for visualisation

  • LIKWID or PAPI for measuring hardware/software counters

  • Linaro Forge tools

can be used for Python-based AI Workflows. In particular, the Python profiler cProfile is a good starting point for gaining insight into performance bottlenecks. Please see the handbook page for Python profiling. Similarly, for AI applications that use machine learning frameworks, it is best to start with the framework’s built-in profiler. PyTorch is one of these frameworks and is used in the ICHEC SEODA project. See the handbook page for PyTorch profiling. It provides instructions on how to collect performance data on both CPU and GPU. The data can then be viewed in TensorBoard or other third-party trace viewers. For GPU-specific performance analysis, NVIDIA Performance Analysis Tools offer a comprehensive suite of features for optimising GPU-accelerated HPC AI Workflows.
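As a starting point with cProfile, the following minimal sketch profiles a small function using only the Python standard library and returns a report sorted by cumulative time. The workload and function names are hypothetical placeholders for real application code.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately simple hot spot: a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    """Hypothetical workload that calls the hot spot repeatedly."""
    return [slow_sum_of_squares(20_000) for _ in range(50)]


def profile_workload():
    """Run the workload under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the top five entries.
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_workload())
```

An existing script can also be profiled without modification by running `python -m cProfile -s cumulative script.py`, where `script.py` stands for the application's entry point.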