Performance#

This chapter covers performance engineering techniques and tooling for HPC applications.

Introduction#

Performance engineering in HPC involves identifying and resolving performance bottlenecks in HPC applications. Performance is quantified by the speed (time to solution) and efficiency (resource utilisation) with which a desired solution is generated.

Performance analysis of traditional HPC applications primarily focuses on:

speedup
scalability
memory bandwidth
I/O
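As a minimal sketch of the first two items, speedup and parallel efficiency can be computed from measured run times. The function names and the timings below are hypothetical and only illustrate the arithmetic, assuming a strong-scaling experiment in which the problem size is fixed.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S = T_1 / T_p for a fixed problem size."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, n_workers: int) -> float:
    """Efficiency E = S / p; 1.0 corresponds to ideal (linear) scaling."""
    return speedup(t_serial, t_parallel) / n_workers


# Hypothetical wall-clock times in seconds, keyed by number of processes.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 17.0}

for n_workers, t in timings.items():
    s = speedup(timings[1], t)
    e = parallel_efficiency(timings[1], t, n_workers)
    print(f"{n_workers:2d} processes: speedup {s:5.2f}, efficiency {e:5.2f}")
```

An efficiency that falls as the process count grows is the usual signal that scalability needs a closer look.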

AI and machine learning based HPC applications introduce additional concepts such as:

workflow management
data processing and manipulation
use of multiple GPUs or other specialised hardware
inference latency
model accuracy

ICHEC’s Performance Engineering Activity deals with HPC applications and workflows from internal ICHEC projects (for example, those related to Earth Observation or Quantum Computing), projects that are part of the Irish National HPC Service through L1/L2 support, and projects supported through national funding or European frameworks such as EuroHPC.

Performance Engineering Standard Workflow#

Using a standardised workflow is useful in performance engineering, as an analyst may encounter applications and codebases from domains that they are not familiar with. It also allows re-use of tooling, templates and report structures.

Taking an iterative approach, some standard steps are:

Preparation:

Preparing the application for performance measurement. There are two common ways to do this:

Sampling: A running application is periodically interrupted to take measurements at run time.
Instrumentation: Code is inserted into the application codebase to collect data from specific sections, so that every event of interest is captured during execution. This can be done by manual instrumentation of the source code, by automatic instrumentation by the compiler, or by linking against pre-instrumented libraries.
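Manual instrumentation can be illustrated with a small sketch. The `region` helper, the `region_stats` dictionary and the `compute` function below are hypothetical names introduced only for this example; real tools provide their own instrumentation APIs with much lower overhead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated call count and wall-clock time per named region.
region_stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


@contextmanager
def region(name):
    """Record one enter/leave pair for the code executed inside the block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stats = region_stats[name]
        stats["calls"] += 1
        stats["seconds"] += time.perf_counter() - start


def compute(n):
    """Stand-in for a section of the application worth measuring."""
    return sum(i * i for i in range(n))


for _ in range(3):
    with region("compute"):
        compute(100_000)

for name, stats in region_stats.items():
    print(f"{name}: {stats['calls']} calls, {stats['seconds']:.4f} s")
```

Unlike sampling, this approach records every execution of the instrumented region, at the cost of modifying the source and adding a small overhead to each call.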

Measurement:

Collecting/recording raw data required to evaluate the performance metrics and visualising them. This is commonly done by:

Profiling: This provides metrics of an application run such as time spent in a routine, number of calls to a function, CPU/GPU usage, transferred bytes or hardware counters (e.g. number of instructions executed, number of cycles).
Tracing: This provides a complete timeline of recorded events along the execution path of each computational resource (CPU thread/process, GPU). The traces contain information about the state of the code at a particular time.
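The difference between the two can be sketched with a toy example: a trace keeps every timestamped event per resource, while a profile is an aggregate that can be derived from it. The event format and the data below are hypothetical and much simpler than real trace formats; the sketch also assumes regions are not nested.

```python
from collections import defaultdict

# Hypothetical trace: (timestamp in seconds, resource, event kind, region name).
trace = [
    (0.0, "rank0", "enter", "compute"),
    (0.0, "rank1", "enter", "compute"),
    (1.0, "rank1", "leave", "compute"),
    (1.0, "rank1", "enter", "MPI_Wait"),
    (3.0, "rank0", "leave", "compute"),
    (3.0, "rank1", "leave", "MPI_Wait"),
]


def profile_from_trace(events):
    """Reduce a timeline of enter/leave events to total time per (resource, region)."""
    open_regions = {}
    totals = defaultdict(float)
    for timestamp, resource, kind, name in events:
        key = (resource, name)
        if kind == "enter":
            open_regions[key] = timestamp
        else:
            totals[key] += timestamp - open_regions.pop(key)
    return dict(totals)


profile = profile_from_trace(trace)
for (resource, name), seconds in sorted(profile.items()):
    print(f"{resource} {name}: {seconds:.1f} s")
```

The profile shows that `rank1` spent time waiting, but only the trace shows when the wait happened and which other resource it was waiting for.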

There are many tools and techniques available for profiling and tracing, each with different strengths and specialities. They are reviewed later in this chapter and further in the Advanced Performance Engineering Chapter. Often a combination, or ‘toolbox’, of tools is useful for investigating an application.

Analysis:

Processing performance data measured in the previous step by using metrics to identify bottlenecks.

The Performance Optimisation and Productivity Centre of Excellence in HPC (POP-COE) offers a set of metrics to measure the performance of applications. These metrics are useful for measuring the quality of parallelisation of codes using MPI, OpenMP, or hybrid (MPI+OpenMP).

For HPC AI workflows, there are more specific metrics depending on the application requirements and optimisation goals. For example, in image recognition, accuracy and inference time might be prioritised, while in natural language processing, training time and model size could be critical.

Metrics are compared to expected behaviour using models such as the ‘Roofline Model’. Comparing measurements against the performance expectation of such an analytic model is a reliable way to identify bottlenecks and define optimisation strategies. The three most common bottlenecks are:

computational bottlenecks
memory bottlenecks
bandwidth bottlenecks

Identifying and addressing the bottlenecks is the first step in troubleshooting performance issues.
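As a minimal sketch of how such a model sets an expectation, the basic Roofline bound limits attainable performance to the smaller of the peak compute rate and the product of arithmetic intensity and memory bandwidth. The machine numbers and the function names below are hypothetical, chosen only to show the calculation.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(peak compute, arithmetic intensity * memory bandwidth).

    arithmetic_intensity is in FLOP/byte, bandwidth_gbs in GB/s.
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def limiting_resource(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Classify a kernel relative to the ridge point of the roofline."""
    ridge_point = peak_gflops / bandwidth_gbs  # FLOP/byte where the two limits meet
    return "compute" if arithmetic_intensity >= ridge_point else "memory bandwidth"


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

for name, intensity in [("stream-like kernel", 0.25), ("dense kernel", 20.0)]:
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    limit = limiting_resource(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"{name}: at most {bound:.0f} GFLOP/s, limited by {limit}")
```

A measured rate far below the bound suggests that something other than the modelled limits, for example latency or load imbalance, is holding the kernel back.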

Optimisation:

Applying modifications intended to eliminate performance issues analysed in the previous step.

While there are generic optimisation techniques such as loop optimisation, data locality, vectorisation and data/task parallelism, their implementation is specific to the application and to the source of the bottleneck. Code optimisation requires significant effort and is a trade-off between several factors. It can lead to wasted effort and less maintainable code. It’s generally better to prioritise readability and correctness first, then optimise only critical performance bottlenecks.

Performance Metrics#

POP-COE (Performance Optimisation and Productivity Centre of Excellence in HPC) is an initiative focused on improving the performance and productivity of parallel applications in HPC. It offers a methodology for performance analysis of parallel applications, which consists of a set of metrics, each of which investigates one source of inefficiency in the application.

It is a hierarchical model where child metrics are multiplied to get the parent metric. These metrics help to identify the inefficiencies in the parallel structure of the code and the impact of scaling:

Global Efficiency (GE)
- Parallel Efficiency (PE)
  - Load Balance Efficiency (LB)
  - Communication Efficiency (CommE)
    - Serialisation Efficiency (SerE)
    - Transfer Efficiency (TE)
- Computational Scaling (CompS)
  - IPC Scaling
  - Instruction Scaling
  - Frequency Scaling
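The multiplicative structure of the hierarchy can be sketched as follows. The leaf values are hypothetical and the `evaluate` helper is introduced only for illustration; in practice the values come from measurements.

```python
from math import prod

# Parent -> children, following the hierarchy listed above.
HIERARCHY = {
    "GE": ["PE", "CompS"],
    "PE": ["LB", "CommE"],
    "CommE": ["SerE", "TE"],
    "CompS": ["IPC Scaling", "Instruction Scaling", "Frequency Scaling"],
}


def evaluate(metric, leaves):
    """Return a metric value: a measured leaf, or the product of its children."""
    if metric in leaves:
        return leaves[metric]
    return prod(evaluate(child, leaves) for child in HIERARCHY[metric])


# Hypothetical leaf values for one run.
leaves = {
    "LB": 0.80,
    "SerE": 0.90,
    "TE": 0.95,
    "IPC Scaling": 0.98,
    "Instruction Scaling": 0.99,
    "Frequency Scaling": 1.00,
}

for metric in ["GE", "PE", "CommE", "CompS"]:
    print(f"{metric}: {evaluate(metric, leaves):.3f}")
```

Because the metrics multiply, the lowest child at each level points to the branch of the hierarchy that deserves attention first; here that is Load Balance Efficiency.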

The POP metrics can be calculated using data collected with tracing tools such as Extrae and Score-P, detailed in the Advanced Performance Engineering Chapter.

For parallel efficiency, perform trace measurements and a critical path analysis of the trace data, which illustrates computation/communication/delay patterns. Zoom in on the trace data to select regions on which to focus the analysis. Tools such as the Scalasca trace analyser with Score-P measurements provide even more insight into the performance behaviour of the application. For computational scalability, perform profile measurements with suitable hardware counters such as those available in PAPI. The POP metrics can then be viewed with a visualisation tool such as Paraver for Extrae and Cube for Score-P.

Metrics are usually presented in a table with coloured cells: green for high values, red for low values. Values for a metric are typically between 0 and 1 (0% and 100%) and describe how efficiently the compute resources are used with respect to that metric.

Standard POP metrics apply to large-scale MPI or OpenMP applications, while the POP methodology is applicable to various parallelism paradigms. There are two extensions to support hybrid parallel programs (MPI+OpenMP or MPI+CUDA) with the POP methodology: additive and multiplicative metrics.

Example#

Fig. 27 was generated for the ‘IGMPlot’ code as part of a collaborative European project in 2020. It shows the values of the POP metrics in each row for an increasing number of OpenMP threads. The analysis was done by POP. The metrics in the table show that the main factor limiting global efficiency is the OpenMP parallelisation. A critical path analysis of the timeline with Paraver shows that load balance starts off acceptable but degrades quickly, resulting in rapidly decreasing parallel efficiency. This is due to waiting for a small number of threads to complete before the next set of calculations can begin in each iteration.

../_images/popmetrics-usecase.png — Fig. 27 POP metrics use case#

Metrics Explanations#

The ideal runtime of a parallel application is the runtime that would be achieved if all the bottlenecks were removed, i.e. zero communication cost, zero load imbalance and zero parallel overhead. Given the measured actual runtime, the efficiency metrics quantify how far it is from this ideal performance. The scaling metrics, on the other hand, measure how computational performance (for example, instruction count and instructions per cycle) changes with the number of processes/threads.

POP Metrics

Global Efficiency (GE): It describes how well the parallelisation of the application works. It is a combination of Parallel Efficiency and Computational Scaling.

GE = PE * CompS

If Global Efficiency is low, one or both of these metrics will also be low, which indicates in which direction to search for the inefficiency.

Parallel Efficiency (PE): It describes how well the execution of the code in parallel is working. It is a combination of Load Balance Efficiency and Communication Efficiency.

PE = LB * CommE

If Parallel Efficiency is low, the performance bottlenecks can be caused by synchronisation or unbalanced work distribution.

Load Balance Efficiency (LB): It describes how well the work is distributed between processes/threads. It is the ratio between the average time a process spends in computation and the maximum time a process spends in computation.

LB = average(computation time) / max(computation time)

Load Balance is low when work is distributed unevenly between threads/processes, so that some are forced to wait for others to finish their work. In this case, it can be investigated further by zooming in on the synchronisation points in the trace data.
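The Load Balance formula above can be sketched directly. The per-process computation times are hypothetical; in practice they would be extracted from trace or profile data.

```python
def load_balance(computation_times):
    """LB = average(computation time) / max(computation time)."""
    average = sum(computation_times) / len(computation_times)
    return average / max(computation_times)


# Hypothetical time in seconds that each of four processes spends in computation.
balanced = [8.0, 8.0, 8.0, 8.0]
imbalanced = [8.0, 10.0, 6.0, 8.0]

print(f"balanced:   LB = {load_balance(balanced):.2f}")
print(f"imbalanced: LB = {load_balance(imbalanced):.2f}")
```

In the imbalanced case the slowest process takes 10 s while the average is 8 s, so on average 20% of the available compute time is lost waiting for it.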

Communication Efficiency (CommE): It indicates the loss of efficiency caused by communication. It can be split into Serialisation Efficiency and Transfer Efficiency.

CommE = SerE * TE

If Communication Efficiency is low, the performance bottlenecks can be caused by extra time spent in communication routines/data transfers or by chains of dependencies.

Serialisation Efficiency (SerE): It describes the loss of efficiency due to dependencies between processes which serialise program execution. It is computed as

SerE = max (useful time) / ideal runtime

A low value indicates dependencies between processes that force parts of the execution to run one after another.

Transfer Efficiency (TE): It describes loss of efficiency due to actual data transfer time. It is computed as

TE = ideal runtime / actual runtime

If it is low, the execution suffers from a high overhead of the runtime or from poor network latency or bandwidth.
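The two formulas above combine into Communication Efficiency, which can be sketched with hypothetical timings. The function names are introduced only for this example, and the ideal runtime is assumed to be already known, since it cannot be measured directly on the real machine.

```python
def serialisation_efficiency(max_useful_time, ideal_runtime):
    """SerE = max(useful time) / ideal runtime."""
    return max_useful_time / ideal_runtime


def transfer_efficiency(ideal_runtime, actual_runtime):
    """TE = ideal runtime / actual runtime."""
    return ideal_runtime / actual_runtime


def communication_efficiency(max_useful_time, ideal_runtime, actual_runtime):
    """CommE = SerE * TE, which simplifies to max(useful time) / actual runtime."""
    return (serialisation_efficiency(max_useful_time, ideal_runtime)
            * transfer_efficiency(ideal_runtime, actual_runtime))


# Hypothetical timings in seconds for one run.
max_useful = 9.0   # longest time any process spends in useful computation
ideal = 10.0       # runtime with the communication cost removed
actual = 12.5      # measured runtime

print(f"SerE  = {serialisation_efficiency(max_useful, ideal):.2f}")
print(f"TE    = {transfer_efficiency(ideal, actual):.2f}")
print(f"CommE = {communication_efficiency(max_useful, ideal, actual):.2f}")
```

Note that the ideal runtime cancels in the product, so Communication Efficiency can be obtained from measurements alone; the ideal runtime is only needed to split the loss into its serialisation and transfer parts.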

Computational Scaling (CompS): It describes how the time spent in computation scales with the number of processes/threads. If Computational Scaling is low, it can be the effect of one or a combination of the following metrics: Instruction Scaling, Instructions Per Cycle (IPC) Scaling and Frequency Scaling. The number of instructions can be obtained from performance counters. PAPI is one of the main tools used for reading hardware counters. Tracing tools need to be installed with PAPI support to produce the POP metrics for Computational Scaling.

Instruction Scaling: It compares the total number of instructions executed for different numbers of threads/processes. A decrease in Instruction Scaling corresponds to an increase in the total number of instructions required to solve a computational problem. This can be caused by extra computation, for example when the work decomposition itself must be computed or when computations on the surface/boundary are replicated.

IPC Scaling: It compares how many instructions per cycle (IPC) are executed for different numbers of threads/processes. Higher values indicate that the rate of computation has increased. A decrease in IPC can be caused by processes/threads waiting for cache or memory, or by contention for network or memory bandwidth.

Frequency Scaling: It compares the processor frequency for different numbers of threads/processes. It can be related to runtime variations during execution.
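A minimal sketch of how the three scaling metrics relate to Computational Scaling is given below. It assumes each metric is taken relative to a reference run (typically the smallest process count) and that instructions, cycles and useful time are summed over all processes; the `Counters` class and the counter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Totals over all processes/threads for the useful computation of one run."""
    instructions: float
    cycles: float
    useful_seconds: float

    @property
    def ipc(self):
        return self.instructions / self.cycles

    @property
    def frequency(self):
        return self.cycles / self.useful_seconds  # average frequency in Hz


def scaling_metrics(reference, run):
    """Scaling metrics of `run` relative to `reference` (1.0 means no degradation)."""
    instruction_scaling = reference.instructions / run.instructions
    ipc_scaling = run.ipc / reference.ipc
    frequency_scaling = run.frequency / reference.frequency
    return {
        "Instruction Scaling": instruction_scaling,
        "IPC Scaling": ipc_scaling,
        "Frequency Scaling": frequency_scaling,
        "CompS": instruction_scaling * ipc_scaling * frequency_scaling,
    }


# Hypothetical counter totals for a reference run and a run on more processes.
reference = Counters(instructions=1.0e12, cycles=5.0e11, useful_seconds=200.0)
scaled = Counters(instructions=1.1e12, cycles=6.0e11, useful_seconds=250.0)

metrics = scaling_metrics(reference, scaled)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the product of the three factors equals the ratio of total useful computation time between the reference run and the scaled run, which is why they together explain a low Computational Scaling value.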

Performance Tools#

Please visit this handbook page for a list of performance engineering tools.