Performance#

LIKWID#

LIKWID is a lightweight performance-oriented tool suite targeting x86 processors. It builds upon the Linux MSR (Model Specific Register) kernel module: by simply reading the MSR device files, it reports various performance metrics, for example FLOPS, bandwidth, load-to-store ratio, and energy. It can use perf as a backend but also provides other backends so it does not depend on the kernel version. Its main focus is on providing a collection of command line tools for the end user.

Introduction#

LIKWID is a tool suite of command line applications for performance-oriented programmers. It has been developed and maintained by the HPC group at Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg, Germany, since 2009. While the focus of LIKWID is on x86 processors, some of the tools are portable and not limited to any specific architecture. Since v5.0.0, released in November 2019, full support has been extended to ARM and POWER CPUs with the same functionality and features as for x86. See the GitHub page for a full list of supported architectures.

The current version is likwid-5.0.1; support for Nvidia GPU monitoring (via the NvMarkerAPI) has been available since v5.0.0. All functionality is provided as a C library so it can be integrated into other tools. Instrumentation support is offered for C/C++, Fortran, Lua, Python and Java. LIKWID follows the philosophy:

  • Simple

  • Efficient

  • Portable

  • Extensible

It consists of the following components:

Gather Node Architecture Information:

  • likwid-topology: Display system topology ranging from thread topology to cache and NUMA topology.

  • likwid-powermeter: Measure energy consumption of an application using Intel RAPL counters.

Enforce Affinity Control and Data Placement:

  • likwid-pin: Pin application threads to specified CPUs.

  • likwid-mpirun: Enable pinning of MPI and MPI/threaded hybrid applications.

Query and Alter System Settings:

  • likwid-setFrequencies: Manipulate CPU and Uncore frequencies.

  • likwid-perfctr: Measure hardware counters for an application and show derived metrics.

Microbenchmarking:

  • likwid-bench: Micro-architectural benchmarking by running hand-tuned assembly benchmarks for memory bandwidths, instruction counts and FP operations.

  • likwid-memsweeper: Cleanup L3 and ccNUMA domains.

More…:

  • likwid-agent: Monitoring agent with support for multiple output backends.

  • likwid-genTopoCfg: Config file writer that saves the system topology to a file for faster startup.

  • likwid-perfscope: Perform live plotting of performance data using gnuplot.

Installation#

LIKWID tries to be as simple to install and use as possible, without fancy GUIs or library dependencies. You can get the releases of LIKWID at http://ftp.fau.de/pub/likwid/. Installation steps:

tar -xjf likwid-<VERSION>.tar.bz2
cd likwid-<VERSION>
vim config.mk
make
sudo make install

sudo is required to install the access daemon with the proper permissions. On joules, we set PREFIX ?= /ichec/packages/likwid/5.0.1 in config.mk.

To load LIKWID on joules:

module load likwid

LIKWID uses the Linux MSR module: it reads the MSR files from user space and reports the hardware performance counters behind a number of performance metrics. Thus, the MSR device files must be present. This can be checked with ls /dev/cpu/*/msr, which should list one MSR device file per available CPU. On your own system you can usually use LIKWID with direct access to the MSR files; set the file permissions of the MSR device files with chmod o+rw /dev/cpu/*/msr.

If you install LIKWID on a shared system such as an HPC compute cluster, you may consider using the access daemon likwid-accessD. The access daemon gives ordinary users access to the hardware performance registers and is written with security in mind: it restricts accesses to performance-related registers, so users cannot read or write system-related registers. Measurements through the access daemon involve more overhead. Enable and configure the daemon in config.mk.
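For example, the relevant settings in config.mk might look like the sketch below (based on our joules installation; ACCESSMODE and BUILDDAEMON are the option names used in the config.mk shipped with LIKWID):

PREFIX ?= /ichec/packages/likwid/5.0.1   # installation prefix
ACCESSMODE = accessdaemon                # use likwid-accessD instead of direct MSR access
BUILDDAEMON = true                       # build and install the access daemon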

An alternative to using the daemon is to relax the perf_event paranoia level with sudo sh -c 'echo 0 >/proc/sys/kernel/perf_event_paranoid'. A value of 0 allows measurements of the whole CPU, not only a specific PID, and allows reading of uncore counters.

likwid-topology#

likwid-topology prints all information about the thread, cache and memory topology of a compute node, helping users better understand the architecture they are running on. It extracts this information from the hwloc library or directly from procfs/sysfs. Some information is read with the cpuid instruction. likwid-topology reports on:

  • Thread topology: How processor IDs map on physical compute resources

  • Cache topology: How processors share the cache hierarchy

  • Cache properties: Detailed information about all cache levels

  • NUMA topology: NUMA domains and memory sizes

  • GPU topology: GPU information

likwid-topology -h
likwid-topology -- Version 5.0.1 (commit: 233ab943543480cd46058b34616c174198ba0459)
A tool to print the thread and cache topology on x86 CPUs.

Options:
-h, --help		         Help message
-v, --version		     Version information
-V, --verbose <level>	 Set verbosity
-c, --caches		     List cache information
-C, --clock		         Measure processor clock
-O			             CSV output
-o, --output <file>	     Store output to file. (Optional: Apply text filter)
-g			             Graphical output

likwid-topology numbers the processor IDs in the HWThread column as they appear in the Linux OS. Thread is the SMT (Simultaneous Multi-Threading) thread number inside a core. Core is the physical CPU core number. The Socket column lists the socket numbers of the hardware threads. The last column marks the hardware threads available to your application with *. The cache topology section lists some basic information for every cache level. The last part of the output is the NUMA topology. To get more information about the caches:

likwid-topology -c

For graphical output:

likwid-topology -g

likwid-pin#

likwid-pin helps pin a sequential or multithreaded application to dedicated processors, so applications do not migrate over the course of the job execution and lose cache locality. It explicitly supports pthreads and the OpenMP implementations of Intel and GNU gcc. If OMP_NUM_THREADS is not present in your environment, it sets OMP_NUM_THREADS to as many threads as specified in your pin expression. It supports different numbering schemes for pinning:

  • Using a thread list

  • Using an expression-based thread list

  • Using a scatter policy

Physical Numbering#

Processors are numbered according to the numbering in the OS and the IDs printed by likwid-topology. Examples:

likwid-pin -c 1 ./a.out (on CPU with ID 1)
likwid-pin -c 1,4 ./a.out (on CPUs with ID 1 and 4)
likwid-pin -c 1-3 ./a.out (on CPUs ranging from ID 1 to ID 3, hence CPUs 1,2,3)

Logical Numbering#

LIKWID supports logical numbering inside an affinity domain:

  • logical numbering in node: processors are numbered logically over the whole node (N prefix)

  • logical numbering in socket: processors are numbered logically within every socket (S# prefix, e.g., S0)

  • logical numbering in cache group: processors are numbered logically within each last level cache group (C# prefix, e.g., C1)

  • logical numbering in memory domain: processors are numbered logically within each NUMA domain (M# prefix, e.g., M2)

  • logical numbering within cpuset: processors are numbered logically inside the Linux cpuset (L prefix)

The affinity domain is optional; if not given, LIKWID assumes the domain N. All logical numberings list physical cores first. Examples:

likwid-pin -c N:0,1,2,3 ./a.out (on the first 4 physical cores of the node)
likwid-pin -c S0:0-1 ./a.out (on the first 2 physical cores of the first socket)
likwid-pin -c C0:0-3 ./a.out (on the first 4 physical cores on the first LLC group)
likwid-pin -c M0:0-3 ./a.out (on the first 4 physical cores on the first NUMA domain)
likwid-pin -c S0:0-3@S1:0-3 ./a.out (on the first 4 physical cores of each socket)

To print out the available thread domains, use likwid-pin -p.

Numbering by Expression#

Expression-based thread lists are generated with compact processor numbering. Usage:

likwid-pin -c E:<thread domain>:<number of threads>[:<chunk size>:<stride>]

This selects <number of threads> threads in the affinity domain <thread domain>, taking <chunk size> threads in a row and then skipping <stride> threads.

Examples:

likwid-pin -c E:N:4:1:2 ./a.out (selects the first four physical cores on a system with 2 SMT threads per core)
likwid-pin -c E:S0:4 ./a.out (selects the first four SMT threads, i.e. the first two physical cores, of socket 0)

Scatter Expression#

The scatter expression distributes the threads evenly over the desired affinity domains.

likwid-pin -c M:scatter ./a.out

This will generate a thread to processor mapping scattered across all NUMA domains with physical cores first.

likwid-perfctr#

likwid-perfctr reports on hardware performance events, such as FLOPS, bandwidth, TLB misses and power; its marker API provides focused examination of code regions of interest. likwid-perfctr integrates the pinning functionality of likwid-pin, and the -C option can be used to specify the preferred affinity.

It supports almost all interesting core and uncore events for the supported CPU types. Almost all events defined in the Intel® Software Developer System Programming Manual and the AMD® BIOS and Kernel Developer’s Guides are available. Some may be missing because they require special handling, likely with additional registers.

It offers multiple modes:

  • Wrapper mode: End-to-end measurement of an application run. You can measure without altering your code. It is crucial to pin your application.

  • Stethoscope mode: Measure events on a set of cores for a specified duration, independent of any running code (see the examples after this list).

  • Timeline mode: Listen to a specific application on the node with a specified sampling frequency (can be ms or s; see the examples after this list).

  • Marker API: Lightweight API with region markers to measure regions in your code.
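For illustration, stethoscope mode takes a measurement duration via the -S switch, and timeline mode takes a sampling interval via -t (the core list and performance group here are arbitrary examples):

likwid-perfctr -c 0-3 -g MEM -S 10s (stethoscope mode: measure the MEM group on cores 0-3 for 10 seconds)
likwid-perfctr -c 0-3 -g MEM -t 500ms ./a.out (timeline mode: print MEM measurements every 500 ms while ./a.out runs)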

In order to relieve the user from having to deal with raw event counts, likwid-perfctr supports performance groups, which combine frequently used event sets with formulas for computing derived metrics. It is simple to create your own performance groups with custom derived metrics (a sketch is given at the end of this subsection). Print all supported groups on a processor:

likwid-perfctr -a

A help text explaining a specific event group can be requested with -H together with the -g switch:

likwid-perfctr -H -g MEM
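As an illustration of the group file format, a minimal custom group might look like the sketch below. Custom groups are plain-text files placed under $HOME/.likwid/groups/<ARCHITECTURE>/; the fixed-counter event names here are Intel examples and vary by architecture:

SHORT My custom CPI group

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE

METRICS
Runtime (RDTSC) [s] time
CPI FIXC1/FIXC0

LONG
Cycles per instruction computed from the fixed-purpose counters.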

Wrapper Mode#

Examples:

likwid-perfctr -C S0:1 -g MEM ./a.out

This pins the application to the second core on socket zero and measures the performance group MEM on this core.

Using multiple event sets or performance groups in a single run:

likwid-perfctr -C S0:0-3 -g L2 -g L3 -T 500ms ./a.out

This call measures the application pinned on the first 4 cores of socket 0. It starts by measuring the L2 performance group and after 500 milliseconds switches to the L3 group. After another 500 ms, the L2 group is programmed and measured again, and so on.

likwid-perfctr allows you to specify custom event sets. You can measure as many events in one event set as there are physical counters on an architecture. This is highly architecture dependent!

likwid-perfctr -C 0-3 -g FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE:PMC0,FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE:PMC1 ./a.out

Marker API#

The marker API allows you to measure named regions of your code and produces a report for each region. Overlapping or nesting of regions is allowed, and you can enter a region multiple times, e.g. in a loop; the counters for each region are accumulated. The marker API only reads out the counters: the configuration of the counters is still handled by the wrapper application likwid-perfctr.

It consists of a set of function calls and macros that enable measuring code regions. These are defined in the header file likwid.h. For example:

  • LIKWID_MARKER_INIT or likwid_markerInit(): To initialize the Marker API.

  • LIKWID_MARKER_START(char* tag) or likwid_markerStartRegion(char* tag): To start a named region identified by tag.

  • LIKWID_MARKER_STOP(char* tag) or likwid_markerStopRegion(char* tag): To stop a named region identified by tag.

  • LIKWID_MARKER_CLOSE or likwid_markerClose(): To finalize the Marker API and write the aggregated results of all regions to a file that is picked up by likwid-perfctr for evaluation.

#include <likwid.h>
. . .
LIKWID_MARKER_INIT;
. . .
LIKWID_MARKER_START("Compute");
. . .
LIKWID_MARKER_STOP("Compute");
. . .
LIKWID_MARKER_START("Postprocess");
. . .
LIKWID_MARKER_STOP("Postprocess");
. . .
LIKWID_MARKER_CLOSE;
  • Activate the macros by passing -DLIKWID_PERFMON as a compiler flag:

gcc -D_GNU_SOURCE -DLIKWID_PERFMON -I/ichec/packages/likwid/5.0.1/include -L/ichec/packages/likwid/5.0.1/lib test.c -llikwid
  • Run likwid-perfctr with the -m switch to enable the markers. Example:

likwid-perfctr -C 0-3 -g BRANCH -m ./a.out

For GPUs: run the command and measure the performance group FLOPS_DP on GPU 1 (only with the NvMarkerAPI):

likwid-perfctr -G 1 -W FLOPS_DP -m ./a.out

It is possible to combine CPU and GPU measurements (with the MarkerAPI and NvMarkerAPI):

likwid-perfctr -C S0:1 -g CLOCK -G 1 -W FLOPS_DP -m ./a.out

The current implementation of LIKWID (likwid 5.0.1) supports only GPUs with compute capability less than 7.0, so we cannot use GPU profiling on joules at the moment. Future releases may add support for devices with compute capability 7.5 and higher.

Usecase#

This is a simple illustration of likwid-perfctr. We run a simple file-read benchmark on joules to analyse the page cache usage.

Reminder: the page cache is a cache of pages in RAM. It contains chunks of recently accessed files. The goal of this cache is to minimise disk I/O by storing data in physical memory.

In this benchmark, we open a file with either O_RDONLY alone or with O_RDONLY|O_DIRECT, read the file and close it. Then, we analyse the memory usage for each case with likwid-perfctr.

Reminder: with the O_DIRECT flag, file I/O is done directly to/from user-space buffers, bypassing the OS read and write caches.

In this use case, the file size is 512 KB and the file is located on /mnt/nvme. The file is loaded into the page cache prior to execution with vmtouch -t. To check:

vmtouch /mnt/nvme/testpc.txt
           Files: 1
     Directories: 0
  Resident Pages: 353/353  1M/1M  100%
         Elapsed: 0.000184 seconds

Reminder: vmtouch is a tool for learning about and controlling the file system cache of Unix and Unix-like systems. On joules, type module load usefultools.
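As a side note, the essence of what vmtouch reports can be reproduced with a few lines of C using mincore(), which returns a residency byte per page of a mapping. This is only an illustrative sketch, not how vmtouch is implemented:

/* Count how many pages of a file are resident in the page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    long pagesize = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesize - 1) / pagesize;

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* mincore() fills one byte per page; bit 0 set means "resident". */
    unsigned char *vec = malloc(npages);
    mincore(map, st.st_size, vec);

    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;

    printf("Resident pages: %zu/%zu\n", resident, npages);

    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}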

We tested with both wrapper mode and the marker API; the marked region includes only the read() call. We compile and execute on core 0 of socket 0 as follows:

gcc -D_GNU_SOURCE testpc.c
likwid-perfctr -C S0:0 -g MEM ./a.out
# Marker API:
gcc -D_GNU_SOURCE -DLIKWID_PERFMON -L/ichec/packages/likwid/5.0.1/lib -I/ichec/packages/likwid/5.0.1/include testpcm.c -llikwid
likwid-perfctr -C S0:0 -g MEM -m ./a.out
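For reference, the core of the benchmark might look like the following sketch. This is a hypothetical reconstruction, not the exact testpc.c/testpcm.c used above: the file path and region name are our choices, and note that O_DIRECT additionally requires an aligned buffer.

/* Sketch of the file-read benchmark (compile with -D_GNU_SOURCE for
   O_DIRECT and with -DLIKWID_PERFMON to activate the markers). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#ifdef LIKWID_PERFMON
#include <likwid.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(tag)
#define LIKWID_MARKER_STOP(tag)
#define LIKWID_MARKER_CLOSE
#endif

#define FILESIZE (512 * 1024)  /* the 512 KB test file */

int main(int argc, char *argv[])
{
    /* Pass any argument to switch from O_RDONLY to O_RDONLY|O_DIRECT. */
    int flags = (argc > 1) ? (O_RDONLY | O_DIRECT) : O_RDONLY;
    int fd = open("/mnt/nvme/testpc.txt", flags);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT needs the user buffer aligned to the logical block
       size; 4096 bytes is a safe choice here. */
    char *buf;
    if (posix_memalign((void **)&buf, 4096, FILESIZE)) return 1;

    LIKWID_MARKER_INIT;
    LIKWID_MARKER_START("read");   /* marked region: only read() */
    ssize_t n = read(fd, buf, FILESIZE);
    LIKWID_MARKER_STOP("read");
    LIKWID_MARKER_CLOSE;

    printf("read %zd bytes\n", n);
    free(buf);
    close(fd);
    return 0;
}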

The table below shows the values of the MEM group metrics for each flag setting in both likwid-perfctr modes.

Metric                              O_RDONLY    O_DIRECT    O_RDONLY (marked)  O_DIRECT (marked)
Runtime (RDTSC) [s]                 0.0018      0.0132      0.0003             0.0116
Runtime unhalted [s]                0.0002      0.0002      8.511917e-06       1.004154e-05
Clock [MHz]                         2272.5066   2359.3045   2292.8411          1377.2915
CPI                                 1.9961      2.0068      3.0860             3.6406
Memory read bandwidth [MBytes/s]    56.5005     15.5389     58.8866            4.6827
Memory read data volume [GBytes]    0.0001      0.0002      1.785600e-05       0.0001
Memory write bandwidth [MBytes/s]   395.5765    23.8944     857.1268           4.0657
Memory write data volume [GBytes]   0.0007      0.0003      0.0003             4.723200e-05
Memory bandwidth [MBytes/s]         452.0770    39.4334     916.0134           8.7484
Memory data volume [GBytes]         0.0008      0.0005      0.0003             0.0001

Our main focus is the comparison of memory bandwidth between the two cases, in particular for the marker API mode. Whenever the kernel begins a read operation, for example when a process issues the read() system call, it first checks whether the requisite data is in the page cache. (Page cache search functions such as find_get_page() call radix_tree_lookup(), which searches the given tree for the given object.) If it is, the kernel reads the data directly out of RAM. If the data is not in the cache, the kernel schedules block I/O operations to read the data off the disk. After the data is read off the disk, the kernel populates the page cache with it so that any subsequent reads can be served from the cache.

With O_DIRECT, the read bypasses the cache and fetches the file data from disk; however, the file metadata is still read and written through the page cache, which is what we observe here as the 8.7484 MBytes/s of total memory bandwidth.