C++#

Profiling#

Google Benchmark#

google/benchmark

Google Benchmark is a microbenchmarking library for analysing small portions of code.

The idea of Google Benchmark is:

  1. Define a function that runs the code you wish to benchmark

  2. Register that function with a BENCHMARK macro defined by the library

  3. Provide a main function using the BENCHMARK_MAIN macro

  4. Compile and run the code

The library will run all defined benchmarks, ensuring that the code is run many times to reduce the impact of variance on the measurements. Note, however, that care must be taken to ensure that compiler optimisations don’t have an effect on what you’re trying to measure. Two convenience functions are provided that allow a user to prevent these optimisations:

  1. benchmark::DoNotOptimize - Prevent the compiler from optimising away variables

  2. benchmark::ClobberMemory - Perform pending writes to ensure the operation is observable

A simple example of benchmarking is shown below. However, for a full overview of the library’s features, including:

  • Output formatting

  • Filtering benchmarks to run

  • Passing arguments and ranges

  • etc.

see the GitHub page.

Quick Bench offers a web frontend, similar to Compiler Explorer, for quick and easy microbenchmarking.

The following example demonstrates how to effectively benchmark std::vector<T>::push_back:

#include <vector>
#include <benchmark/benchmark.h>

// Set a baseline on creating a vector
static void bench_create(benchmark::State &state) {
    for (auto _ : state) {
        std::vector<int> v;
        benchmark::DoNotOptimize(v.data());
    }
}
BENCHMARK(bench_create);

// Set a baseline for allocating memory within the vector
static void bench_reserve(benchmark::State &state) {
    for (auto _ : state) {
        std::vector<int> v;
        v.reserve(1);
        benchmark::DoNotOptimize(v.data());
    }
}
BENCHMARK(bench_reserve);

// Benchmark push_back
static void bench_push_back(benchmark::State &state) {
    for (auto _ : state) {
        std::vector<int> v;
        v.reserve(1);
        benchmark::DoNotOptimize(v.data());
        v.push_back(1);
        benchmark::ClobberMemory();
    }
}
BENCHMARK(bench_push_back);

// This macro generates the main boilerplate to run the benchmarks
BENCHMARK_MAIN();

Compiling the code with g++ push_back_bench.cpp -O3 -lbenchmark -lpthread (note that -lbenchmark should come before -lpthread, as some linkers resolve libraries left to right) and then running it will produce output similar to the following:

2020-06-05 13:49:59
Running ./a.out
Run on (4 X 2300 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
Load Average: 4.27, 3.74, 3.52
----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
bench_create          1.52 ns         1.35 ns    517472075
bench_reserve         82.9 ns         75.8 ns      9352662
bench_push_back       85.9 ns         73.8 ns      9483938

PERForate#

https://git.ichec.ie/performance/perforate

PERForate is a tiny performance measurement library. The idea is to include one header, drop PERFORATE_SCOPED_TRACE("my text") into a function, and have a measurement of its execution time printed after the program exits. This sidesteps a lot of issues with compiler optimizations: inlined functions, omitted symbols, and finding the right incantation for the compiler on your system. Each call to PERFORATE_SCOPED_TRACE has about 100 ns of overhead, so it can mostly be dropped in and ignored anywhere except hot loops.

This works nicely across macOS and Linux, with Clang and GCC, both locally and remotely. No special configuration, no debug symbols, no compiler flags, no post-processing tools, no magic, no tearing your hair out. It works the same at low and high optimization settings. The primary downside is that intrusive tracing means you need to modify your code. But let’s be honest: when you’re tracking down a performance bug, you’re going to be modifying your code anyway.

The nature of scoped tracing means the scope it’s tracing forms a basic block, which can’t be merged into other basic blocks during inlining. This can defeat optimizations that come from inlining. Fortunately, you can also define a macro before including PERForate to disable it completely. This replaces the function calls and classes with constructs that inline to nothing, re-enabling inlining optimizations once you’ve tracked down your performance bug.

The following example demonstrates how to time a simple program:

#include <perforate/scoped_trace.hpp>

#include <vector>

int main(int argc, char **argv) {
    PERFORATE_SCOPED_TRACE("my program");
    std::vector<int> v;

    for (int i = 0; i < 100; ++i) {
        v.push_back(i);
    }

    return 0;
}

Compiling the code with g++ main.cpp -O3 (assuming the library is installed; if not, set the include path as necessary with -I) and then running it will produce output similar to the following:

------------
Range Stats:
------------
my program: 9.288e-06 s, 9.288e-06 s/call, 1 call(s)

------------

For a more complex example:

#include <perforate/scoped_trace.hpp>

#include <numeric>
#include <vector>

bool do_something() {
    PERFORATE_SCOPED_TRACE("do_something");

    // Just some busy work that won't get optimized away...
    std::vector<int> v;
    for(int i=0; i<100000; i++) {
        v.push_back(i);
    }

    const int acc = std::accumulate(v.cbegin(), v.cend(), int(0));

    return acc % 2 == 1;
}

int main() {
    PERFORATE_SCOPED_TRACE("my program");

    for(int i=0; i<1000; i++) {
        if(do_something()) {
            return 1;
        }
    }

    return 0;
}

This will generate output similar to the following:

------------
Range Stats:
------------
my program: 0.120554 s, 0.120554 s/call, 1 call(s)
do_something: 0.120498 s, 0.000120498 s/call, 1000 call(s)

------------

Symbols and Linkers#

Conceptual Model of a Compiled Code#

Say we have the following C program:

int f(int x){
    int y = x;
    int z = y + y;
    return z;
}

int main(int argc, char* argv[]) {
    int a = 3;
    int b = f(a);
    return b;
}

we note it has two “references”, f and main. When compiled, it is just a binary blob of 1s and 0s; however, somehow it knows there is a function called main which calls a function f. The binary will look something like the following:

...
< bits representing f(x) = x + x >
...
< bits representing main() = f(3) >
...

It becomes clear that the section for main must know where the section for f is stored so it can call it.

In particular, it will need to know

  • how to pass arguments to f

  • how to call f

  • how to get a return value from f

  • how to get f to return execution to main after it’s done

This is all done by using registers and the stack to set up an environment for f to execute in and return from, and then jumping to f.

In some more detail, the x86 (i.e. 32-bit) assembly for these sections will look something like the following (courtesy of godbolt.org):

f:
        ; setup local stack
        push    ebp      ; save caller's frame (base) pointer

        mov     ebp, esp ; setup current frame

        sub     esp, 16  ; reserve 16 bytes for local variables

        ; int y = x;
        mov     eax, DWORD PTR [ebp+8] ; x => ebp[+8]
                                       ; Set general purpose register eax to x

        mov     DWORD PTR [ebp-4], eax ; y => ebp[-4], set y to value of eax
        
        ; int z = y + y;
        mov     eax, DWORD PTR [ebp-4] ; Set `eax` to value of y

        add     eax, eax               ; add eax to eax, i.e., y + y

        mov     DWORD PTR [ebp-8], eax ; z => ebp[-8]
                                       ; Set z to result of y + y

        ; return z;
        mov     eax, DWORD PTR [ebp-8] ; Set return value, eax, to value of z
        
        ; cleanup and return
        leave ; cleanup stack positions and frames set at start of function

        ret   ; jump to return address stored in stack

main:
        ; setup local stack
        push    ebp      ; save caller's frame (base) pointer
        
        mov     ebp, esp ; setup current frame

        sub     esp, 16  ; reserve 16 bytes on the stack for local variables (a and b)

        ; int a = 3;
        mov     DWORD PTR [ebp-4], 3 ; a => ebp[-4], set a = 3

        ; int b = f(a);
        push    DWORD PTR [ebp-4] ; push a onto the stack as first parameter of `f`

        call    f ; jump to `f`
                  ; implicitly pushes the current instruction address to the stack

        add     esp, 4 ; pop first parameter of `f` from the stack

        mov     DWORD PTR [ebp-8], eax ; b => ebp[-8]
                                       ; Set b to return value of `f`, stored in `eax`
        
        ; return b;
        mov     eax, DWORD PTR [ebp-8] ; Set return value, `eax`, of `main` to value of `b`
        
        ; cleanup and return
        leave ; cleanup stack position and frame setup at start of function
        
        ret ; Jump to return address stored in the stack

In x86, function parameters are set up by pushing them onto the stack, and the return value is stored in the eax register. Each function manages its stack locally by saving the previous state on entry, doing its work, and restoring the state on exit.

For our mental model of a program, it’s sufficient to assume arguments get pushed onto the stack. This isn’t always the case: in x86_64, for example, the first several integer arguments are passed to functions through registers instead of the stack. Similarly, different optimization levels will produce different assembly, and different languages can have different calling conventions.

It’s worth noting, in particular, there is no mention of data types for variables in the assembly. Type-specific operations (e.g. integer add or floating point add) are performed on registers assuming they’ve been setup with appropriate bit patterns.

From this, we can note several things:

  1. We need to know the signature of a function at the call site to know how to set up the stack and registers to call it.

  2. The function knows nothing about the call site and assumes all arguments were setup correctly.

  3. We need to know where the requested function is located in the binary so we can jump to it to execute it.

  4. We need to know where the current function is located in the binary so a return address can be set to jump back after a function call.

Until we have visibility of the entire program, we cannot know where the functions will end up in the final binary. If every source file is specified on the command line, this isn’t a problem, as the compiler has visibility of the entire program. However, if the source files are each compiled separately, an intermediate representation must be used where the addresses of functions remain variable values to be filled in. This is stored in the form of key-value pairs, where the key is called a “symbol”, and the value is the address.

Translation Units and the Linker#

A source file compiled separately from other source files is called a “Translation Unit” (TU). Each TU typically defines the body of a number of functions, providing the “definition” for function “declarations” provided in the header files. When a TU defines the body of a function, it generates a symbol for that function, which is typically globally visible. When linking a program into an executable, the linker gathers all the symbols from all the compiled TUs passed to it, and resolves all references to those symbols throughout the program. Resolving a reference to a symbol typically means replacing it with an address if the symbol is defined at link time, or wiring up calls to external symbols in the case of library calls.

If two TUs define the same function, and the function isn’t a template, declared inline, or declared static, you’ll typically get a linker error complaining about multiple definitions of a symbol. In this case, there are two equally valid candidate symbols / addresses for the linker to choose from, so it chooses not to make a decision. To get the linker to make a decision, the functions must be marked inline. If two TUs define the same template or inline function, both definitions must be exactly the same, or you’ve got a One Definition Rule (ODR) violation. ODR is C++ parlance, but you can find similar issues in C.

In particular, when multiple TUs provide definitions for template or inline functions, they produce “weak” symbols for the definition. When the linker encounters multiple instances of the same symbol, and they’re all weak symbols, it will choose one (in an implementation-defined manner) and discard the rest. This is one of the reasons for the ODR: you don’t know which version of the function will be chosen, so they should all be identical.

If you’ve ever wondered why template-heavy C++ code takes a long time to compile, it’s often because the compiler is re-compiling the same template functions over and over again! Some of this time can be mitigated by using extern template, which skips instantiation and compilation of the template in a TU until an explicit instantiation appears, typically in just one TU. This is often seen in large industry codebases, and is typical in MSVC projects. This also touches on the topic of “precompiled headers”, which we won’t elaborate on here.

Inlining#

Exceptions to notes 1 to 4 are inlining and cloning. If the compiler can see the body of the function from the call site, it can determine whether it is more efficient to copy the entire body of the function to the call site than to actually call it. It can also determine whether a small modification to the function’s interface and body might make it more efficient, in which case it may make a “clone” of the function with these modifications applied.

These are the basis of inlining, Return Value Optimization (RVO), and Named Return Value Optimization (NRVO). This typically occurs with templated code and static functions, where the function body is defined in the TUs it’s used in. When C++ people talk about “zero-overhead abstractions”, they’re typically talking about constructs that make nice interfaces and can be completely inlined by the compiler.

Inlining, RVO, and NRVO#

With “inlining”, the entire body of the function is copied into the call site, with the function parameters being replaced directly with the parameters it would have been called with. This allows the compiler to work with a larger “basic block”, allowing it more freedom to re-structure and re-arrange generated code into a more optimal configuration. The downside of inlining is increased function size and code size. The increase in code size can be an issue as the instructions for the entire function may begin spilling over into extra cache lines.

With “RVO” and “NRVO”, the compiler adds an extra, hidden, parameter to the function prototype: a pointer to the return value. This allows the compiler to replace any references to a returned value/variable with a direct reference to the object allocated on the calling function’s stack. This can save on double initialization and unnecessary copies when returning large/complex objects like std::array and std::vector. The C++ “as-if” rule allows it to eliminate extra calls to copy constructors, regardless of side-effects, as long as the semantics of “copying” are preserved.

Symbols and Mangling#

Symbols are used as a mapping between a function or variable name and the address in the binary where it starts. The symbol generated typically reflects the name of the function or variable. For C, the symbol name is typically 1-to-1. For other languages, some “mangling” often occurs.

C#

A function f generates a symbol f. Note: this symbol carries no information about the return type or argument types! This is one explanation for why headers are needed to use C libraries: without the function prototype, the compiler wouldn’t know how to call the function at the call site! Even if it knew which registers to use and how much stack to allocate, it would still need to know the types of the arguments so it can set the stack variables and registers to the correct bit patterns. Consider 32-bit integer and 32-bit floating point types as an example: both are stored in the same type of register, but they represent very different things.

It is, in fact, perfectly possible to write your own function prototype, and call a library function without including its headers. In some cases, for example calling Fortran routines from C, it’s a necessity! But be careful! If you get the types or the number of arguments wrong, you can end up messing up your stack for the rest of the program, or even crashing your computer!

Symbol Versioning#

There are some exceptions to C’s simple view of symbol generation.

As an example, glibc, the GNU C runtime, provides versioned symbols, allowing for backward link compatibility. This allows glibc to provide bug fixes and improvements without breaking programs which were compiled expecting those bugs. In particular, if glibc is updated on a system, nothing is broken.

This is important to note when compiling programs to run on older systems. If you compile with a more up-to-date glibc, it’s quite likely your program will be looking for some newer versioned symbols. This will be evident if you see errors like

./myapp: /lib/i686/libc.so.6: version `GLIBC_2.3' not found (required by ./myapp)
./myapp: /lib/i686/libpthread.so.0: version `GLIBC_2.3.2' not found (required by ./myapp)
...

The solution is to compile with an older version of glibc. And since glibc is very tightly coupled to the distro, compiler, Linux, et al., the real solution is to install an older version of Linux in a VM and compile your code there.

You may be able to compile it in an older distro in Docker, but I haven’t tried it. If you do try it, and have success, please update this section!

Fortran#

For Fortran, the mangling is light. A subroutine f will typically generate one of f_, F_, or f. As Fortran is case-insensitive, the case of the generated symbol is not guaranteed. This is entirely compiler dependent, and there are often compiler flags to change the symbol generation.

Modules introduce an extra level of mangling. A function f in a module m may produce a symbol __m_MOD_f.

Modern Fortran standards have introduced the bind(c) attribute to allow the user to effectively control the symbol name generated for the given function! This is very useful for interfacing with C without needing to guess the Fortran mangling or setting compiler-specific options.

With Fortran, unlike C, the “header” file is generated from the source. These are the .mod files. They similarly contain information about how the function might be called. However, since they’re compiler generated, there’s some more scope for the compiler to provide more information about the function, and optimize how exactly symbols might be called.

C++#

For C++, name mangling is a whole topic by itself.

There are many, many layers to it, all depending on your compiler, OS, and ABI model. This is, in fact, intentional. Bjarne Stroustrup himself said compilers should not aim to generate the same symbol names, to discourage people from trying to link code compiled with different compilers together.

In practice, however, most sane compilers follow the Itanium C++ ABI. Although, in reality, they follow whatever GCC is doing, and make a strong effort to remain inter-compatible. It is by sheer force of will that you can compile a library with the Intel C++ compiler and link it to an executable compiled with Clang++. There is nothing in the standard which expects or requires that to work, despite a lot of focus placed on ABI compatibility between versions of the C++ standard.

Worth a mention, as always, is the Microsoft Visual C++ compiler (MSVC; the actual executable is cl.exe), which does its own thing and will not really work with anything else. However, this situation is rapidly changing. The MSVC compiler team has made very strong efforts in bringing the compiler into compliance, and is doing great work for the C++ community. LLVM has also put a lot of effort into making a cl.exe-compatible frontend for Clang++.

The items that appear in a mangled C++ symbol are namespaces, scopes, template types and values, function parameter types, and some other secret sauces. This allows for function overloading to happen in C++, where two functions of the same name can be called with different types and template parameter values. One conspicuously missing item is the return type. This means functions can’t be overloaded based just on different return types.

Due to all the information baked into a mangled C++ symbol, it is actually possible to recover nearly the whole function prototype, except for parameter names, the return type, and a few other qualifiers (e.g. constness and, pre-C++17, noexcept).

Demangling#

When looking at C++ symbols, whether from a debug trace, or directly from the library, it’s often a good idea to use c++filt. It will turn symbol soup into beautiful function prototypes.

Sometimes, however, the function prototypes are too big and beautiful. In this case, you may want to write a script to further filter symbols by replacing template parameters with <>. One of the main complaints people have about C++ is the pages of output generated for compiler errors, particularly in heavily templated code; scraping out all the templates can help.

Demangling in MacOS and -_#

On some systems, name mangling is just a little bit different. If your c++filt doesn’t seem to be transforming mangled symbols into prototypes, try adding the -_ flag. This tells c++filt that the name mangling contains an extra _. This can be an issue on macOS when mixing AppleClang, LLVM, and GNU tools, along with performance and debugging libraries that try to provide function traces.

If a tool is clearly using c++filt, but doesn’t provide a way of specifying flags, or a way of specifying your own, try making a directory /path/to/here/bin with a bash script named c++filt in it:

#!/usr/bin/env bash
exec /usr/local/bin/c++filt -_ "$@"

and setting your PATH=/path/to/here/bin/:${PATH} before running the program.

This can be a generally useful trick for hijacking commands in scripts that don’t allow sufficient customization. It can, in particular, be useful for stripping nuisance flags from compile commands, or adding some in at just the right place. Be careful, though: this will make your scripts very brittle.

Further Reading#