Linux Trace Toolkit Next Generation - LTTng - is a powerful tracing framework providing flexible kernel- and user-space tracing tools. It is quite easy to configure and use, has wonderful documentation, produces traces in the Common Trace Format (CTF) and, together with TraceCompass, allows one to quickly dig into complex performance issues.

LTTng User Space Tracing

LTTng provides a User-Space Tracing Library which can integrate directly with C/C++, Java and Python applications. Java and Python integration is done through logging APIs. The C/C++ implementation, however, requires additional instrumentation of the program code. Users of LTTng-UST in C++ create tracepoints at strategic places in their system; these tracepoints then remain in the program and can be activated when necessary via the LTTng command line interface.

Although C/C++ integration is not done through a logging facility (there is no standard one), it has the advantage of structured tracing: every event is a set of typed fields, whose types closely resemble those of C. This pays off in trace size and overall efficiency - for example, where a Python logger would convert everything to a string, incurring conversion and storage overhead, C/C++ can write data directly into the trace, correcting only endianness and alignment where necessary.
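
To make this concrete, here is a minimal sketch of what such a tracepoint definition looks like with the classic LTTng-UST macro API; the provider, event and field names below are made up for illustration, only the macros themselves come from lttng-ust:

    // benchmark_tp.h - tracepoint provider header (illustrative names)
    #undef TRACEPOINT_PROVIDER
    #define TRACEPOINT_PROVIDER bench_provider

    #undef TRACEPOINT_INCLUDE
    #define TRACEPOINT_INCLUDE "./benchmark_tp.h"

    #if !defined(BENCHMARK_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
    #define BENCHMARK_TP_H

    #include <lttng/tracepoint.h>
    #include <stdint.h>

    // Each event is declared once as a set of typed fields.
    TRACEPOINT_EVENT(
        bench_provider,              /* provider name */
        random_number,               /* event name */
        TP_ARGS(uint64_t, value),
        TP_FIELDS(
            ctf_integer(uint64_t, value, value)
        )
    )

    #endif /* BENCHMARK_TP_H */

    #include <lttng/tracepoint-event.h>

In the instrumented code, the event is then emitted with tracepoint(bench_provider, random_number, x); and can be enabled or disabled from the lttng command line without rebuilding the application.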

LTTng User Space Library. Source: https://lttng.org/docs/v2.10/#doc-lttng-ust Copyright © 2014 The LTTng Project

Static vs Dynamic

Although dynamic tracing provides great advantages for ad-hoc, low-overhead instrumentation in production environments, I see it as a precision tool which is not easy to use when one needs to see the bigger picture.

A good example would be a complex concurrent system consisting of a few dozen modules that is not working as efficiently as expected. There are spikes in resource utilization, there are a few known bottlenecks, but in general everything takes more time than allowed. In this case, collecting traces from multiple tracepoints at the same time and collating them with events from additional sources and with resource utilization data can expose patterns that would otherwise remain invisible.

Instrumentation and Tracing Overhead

Compared to their dynamic counterparts, static tracepoints are present in the instrumented code regardless of whether tracing is running, adding some overhead to the system: the instrumentation overhead. This overhead is negligible most of the time and becomes noticeable only when tracepoints are executed very often.

When tracepoints are enabled, however, the tracing overhead gets more pronounced as events are collected. LTTng is a high-performance tracer, and it provides great flexibility with its session and channel concepts, which encapsulate configurable pools of ring buffers.

LTTng Concepts. Source: https://lttng.org/docs/v2.10/#doc-core-concepts Copyright © 2014 The LTTng Project

Users can configure the size and number of ring buffers per channel, the policy for buffer overflow (blocking, discard or overwrite), as well as other, more specific options. According to the LTTng documentation, tracing performance greatly depends on the subbuffer size and count and on their relation to the frequency of incoming events.

Quantifying Performance Loss

Two questions arise:

  • How high is the instrumentation overhead? Does one need to #ifdef the tracepoints out for production builds, or can one keep them? How often would tracepoints have to execute for the overhead to become noticeable?
  • How high is the tracing overhead? Is it still practical to use LTTng-UST tracepoints, for example, in algorithmic code executing a few billion tracepoints per second? Of course, such usage in production code is not recommended, but during development such structured tracepoints can be used as debug tools, replacing the more classical printf approach.

After failing to find any information that would help answer these questions, I realized I would have to collect the data myself. Below I will share some numbers I got when attempting to quantify the instrumentation and tracing penalties resulting from the use of LTTng-UST. Tracepoints were executed in a tight loop, surrounded by a minimal amount of computation, to stress the tracing infrastructure.

Benchmark Setup

At first I wanted to keep things really simple, but over time, as I gained more insight, the setup became a little more involved. The synthetic benchmark consists of a small C++ command line application that integrates with the LTTng-UST shared library (using a shared library for the tracepoint provider), and a Python instrumentation script that orchestrates the execution.
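
For the shared-library tracepoint provider mode, the provider package is compiled into its own shared object which the benchmark application links against. A minimal sketch, again with illustrative file names and assuming the provider header shown earlier:

    // benchmark_tp.cpp - tracepoint provider package, built into a shared
    // library (e.g. libbenchmark_tp.so) that links against liblttng-ust.
    #define TRACEPOINT_CREATE_PROBES
    #define TRACEPOINT_DEFINE

    #include "benchmark_tp.h"

The benchmark application itself then only includes benchmark_tp.h and calls the tracepoint() macro.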

Methodology

For every sample obtained, the following was executed:

The Outer Loop:

  • Started an lttng-sessiond instance in the background
  • Created a tracing session (lttng create) and enabled a userspace channel (lttng enable-channel --userspace with --subbuf-size and --num-subbuf)
  • Activated the userspace tracepoint (lttng enable-event --userspace)
  • Started the traced application running the inner loop, with perf stat attached
  • Stopped tracing (lttng stop) and destroyed the session (lttng destroy)
  • Terminated lttng-sessiond
  • Counted the obtained trace events with babeltrace

The Inner Loop:

  • Repeatedly generates a given number of random uint64 values with boost::random::mt19937
  • Triggers a tracepoint for each generated number
  • Times the full generation loop with std::chrono::high_resolution_clock (a simplified sketch of this loop follows below)
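
A simplified C++ sketch of this inner loop, assuming the illustrative provider from the setup section (not the exact benchmark code):

    #include <chrono>
    #include <cstdint>
    #include <iostream>

    #include <boost/random/mersenne_twister.hpp>
    #include <boost/random/uniform_int_distribution.hpp>

    #include "benchmark_tp.h"

    int main()
    {
        const std::uint64_t iterations = 1ULL << 24;  // 2^24 with tracing enabled

        boost::random::mt19937 rng;
        boost::random::uniform_int_distribution<std::uint64_t> dist;

        const auto start = std::chrono::high_resolution_clock::now();
        for (std::uint64_t i = 0; i < iterations; ++i) {
            const std::uint64_t value = dist(rng);              // minimal computation per event
            tracepoint(bench_provider, random_number, value);   // one event per generated number
        }
        const auto stop = std::chrono::high_resolution_clock::now();

        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        std::cout << "average ns per iteration: "
                  << static_cast<double>(ns) / iterations << std::endl;
    }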

Parameters

For the inner loop, inside the C++ code, the number of iterations was set to either \(2^{29}\) with tracing disabled, or \(2^{24}\) with tracing enabled. This resulted in a loop taking 1-2 seconds, which should be enough to correctly estimate the average time per iteration.

The outer loop, comprising the entire instrumentation, was executed 100 times for the non-tracing scenarios, and 10 times for each inspected LTTng configuration.

LTTng by default operates in discard mode, so events that cannot be processed in time are dropped. This mode was used in the benchmark.

Where the tracing session was enabled, the number of subbuffers was varied in powers of two from \(2^{1}\) to \(2^{8}\), and the subbuffer size from \(2^{12}\) bytes (4KB) to \(2^{28}\) bytes (256MB).

Hardware and Software

Benchmarks were executed on a workstation with:

  • Ubuntu 16.04 LTS with HWE stack, running Kernel 4.13.0-36
  • LTTng 2.9.5
  • Intel(R) Xeon(R) CPU E5-1660 v4 @ 3.20GHz
  • Total 64GB of ECC RAM @ 2400 MHz
  • Ext4 formatted SATA3.2 SSD with Block Size 4KB

Results

First, let’s have a look at the baseline measurement. Here, the tracing code was not compiled into the C++ executable.

On average, a single iteration of the loop took 2.7ns.
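
As a quick cross-check against the methodology: \(2^{29}\) iterations at 2.7ns each is roughly \(5.4 \times 10^{8} \times 2.7\,\text{ns} \approx 1.4\,\text{s}\), which matches the 1-2 second loop duration mentioned above.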

Now, let’s compare this to the instrumented code. Here, the tracing code was compiled into the C++ executable, but no tracing session was enabled.

Clearly, the instrumentation overhead is visible and, for an operation as simple as random number generation, quite substantial. On average, a single iteration now takes 3.4ns. The overhead of 0.7ns is on the order of an L1 cache miss; even though non-zero, it most likely won’t be noticeable in real systems where traced events are much less frequent.

Now, let’s have a look at what happens if we enable tracing. The following chart shows samples obtained from all tracing sessions, with the number and size of LTTng subbuffers varied.

The tracing time distribution is clearly bimodal. While most samples are clustered around 113ns - a substantial change from the previous result - there is a second cluster centered at around 5ns. Which number should we trust? Let’s look at the dependency between tracing time and session parameters:

Clearly, for large buffer sizes, the overhead goes down from 113ns (the order of a main memory reference) to 5ns (the order of a cache miss or branch misprediction). The more subbuffers are allocated, the smaller the subbuffer size at which this boost appears. Note that both axes are logarithmic, that is, values on the grid increase exponentially.

One possible conclusion: make the buffers large enough to hold the entire burst of events, and the overhead goes down by a factor of 20. This, however, is too good to be true. There is no free lunch in, well, anything. And this becomes visible when one looks at the number of events captured:

Clearly, at high buffer sizes/counts, LTTng gives up. I am not sure what exactly this is related to, as I was not able to see any feedback. Memory consumption suggests that the buffers are allocated (or at least that allocation is attempted), the program executes, no error codes are returned and no exceptions are raised. Simply no traces are left on the hard drive. My assumption is that LTTng tries to allocate the necessary amount of memory but fails to. Thus, the tracepoints, even though activated, have nowhere to write and return early (but do so a little later than inactive ones, hence the difference between 3.4ns and 5ns).
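
A back-of-the-envelope check makes the allocation-failure hypothesis plausible: at the largest configuration, \(2^{8}\) subbuffers of \(2^{28}\) bytes each add up to \(2^{36}\) bytes, i.e. 64GB for a single channel - as much as the machine's total RAM - and that is before accounting for the fact that LTTng typically allocates ring buffers per CPU.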

Note that even when traces are captured, not all of them end up recorded. If we now look again at the variation in tracing times and events captured side by side, putting less emphasis on outliers, we will see clear patterns:

As the subbuffer size goes up, event loss drops to exactly zero (the exact boundary depends on the number of subbuffers). In turn, the overhead falls from its maximum of around 125ns per operation to its minimum of about 113ns. So, well-tuned parameters yield not only a higher event capture rate, but also higher overall performance during tracing.

Summary

When LTTng-UST was used for very high-frequency events (the traced operation itself taking ~2.7ns), the following was observed:

  • 0.7ns average instrumentation overhead per tracepoint execution.
  • 110ns average tracing overhead per tracepoint execution on a tuned setup.
  • 125ns average tracing overhead per tracepoint execution in the presence of dropped events.
  • Large subbuffer sizes and/or subbuffer counts could lead to no traces being obtained at all, probably due to subbuffer allocation failures; the boundary likely depends on total system memory.

Although these numbers will not be valid for a different system, they give a ballpark estimate of the tracing overhead on server-class hardware:

  • For an inactive tracepoint, the slowdown is on the order of an L1 cache miss.
  • For an active tracepoint, the slowdown is on the order of a main memory reference.

Follow Up

I could think of a few ways to follow up to get a better picture of LTTng performance:

  • Vary event sizes, to better estimate the overhead per event.
  • Run the benchmark on low-end or embedded hardware (a Raspberry Pi, for instance), to better understand the impact on embedded systems, as both IO and memory access become much more expensive there.
  • Try other tracepoint provider linking modes, to see whether the overhead depends on the integration model.

If I still have any free time on my hands, I’ll try to follow up on them :wink:. So, stay tuned, and thanks for reading!
