Audio profiling


With the release of 0.22 Rodio got support for microphone input. Now we can use the library for all kinds of new applications like speech to text, recording, and live streaming! I’ve tried live streaming guitar play over Christmas, it even sounded better than Discord as we did not add any compression.

Live-streaming must minimize audio latency, especially when streaming both ways like in VoIP. Since there is a trade-off between latency and the amount of effects and sounds we can play, even non latency critical applications can benefit from overall lower latency. Latency is something the team has not focused on before, let’s change that! Especially as we’re “secretly” playing around with rather big changes to the audio engine (see: github.com/RustAudio/rodio-experiments).

Background

It’s quite difficult to lower audio latency, to not get this discussion to get lost in the weeds so we will discuss this using two use-cases. One of them will be VoIP, as especially group calls form a nice trying use-case. The other would ideally be games since they also need low audio latency which can get challenging given they can have a myriad of effects applied on hundreds of sounds. Making matters worse these effects and sounds are added and removed all the time causing the computational load to change wildly. A large game is hard to reason about, instead we will use an instrument: a simulated piano. You can press all the keys at once (on a real piano you may need help from a friend) adding a lot of sounds simultaneously. Our piano simulates nine overtones for three strings per key press. If you press all 88 keys at once that means 2376 sine waves need to be combined! Each sine wave generates 44100 samples per second, quite the increase in computational load.

OS audio interface

What are these samples that we generate and how do they get turned into audio? A speaker produces sound by moving a piece of paper or cloth - the diaphragm - back and forth in a precisely controlled way. The better that control - the better the speaker sounds. An audio sample describes the position the diaphragm should have at a single point in time. If we fail to provide the hardware with a sample on time, it’s called an underrun. At that point the hardware does not know what the next diaphragm position should be. Instead, it will pick something, that something is usually very wrong. You hear that as an unpleasant sound (this can be very loud which is especially fun with headphones on). We should therefore strive to always provide the samples on time. The OS is responsible for getting those samples to the hardware, we use the cross-platform audio library (CPAL) to interface with the OS.

The hardware gets samples from the OS in groups, each group is called a “buffer” of samples. We pass samples to the OS like that as well. With some control over the buffer size it is trivial to prevent underruns, simply use a huge buffer. That however means our samples will be heard quite a bit later than when we generated them, introducing latency. Picking the optimal buffer size gets more complex as the computational load of the audio pipeline varies more. In our piano example, when we are playing normally, the CPU can safely throttle down to save power. Assuming normal play engages no more then five keys at the time, pressing all keys at once instantly increases the computational load by a factor of 16. The throttled down CPU will likely not be able to keep up anymore, while the system scheduler will rapidly ramp up the CPU in response that can not be instant [1]. Unless we have a large enough buffer of samples to bridge that ramp-up period we’ll have an underrun. Our other use-case - VoIP - does not suffer from this issue given the chance is negligible that a bunch of new participants join at the exact same millisecond. Even if that did happen it could be mitigated by delaying subsequent participants joins by a few milliseconds which nobody would notice. This applies to the piano as well: we could delay rapid key presses but that does not extrapolate to more complex use cases for which the piano stands.

Audio pipeline

Now that we know how and why the audio interface contributes latency, let’s look at the part we can improve: the audio pipeline itself. It consists of a source of samples, zero or more effects and an output all chained together. Some of the effects contribute latency directly while others might have inconsistent performance which requires buffering to smooth out.

Let’s take the simple audio pipeline below to see how effects can contribute latency directly:

Sine generator              no buffering
    -> Effect               buffering 4k
    -> OS                   buffering 3k
    -> hardware             buffering 2k
    -> speaker (analog)     no buffering

This pipeline plays a sine wave with some effect applied to speakers. The sine generator itself contributes no latency. The audio pipeline has a single effect which buffers 4000 samples. Then the audio is pulled into the OS buffer and finally the hardware one. Many effects need to buffer because they can only operate on groups of samples, some examples are resampling and noise cancelling.

So how do samples move through these buffers? Like the OS audio APIs, Rodio’s audio pipeline is pull based. It is therefore lazy and samples are only generated upon request. Going through our example:

  1. The OS either gets an interrupt or notices the hardware needs samples while polling. The OS has 3k samples ready.
  2. The OS copies over some of those samples (a period) from it’s buffer to the device.
  3. The OS runs Rodio’s callback N times to get more samples from our audio pipeline.
  4. Effect is called N times to fill the OS buffer.
  5. Each time effect is called it either:
    • returns a sample from its internal output buffer.
    • calls the sine generator 4000 times to fill its input, runs the effect filling its output buffer, then returns the first sample from that.

Let’s say we run this on a perfect system where every calculation happens instantly, a reasonable approximation for modern a system (I am a recovering physicist /j). Such a perfect system is ironically also the worst case latency wise, we’ll see why in just a bit.

Now I’ve expanded the example pipeline, adding volume control and a second buffering effect. This will help introduce the concept of control latency and show how multiple buffers affect latency. The buffer sizes are unrealistically small so we can easily see how their state changes during playback. We will trace the journey of four samples through the pipeline. The HW and OS each have a buffer of two samples and the effects both need to buffer four samples.

Sine  Volume  Effect A     Effect B     OS       HW       speaker
[] -> [] ->   [       ] -> [       ] -> [4 3] -> [2 1] -> playing 0
[] -> [] ->   [       ] -> [       ] -> [4 3] -> [  2] -> playing 1
[] -> [] ->   [       ] -> [       ] -> [  4] -> [3 2] -> playing 1
# `Effect B` calls `next` once on `Effect A`
# `Effect A` in turn calls 'next' four times on `Volume` to fill it's buffer.
# `Volume` in turn calls next on Sine. Neither of those buffer.
[] -> [] ->   [8 7 6 5] -> [       ] -> [  4] -> [3 2] -> playing 1
# `Effect B` called `next` four times to fill its buffer
[] -> [] ->   [       ] -> [8 7 6 5] -> [  4] -> [3 2] -> playing 1
[] -> [] ->   [       ] -> [  8 7 6] -> [5 4] -> [3 2] -> playing 1
[] -> [] ->   [       ] -> [  8 7 6] -> [5 4] -> [  3] -> playing 2
[] -> [] ->   [       ] -> [  8 7 6] -> [  5] -> [4 3] -> playing 2
[] -> [] ->   [       ] -> [    8 7] -> [6 5] -> [4 3] -> playing 2
[] -> [] ->   [       ] -> [      8] -> [7 6] -> [5 4] -> playing 3

The number of samples in both effects’ buffers varies between zero and four, the HW and OS buffers, however try to stay as full as possible. Since we have a perfect system they succeed at that, they are therefore completely filled before the next sample is played. Now on an imperfect system (physicists nightmare) the speed at which the audio pipeline runs can fluctuate. Usually it will run faster than the speakers need samples, but at times it might run slower. As mentioned before a large enough OS buffer will bridge this gap by providing the samples the hardware needs during slowdowns.

The buffers are therefore fullest on a well running system which is why that is when we get the worst latency. Maximum latency is fully determined by the HW, OS and any effects that buffer. Up till now we’ve only seen chunking effects: these act as transformations on a group of samples without any overlap. There are also windowing effects which maintain a moving window of samples.

Let’s look at the trace above, note how the second chunking effect contributes no extra latency! It turns out we can generalize this, the latency - in samples - added by multiple chunking effects is: (largest // second largest) * second_largest. With // denoting integer division, largest being the size of the biggest buffer and second_largest the size of the second largest buffer. This even holds if there are other effects in between.

Windowing effects always add the full size of that window to total latency. The effect on latency of other buffering behavior is left as an exercise to the reader :).

Lowering latency

That was a lot of background on Rodio’s audio pipeline. Before we discuss how to achieve the lowest total latency it’s worth talking a little about control latency: the time it takes between adjusting some effect and hearing the result. In the previous section when we traced how samples went through the pipeline we put the volume control at the very beginning. That meant there was a full effect buffer of latency between changing volume and the output to the OS. If we simply move the volume control to the end of the pipeline that extra control latency is gone.

One part of minimizing total latency is being careful with buffering effects. You always want to minimize the number of windowing effects, for chunking ones it is more complex. There it can even be worth it to increase some buffer sizes! Take VoIP there you’ll always need a resampler. A high quality resampler transforms groups of samples, so it is a chunking effect. It’s nice to offer echo and noise cancelling as well. We can actually set those up to add no extra latency! First we need to use chunking implementations for them so that the equation from the previous section applies. If we could treat the integer division as normal division the equation would reduce to just largest, which holds integer division has no remainder. When that is true second_largest does not contribute to total latency! We can not help the need for windowing and chunking effects but as a crate we can pick suitable default buffer sizes for our users, which can get quite complex.

Now that we have minimized the latency from buffers in the audio pipeline that leaves the OS. We need to keep its buffer as small as possible, that in turn requires a consistent processing time. This is a soft realtime constraint, it’s a soft constraint since no one really dies when we fail to provide a sample on time.

There are some rules of thumb people use for real time programming. They vary from never (de)allocating (the allocator could block on locking a Mutex) to not even using Mutexes at all. Figuring out these “rules” is complicated and easy to make a mistake in.

We could set up a list of real-time programming rules - ever increasing in exemptions and nuance - and require contributions to follow those to the letter. That would however also mean checking every dependency against these rules. Making matters worse, unless a dependency explicitly states its goal is to be real-time safe we’d need to re-check it whenever we update it. Not only is this a lot of - rather annoying - work it requires expert knowledge. As a project we want to make it easy to use and extend Rodio. Most of our contributors have a deep understanding of audio, but they are not performance engineers. We would lose amazing people if we were to suddenly require that specific expertise. Not only that, it opens the door to endless bike-shedding. In my opinion it would kill the project.

Instead we will try to measure whether the rate at which we make samples available is consistently fast. We can then see if new contributions improve that metric. With that as a measure of realtime safety, we can, together with our existing throughput benchmarks, really start optimizing Rodio.

Reporting realtime safety

When holding new contributions to a metric, it alone should never determine whether a contribution is merged. Is a slowdown worth it when fixing a bug? How about when adding a really desired feature? Now what about increasing throughput by 50% while worsening latency for buffer sizes under 500 samples? To help make these decisions our tool will try to show what a change could mean for each Rodio use-case. I chose my wording carefully there, because making a representative and repeatable benchmark is incredibly difficult.

When we introduce this tool we’ll have a rough approximation for common use-cases. We will make it simple for users to run it on their audio pipelines so they can easily find real-time behavior issues in Rodio. When they report such cases we will ask them to contribute a minimal pipeline which reproduces the issue to the reporting suite, improving it over time.

Design

Rodio’s audio pipeline is OS buffer size agnostic, therefore our report needs to be as well. With that in mind, I considered two methods for gathering the necessary data:

  • Timing samples: We measure how long it takes to generate each sample, then calculate how close to an underrun the pipeline got for each common buffer size.

    • It’s fast: only goes through the pipeline once, never sleeping.
    • Gives an indication of throughput, which could be useful for optimizing.
    • Can present the bad luck case where the latency spikes align in one buffer.
    • May be unrepresentative: measuring the elapsed time slows the code down and could prevent key optimizations.
  • Timing simulated callbacks: We iterate through the audio pipeline multiple times replicating the callback behavior for various OS buffer sizes. Each time we measure how long it took to collect the next buffer.

    • Slow: needs to run at normal playback speed to replicate OS behavior.
    • May hide bad latency spikes: two neighboring latency spikes may fall in separate callbacks purely by chance.
    • Allows for optimal code generation: no timing code in the hot loop.

As we cannot put this in CI easily (shared VMs are not the place to do performance testing). We’ll want every contributor to run our tool twice when opening a PR: once on the old code, once on their changes. As the suite grows over time this will get more and more annoying unless it is fast. So we will use the Timing samples approach. Its main disadvantage will be the measuring overhead after every sample is yielded.

Measuring Time

Getting the time in Rust is trivial: simply call Instant::now. On UNIX this ends up running clock_gettime64. Is that not a syscall though, and aren’t those really slow? Yes and no: clock_gettime64 is handled through the vDSO (virtual dynamic shared object) which places the needed information in user space accessible memory. Instead of an expensive system call Instant::now is therefore a normal user space function call which only performs a few memory accesses. So how fast is that? I ran a micro benchmark of it (see Appendix B) which reports 20ns per call.

Still, can we do better?

An x86_64 CPU tracks the number of cycles (not really but let’s not go into that!) since reset in a special register. We get a dedicated instruction to read it: rdtsc (ReaD TimeStamp Counter). That takes only 38 whole cycles on my machine [2], wicked fast. I again ran a micro benchmark, it reported only 9ns per “call” (Appendix C). Not only is it way faster we also get ‘cycle’ count resolution. Finally note that cycle tracking is not exclusive to x86_64, for example on modern ARM we get a similar instruction: ccnt (Cycle CouNTer).

Putting these numbers into perspective, most audio pipelines will need to output stereo at 44100hz, that comes down to a sample every 11 microseconds. We can measure the elapsed time three orders of magnitude faster. Clearly both methods will be fine for latency peak detection.

Codegen

The performance penalty from measuring time can be explained from looking at the codegen. The micro benchmarks in Appendix B and C can hardly be called representative, still they might give an idea of the optimization potential of both methods.

Below we see the the optimized assembly for Instant::now. Note that the call does not get inlined. We also see two registers being used to return the result, that checks out as Instant has a size of 16 bytes.

        mov     r14, qword ptr [rip + std::time::Instant::now::h9352a5813f06301d@GOTPCREL]
.LBB6_1:
        call    r14      // call instant now
        mov     qword ptr [rsp + rbx + 72], rax // store the result
        mov     dword ptr [rsp + rbx + 80], edx // store the result
        add     rbx, 16
        cmp     rbx, 1608
        jne     .LBB6_1  // loop back to the call

When we use rdtsc, the optimized assembly features a partially unrolled loop, which reads rdtsc and stores the result five times each iteration. Here is one of those read and store section:

.LBB0_3:
        rdtsc   // puts the current timestamp in register rdx
        shl     rdx, 32
        or      rdx, rax
        mov     qword ptr [rsp + rcx + 32], rdx
        ...
        addq	$40, %rcx
        cmpq	$80032, %rcx
        jne	    .LBB0_3

It amazes me how calling the OS provided clock_gettime64 takes only twice as long as directly accessing the timestamp register! The instruction cache combined with pipelining is truly amazing. What does using rdtsc get us? We save a register and some space in the instruction cache. It could be that the space we use with Instant::now is desperately needed by the audio code. Even if that was true realistically we would at most get one extra L1 (instruction) cache miss per sample. As an L1 miss is still orders of magnitude faster than the mean time allowed to generate a sample the extra time from the cache miss will not change whether the tested audio code is reported as real time safe. It is a problem if we want to use the tool to guide manually optimizing the code.

Given that cycle tracking through rdtsc and equivalents has less impact on codegen, more precision and is faster we will use it when available. For latency peak detection we can confidently fall back to Instant::now where cycle counting is not available.

Repeatable measurements

We assume the time between samples to be a consistent measure of the amount of computational work performed and blocking occurred. That is generally not true though, why and what can we do about that?

Pre-emption

While our measurement is running it can be paused by the OS to make room for other work when this happens our thread is “pre-empted”. While paused, time - unfortunately - still progresses without the audio pipeline yielding samples. Pre-emption therefore shows up in our results as big latency spikes. Whether pre-emption will happen at all depends on what the rest of the system is doing making it more or less random. What can we do about this?

On Linux one way to minimize pre-emption is to request real-time scheduling at a high priority. This is a must for low latency applications which is why most linux distributions these days provide a method to do this from userspace (rtkit). Under realtime, scheduling even if our thread ends up running blocking code and yielding as soon as it is able to make progress again, a lower priority thread will be pre-empted to make space for it. That is unless there is another equal or higher priority thread which can also make progress. For our measurement we could use the max priority of 99, though in production you should copy the priority of the sound server.

I expect there to be few real-time scheduled tasks on a normal system, and even less at the highest priority (99). Lets see if I’m correct:

ps -ATo class,pri | awk
'BEGIN                                                   { count=0 }
($1 == "FF" || $1 == "DLN" || $1 == "RR") && $2-40 >= 99 { count += 1 }
END                                                      { print count }'

This lists all threads for all processes (-AT), printing the scheduling class and priority (-o class,pri) in two columns. Then it uses an awk script to print the number of threads that run under one of the real time scheduling classes at max priority. Since ps displays realtime priorities 0 to 99 as numbers in the range 40 to 139 the awk script has to subtracts 40 from the priority field ($2).

On my 32 hardware thread machine this prints 33, that is quite a few more threads than I expected. It turns out each hardware thread gets its own kernel thread that handles scheduling. That obviously must run at max priority or it would never get to do any scheduling in the presence of max priority threads. These scheduling threads should not cause problematic pre-emption since they will simply yield back to our loop immediately. The 33rd thread turns out to be rtkit’s: the user space realtime scheduler. For the same reason this does not have to be a problem.

We could go down this road, praying we never encounter more max priority threads. We might even check on what core rtkit is running and pick another. If only we could just claim a part of the CPU for our benchmark… wait, we totally can!

On Linux we can use cpuset to isolate CPU cores, moving all threads - including kernel ones - off those cores, and preventing new ones from being scheduled there. Great! Now we can get a whole CPU core for our measurements.

CPU speed

Any change in CPU speed changes the aforementioned relation between time and work performed, making it useless for comparison. There are three factors that will change the CPUs speed:

  • The OS - trying to be energy efficient - running the CPU slower to save power.
  • The CPU itself boosting for short periods under high load.
  • The CPU overheating and thermal throttling, slowing down.

We can tell the OS to run under the performance scaling governor and instruct the CPU to run at a fixed frequency. It’s impossible, however to ensure proper cooling for the user’s system. We could run at a significantly lower base frequency, that would dramatically lower the chance of thermal throttling. It would also slow down the measurement for everyone, so instead we will monitor the CPU. If the system thermal throttles, we will ask the user to run the measurement at a lower base CPU frequency (or put their machine in the fridge).

Other factors

This covers the major sources of inaccuracies and should be enough for our application. We can do more though! For example we might move interrupt handlers out of ‘our’ CPU core and disable address space layout randomization for our binary.

Finally these measurements say nothing about how the pipeline would perform when actually pre-empted. I would argue that any latency sensitive application must run under a real-time scheduling policy. When using Rodio the pipeline usually runs like this. We really ought to verify this and let our users know in some way when it’s not the case. Still I’ve been playing with an idea on how we might be able to get somewhat repeatable results representing an audio pipeline under a non realtime scheduling policy. We would still use cpuset to create our own little dominion on the CPU but now we run a fixed repeatable workload concurrently with our measurement. This would only work with the timing simulated callbacks method. Something for the future.

Implementation

As I said before, we do not want this tool to be a barrier to contributing, so let’s see if we can make everyone enjoy using it. I like tools when they are trivial to use and do not require me to do anything. So we will automate as much as we can. When you start the tool it will:

  • Check out Rodio’s main branch and do a baseline measurement storing the results.
  • Go back to the commit the tool was invoked on and perform another measurement, again storing the result.
  • Present significant differences between the runs and overall statistics.

The measurement logic itself is trivial. Since Rodio sources are iterators we simply call next in a loop while recording the time at the end of each iteration. The loop ends when we have collected enough data.

for measurement in measurements {
    black_box(source.next()); // don't optimize out what we want to measure
    *measurement = unsafe { core::arch::x86_64::_rdtsc() }; // or Instant::now
}

Right before the loop starts, we use cpuset to clear a CPU core and move ourselves there. Then we set the CPU governor to performance, and instruct the CPU to use a constant frequency. The teardown code for those changes lives in the drop implementation of a struct to prevent any panic during measuring from leaving the system in the wrong state. While running in parallel on another core we track the CPU frequency to detect thermal throttling.

Appendix A

A short example showing how two chunking effects still follow the latency equation (introduced at the end of the background section) even when there is a windowing effect in between.

chunking     windowing    chunking
[4 3 2 1] -> [       ] -> [       ]
[4 3 2  ] -> [1      ] -> [       ]
[4 3    ] -> [2 1    ] -> [       ]
[4      ] -> [3 2 1  ] -> [       ]
[       ] -> [4 3 2 1] -> [       ]
[       ] -> [  4 3 2] -> [1      ]
[8 7 6 5] -> [  4 3 2] -> [1      ]
[8 7 6  ] -> [5 4 3 2] -> [1      ]
[8 7    ] -> [6 5 4 3] -> [2 1    ]
[8      ] -> [7 6 5 4] -> [3 2 1  ]
[       ] -> [8 7 6 5] -> [4 3 2 1]

Appendix B

A micro-benchmark determining how long it takes to get the time using Instant::now. Executed on a AMD Ryzen 9 9950x3d with a clock speed of 3GHz.

use std::mem::{self, MaybeUninit};
use std::time::Instant;

fn main() {
    const LEN: usize = 100;
    let mut times = {
        let mut times: [MaybeUninit<Instant>; LEN] = [const { MaybeUninit::uninit() }; LEN];

        for time in &mut times {
            time.write(Instant::now());
        }

        unsafe { mem::transmute::<_, [Instant; LEN]>(times) }
    };

    let first = times.first().unwrap().clone();
    let elapsed: Vec<_> = times
        .into_iter()
        .skip(1)
        .map(|t| t.saturating_duration_since(first))
        .collect();

    println!(
        "Average time per Instant::now(): {:?}",
        (elapsed[98] - elapsed[48]) / 50 // let cache warm up
    )
}

Appendix C

A micro-benchmark determining how long it takes to get the time using rdtsc. Executed on a AMD Ryzen 9 9950x3d with a clock speed of 3GHz.

    use std::time::Instant;

    fn main() {
        const LEN: usize = 10_000;
        let mut times: [u64; LEN] = [0u64; LEN];

        // also gets Instant::now's code into the instruction cache.
        let started = Instant::now(); // 20ns overhead
        for time in &mut times {
            *time = unsafe { core::arch::x86_64::_rdtsc() };
        }
        let done = Instant::now(); // 20ns overhead

        // no reordering into of anything into the analysis part please
        std::hint::black_box(&done);
        std::hint::black_box(&times);

        let elapsed = done.saturating_duration_since(started);
        println!(
            "average rdtsc call took: {:?}",
            elapsed.div_f64(times.len() as f64)
        );
    }

[1]: Alessio Balsini, Tommaso Cucinotta, Luca Abeni, Joel Fernandes, Phil Burk, Patrick Bellasi, Morten Rasmussen, Energy-efficient low-latency audio on android, Journal of Systems and Software, Volume 152, 2019, Pages 182-195, ISSN 0164-1212 [2]: https://uops.info/html-instr/RDTSC.html#ZEN5 [3]: https://ankush.dev/p/reliable-benchmarking [4]: https://llvm.org/docs/Benchmarking.html [5]: Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney, Producing Wrong Data Without Doing Anything Obviously Wrong!