GPUI profiling


Even though GPUI is a very fast UI framework, you can still hit slowdowns. I’ve written this guide to help out when that happens. We’ll discuss:

  • What performance actually means for our users.
  • How async in GPUI is uniquely difficult.
  • How to analyze performance issues.
  • Ways to address common performance issues.
  • Measures that improve performance across the board.

As I work on Zed, this guide is primarily aimed at its contributors. While the examples are Zed-specific, the principles apply to anyone using GPUI!

Performance for our users

We want our users to say our editor is the best performing one they’ve ever used. They, however, only perceive performance - they do not measure it. Metrics like how fast we syntax highlight a document do not matter to them. For Zed to be perceived as performant it must:

  • Feel fast.
  • Not get in their way.
  • Not eat their battery.

Feeling fast

Common actions should take place within one or two frames. We target 120Hz, so a frame is about 8.3ms and one or two frames gives us roughly 8 to 16ms. This is especially true for typing, but also for opening files or UI elements like pickers.

Not getting in the way

Some operations will always be slow to complete. Maybe a slow connection is getting in the way, or the input is simply large. That does not have to matter, as long as we stream in results faster than the user can read them. We now do this in project search.

Not eating their battery

The classic advice for power-efficient computing is to get the device back to a low power state (C-state) as soon as possible: the famous “race to sleep”. This is no longer true on modern machines. We now want to use available efficiency cores and prevent the system from boosting to high clock frequencies which are less power efficient. This is sometimes called the “pace to idle”. At the OS level, this is usually handled via thread/task priorities.

Finally, there is another balance here: continually front loading work can help the app feel faster at the cost of power usage.

GPUI async and performance

GPUI async tasks run on a single-threaded foreground and multi-threaded background executor. The closest analogy to the foreground executor (hereafter, foreground) is tokio’s unstable LocalRuntime.

Frequently-accessed state is kept on this single-threaded foreground. Since we do not need locks or channels when interacting with it, anything modifying its state also needs to run on the foreground. Most of our app needs to do this; therefore, most of our app - while highly concurrent - is single-threaded. The work running on the foreground consists of cooperative tasks. As the foreground is single-threaded there is no work stealing; we can only start new work once the running task hands back control (yields). When a task runs for a long time before yielding, it causes the executor to seemingly hang: nothing can stop the task and nothing else can do work. We render on the foreground, so a delay caused by non-yielding tasks delays rendering, which the user will ‘feel’ as a stutter or hang.

Our background executor spreads tasks across multiple threads running in parallel. Once a background task is assigned to a thread it cannot be moved to another. There is still no proper work stealing, but these threads do share a global queue. A hang in a background task will hang all other tasks already assigned to that thread.

To guarantee that tasks yield in time, it’s not enough to limit async code to nonblocking APIs. If the CPU has to do a lot of work, it might take too long to get to the next possible yield point (a .await). An example of such a high CPU task is syntax parsing. Also keep in mind a .await only means the task could yield there; it does not have to if it can immediately continue. For example, consuming an async stream might never yield if every item in the stream is already available.
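To make the last point concrete, here is a hand-rolled sketch of a yield point (the same idea as yield_now), polled manually so we can see exactly when it yields. This is illustrative only, not GPUI’s implementation:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// A future that returns Pending exactly once, handing control back
/// to the executor, then completes on the next poll.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // ask to be polled again
            Poll::Pending
        }
    }
}

// Minimal no-op waker so we can poll by hand without a real executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(YieldNow { yielded: false });
    // First poll returns Pending: this is the yield, the moment the
    // executor is free to run something else.
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    // Second poll completes immediately.
    assert!(fut.as_mut().poll(&mut cx).is_ready());
}
```

The first poll returning Pending is the yield. A future that is already ready never produces that moment, no matter how many .awaits surround it.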

Analyzing performance issues

Ideally, we would have a large benchmark suite with representative workloads. We are not there (yet). We do now have an easy attribute macro to make our tests do double duty as unit benchmarks. So for now, most performance work will be in response to issues reported by our users.

To figure out what is going on during a slowdown, use:

  • Zed’s built-in minimal profiler: miniprofiler visualized through Tracy.
  • Attach a sampling profiler: such as perf (Linux only), Instruments (Mac only), or samply.
  • Instrument Zed: using tracing visualized through Tracy.

Keep in mind that these performance tools hint at where an issue is coming from, but this isn’t necessarily the root cause. Eliminating a hot call can just move the calculation somewhere else (performance Whack-a-Mole). These tools should inform your understanding of the problem, not the solution.

For specific information on how to set up and run these tools see: https://zed.dev/docs/performance

Miniprofiler

We have a built-in profiler that is always monitoring task duration. When a task takes too long to yield back to the executor (yield at an await point), miniprofiler will log to the terminal and write out a mini-profile. At any time, you can manually trigger this too. You can view the profile in Tracy to see what the foreground executor was working on, and how long each task took. Most performance issues in Zed are not a lack of throughput, but tasks hanging or the executor spending critical time on a less-important task. A mini-profile can be a great start to see if that’s the case. Since it’s built-in and automatically saves on hangs, our users can send us these profiles too.

Note

The profile contains the time for still-running tasks as well. Therefore a task hanging forever will be included in the profile!

Sampling

A sampling profiler records statistics on where the CPU is spending its time. You get a callstack for each of these places or samples. Usually you’ll want to view this as a flamegraph. Both samply and Instruments offer a UI to limit the viewed statistics to a period within the recording. Use that to limit the results to when the issue was occurring.

Notes

  • The profiler needs debug symbols or you will get unreadable function names.
  • Apple’s Instruments provides several additional features, like hang detection and CPU usage tracking. We’ve found those can be wrong and I recommend not relying on them.
  • The profile is statistical: it can miss short-running function calls. Do not use a sampling profiler to reason through what the program is doing.

Processor tracing

While only available on Intel CPUs and M4 Macs, processor tracing records the exact flow of execution, including the exact duration of each call. On Linux, use perf with intel-pt; on Mac, use Instruments with the Processor Trace template.

Notes

  • A drop-in replacement (or extra setting) for the sampling profilers perf and Instruments.
  • As it’s done in hardware, the overhead is far lower than sampling (between 2% and 22%).
  • This generates vast amounts of data: you may not be able to trace for long.

Instrumenting

Instrumentation involves modifying the program to emit information for us to analyze. This is extremely powerful: at its most basic, we can instrument a function to get exactly when it started, how long it ran, and who called it. We can later use this to trace through the program.

This works by creating and entering a Span at the start of the function. When this Span gets dropped, the exit time is recorded and the Span is sent to Tracy. Spans do not need to correspond to a function; they can also be used to measure how long some section of code takes. You can record variables into a Span, such as the arguments to the function, or properties of arguments (like their length).
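As a sketch of the mechanism (not our actual tracing setup), a span is essentially an RAII guard: it records the entry time on creation and the duration on drop. Recording metadata, like a length, instead of the full argument keeps the overhead small:

```rust
use std::time::Instant;

/// Toy version of an instrumentation span: records entry on creation
/// and duration on drop. Real tracing/Tracy spans work the same way,
/// plus sending the data off to a collector.
struct Span {
    name: &'static str,
    entered: Instant,
}

impl Span {
    fn enter(name: &'static str) -> Self {
        Span { name, entered: Instant::now() }
    }
}

impl Drop for Span {
    fn drop(&mut self) {
        let elapsed = self.entered.elapsed();
        // A real implementation would send this to Tracy instead of printing.
        println!("span {:?} took {:?}", self.name, elapsed);
    }
}

/// Hypothetical instrumented function: we record the input's length,
/// never the input itself.
fn parse_document(text: &str) -> usize {
    let _span = Span::enter("parse_document");
    text.lines().count()
} // _span dropped here: exit time recorded and span emitted

fn main() {
    let lines = parse_document("a\nb\nc");
    assert_eq!(lines, 3);
}
```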

In Tracy we can view these spans on a timeline and/or do statistical analysis on (part of) the recording. Some examples of analysis:

  • Finding the functions that took longer than T seconds excluding the time spent by functions called by them (their children).
  • Finding all instances of a function that took longer than T seconds and displaying their arguments in a list.

Notes

  • We still need to add instrumentation support for futures. A Span that crosses an await point reports incorrect data.
  • Instrumenting can create a lot of information. It’s not uncommon to ingest 500MB/s. This is all kept in RAM to prevent slowing down the program. Take care to stop the instrumentation before your system runs out of RAM.
  • Watch out when instrumenting trivial operations which may run millions of times a second (like Add on a newtype). The fixed overhead of instrumenting (tens of nanoseconds) will then become a drag on performance and make callers seem slower than they are without instrumenting.
  • Take care not to record huge arguments (like a long list) as that will also slow down the instrumented function and can even grind the program to a halt. Instead, record metadata like the length of a list.

Addressing performance problems

First, check if there is a bug causing us to do unnecessary work. Maybe we are invalidating a cache when we shouldn’t, could update incrementally but don’t, or are doing work that is invisible to the user.

Tasks hanging the executor

If a task is running for a long time without yielding (hanging), this should show up clearly in the mini-profile.

Common causes and solutions:

  • The task is doing blocking IO: replace the IO with async.
  • A foreground task doing CPU-heavy work for a long time:
    • Run the work in the background and send the result back to the foreground.
    • If the task’s shared state is not used by other tasks, consider explicitly yielding a few times with yield_now.
    • Rewrite the task such that it can yield. Consider buffering the state and atomically swapping it when done.
  • A background task doing CPU heavy work for a long time:
    • Add explicit yield_nows (no need to worry about state, as background tasks own their data).
    • Spawn a dedicated thread for the CPU-heavy work and communicate the result back over a channel.

Keep in mind that hitting a .await does not necessarily mean the task will ever yield there. The Future being awaited could already be in its ready state. This tends to happen when communicating results from the background executor to the foreground one through a queue, especially if the latter is running slowly.
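The dedicated-thread option from the list above can be sketched with nothing but the standard library. In real code the foreground would keep processing events and poll the receiving end from an async task, instead of blocking as this demo does:

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for CPU-heavy work (think syntax parsing).
fn expensive(n: u64) -> u64 {
    (1..=n).sum()
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Run the heavy work on its own OS thread, so no executor thread
    // is blocked while it runs.
    thread::spawn(move || {
        let result = expensive(1_000_000);
        let _ = tx.send(result);
    });

    // The foreground would normally await the result from an async
    // task; we block here only to keep the example self-contained.
    let result = rx.recv().unwrap();
    assert_eq!(result, 500_000_500_000);
}
```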

Runtime overload

When Zed is under high load, the runtime can have more work than it can get through. We always prioritize the UI, as it should stay responsive, while the requested work itself is allowed to be slow. This resolves once the load disappears.

We can address this by lowering the priority of tasks whose results are not directly visible, and by increasing the priority on anything that the user is waiting on.

An example is opening a panel, which triggers many tasks to (pre)-load every entry in the panel. When the user clicks an entry, it should be loaded immediately and not get queued behind the pre-loads. We can trivially do this by setting the priority of the load task spawned in the on_click handler to high.
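GPUI’s actual priority API isn’t shown here, but the effect is easy to illustrate with a priority queue: the high-priority, click-triggered load pops before any queued preloads, regardless of insertion order. A toy sketch, not the real scheduler:

```rust
use std::collections::BinaryHeap;

// Illustrative only: a max-heap keyed on priority pops the most
// important task first, so a user-triggered load skips ahead of
// the queued preloads.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Priority {
    Low,
    High,
}

fn main() {
    let mut queue: BinaryHeap<(Priority, &str)> = BinaryHeap::new();
    queue.push((Priority::Low, "preload panel entry 1"));
    queue.push((Priority::Low, "preload panel entry 2"));
    // The user clicked: this task must not wait behind the preloads.
    queue.push((Priority::High, "load clicked entry"));

    let (priority, task) = queue.pop().unwrap();
    assert_eq!(priority, Priority::High);
    assert_eq!(task, "load clicked entry");
}
```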

Other

Still remaining are the classic performance issues. We could have written suboptimal code or the issues could originate from outside Zed, like git taking a long time. In the case of suboptimal code, there is nothing unique to Zed to discuss here; the solution depends on the specific issue.

When the cause is external, inform the user: they might be able to resolve it (for example if high network latency is causing a delay).

Finally, consider streaming in results if the slow work cannot be sped up, or display a loading bar. This makes the wait more tolerable and keeps our editor feeling fast, even when we can’t actually be fast.

Future work

To close I’d like to argue for a little more work on our performance tooling, perfecting our scheduler, and considering if we can do more background work.

Our new tooling (miniprofiler and instrumentation) has already paid for itself by speeding up the resolution of severe performance issues. There are, however, still a few areas where we are relatively blind:

  • We cannot yet instrument futures. Both Tracy and tracing support async instrumentation; we only need to set them up.
  • We have no insight into the state of the executors. We know how long a task runs but not how long it has been since it was spawned. Knowing this delay would enable us to optimally assign priorities. To implement this we need to figure out how to collect this data and display it in Tracy. Ideally these timings should also be recorded by miniprofiler and it should emit a profile when tasks have to wait too long before starting.
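Measuring that spawn-to-start delay is conceptually simple: store the spawn time with the task and read it when the task first runs. A hypothetical sketch (none of these names exist in GPUI):

```rust
use std::time::{Duration, Instant};

/// Hypothetical wrapper: a task tagged with its spawn time, so the
/// executor can report how long it sat in the queue before running.
struct TimedTask<F: FnOnce()> {
    spawned: Instant,
    work: F,
}

impl<F: FnOnce()> TimedTask<F> {
    fn new(work: F) -> Self {
        TimedTask { spawned: Instant::now(), work }
    }

    /// Run the task, returning how long it waited in the queue.
    fn run(self) -> Duration {
        let queue_delay = self.spawned.elapsed();
        (self.work)();
        queue_delay
    }
}

fn main() {
    let task = TimedTask::new(|| ());
    // Simulate the task sitting in the executor's queue.
    std::thread::sleep(Duration::from_millis(10));
    let delay = task.run();
    assert!(delay >= Duration::from_millis(10));
}
```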

As the scheduler decides what we do first, it is the key to achieving an app that feels fast. While we’ve made massive improvements to it, there is still some work to do. Specifically we need to:

  • Add spawn_blocking and spawn_blocking_with_priority. Previously we ran most blocking tasks using rayon. We had issues with it, and now we run those tasks on the background executor. As they are not async they will never yield back, causing a host of issues. Instead, these blocking tasks should run on a dedicated thread pool where we can leverage the OS for scheduling through pre-emption and work stealing. We can trivially add priority to blocking tasks by running some of the threads at different OS priorities.
  • In light of the race to sleep no longer being efficient, investigate and document the influence of thread priority on power efficiency.
  • Ensure a hierarchy of OS priorities. We need to make sure the OS schedules the foreground executor before the background executor and any threadpools created by Zed dependencies.
  • Go through existing task spawns and assign priorities.

Finally, the text snapshots can only be updated atomically and must run on the foreground executor (as that is where their state is). Updating them takes a while due to how CPU-intensive it is when there are large changes. While updating we cannot yield (as that would break the atomicity). Like any long blocking task on the foreground executor, this keeps us from rendering on time. Fortunately this just keeps us from hitting our 120Hz target and is not perceived as the UI hanging. This will be a hard solve, and is probably not worth our time right now.