On Apple platforms, the development experience was designed around making the compile-link-debug cycle as fast as possible. For debugging, that means that rather than processing large amounts of DWARF to link it into the final binary, the linker leaves the debug info in the object files and records a debug map that tells the debugger where to find it. When you’re debugging locally, that’s all you need. But if you want to archive the debug info for crash reporting or remote debugging, you need a way to produce a self-contained bundle. That’s where dsymutil comes in.

dsymutil is more than a DWARF concatenator. It’s an optimizing linker that leverages the One Definition Rule (ODR) to deduplicate types across compilation units. In a C++ project, every translation unit that includes a header gets its own copy of the DWARF for every type defined in that header.1 dsymutil identifies equivalent types and keeps only one canonical copy. For large C++ projects, this makes the difference between fitting within Mach-O’s 4GB limit and not. To perform these optimizations, dsymutil needs to parse and semantically analyze the DWARF, which is where most of the time goes.
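The uniquing idea can be sketched in a few lines. This is a hypothetical Python illustration, not dsymutil’s actual implementation (the real linker operates on DIEs in C++); it assumes types can be keyed by a stable qualified name, with the first definition becoming canonical and later copies redirected to it.

```python
# Hypothetical sketch of ODR-based type uniquing. Each compile unit
# carries its own copy of the types it pulled in from headers; the
# linker keeps one canonical copy per unique name.

def unique_types(compile_units):
    """Map each type's qualified name to a single canonical type DIE."""
    canonical = {}          # qualified name -> canonical type DIE
    kept, dropped = 0, 0
    for cu in compile_units:
        for type_die in cu["types"]:
            name = type_die["qualified_name"]
            if name in canonical:
                # Redirect this CU's copy to the canonical one.
                type_die["canonical"] = canonical[name]
                dropped += 1
            else:
                canonical[name] = type_die
                kept += 1
    return canonical, kept, dropped
```

With three translation units each carrying the same header type, two of the three copies collapse onto the first, which is what keeps large C++ projects under the size limit.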

The classic DWARF linking algorithm was single-threaded by design. Debug info for a large project can easily reach hundreds of gigabytes. To avoid loading all of it into memory at once, dsymutil processes one compile unit at a time and streams the output. That constraint makes parallelizing the core linking loop nontrivial. Over the years we’ve made incremental improvements, like processing architectures in parallel and running the analysis and cloning phases in lockstep on separate threads, but the fundamental bottleneck remained: the ODR uniquing happens on one thread. A parallel DWARF linker that can unique types across threads has existed in LLVM for a while. Building that was an incredible effort, but unfortunately it wasn’t quite production-ready due to some major limitations.

The Qualification Problem

The biggest challenge with dsymutil has always been qualification. When we upstreamed dsymutil to LLVM, we qualified it by generating bug-for-bug identical DWARF. We did the same thing when we rewrote the cloning phase to use the lockstep algorithm. Having binary-identical output meant we could run diff on two dSYMs to convince ourselves a change was truly NFC.

The parallel linker can’t produce binary-identical output. It processes compile units concurrently, so the order in which types are encountered and deduplicated is different. The output is semantically the same (or should be), but the DWARF structure, and hence the bytes, differ. That means the binary compatibility approach that qualified every previous dsymutil change doesn’t apply here.

Without a way to compare the output semantically, we had no way to confirm the correctness of the DWARF generated by the parallel linker. It’s relatively easy to spot-check small things in tests, but that doesn’t scale to even medium-sized projects. The really tricky issues only surface at debug-time when the debugger starts misbehaving. In order to even consider the parallel linker in dsymutil, we needed a tool that could tell us, concretely, how the parallel linker’s output differs from the classic linker’s.

Semantic DWARF Diffing

Although DWARF looks like a tree of tags and attributes, it’s really a directed acyclic graph. Attributes can reference DIEs in other parts of the tree. A variable references its type, a type references its members’ types, a subprogram references its parameter types, and so on. Comparing two DWARF outputs means matching nodes across two graphs and verifying that their attributes and reachable subgraphs are equivalent.

You can’t do this by diffing dwarfdump text. The offsets are different, the ordering of DIEs may differ, and the cross-references point to different positions. That’s without even considering that the dwarfdump output for any real-world project is too big to handle for most tools. What you need is to anchor the comparison on stable identifiers like linkage names, declaration coordinates, and type signatures, then walk the graph from there, comparing attributes and children structurally.
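As a rough illustration of that approach, here is one way to match and walk two DIE trees by stable keys instead of offsets. This is hypothetical code, not the actual prototype; the dictionary-based DIE representation and attribute names are assumptions for the sketch.

```python
# Illustrative semantic DWARF diff: match DIEs by stable identifiers
# (linkage name, else declaration coordinates), then compare attributes
# and children structurally rather than by offset or position.

def stable_key(die):
    # Prefer a linkage name; fall back to declaration coordinates.
    return die.get("linkage_name") or (
        die.get("decl_file"), die.get("decl_line"), die.get("name"))

def diff_dies(a, b, path="", diffs=None, seen=None):
    diffs = [] if diffs is None else diffs
    seen = set() if seen is None else seen   # break cycles in the graph
    pair = (id(a), id(b))
    if pair in seen:
        return diffs
    seen.add(pair)
    # Compare attributes present on either side.
    for attr in set(a["attrs"]) | set(b["attrs"]):
        if a["attrs"].get(attr) != b["attrs"].get(attr):
            diffs.append(f"{path}: attribute {attr} differs")
    # Match children by stable key, not by position in the tree.
    a_kids = {stable_key(c): c for c in a["children"]}
    b_kids = {stable_key(c): c for c in b["children"]}
    for key in set(a_kids) | set(b_kids):
        if key not in a_kids or key not in b_kids:
            diffs.append(f"{path}/{key}: only in one output")
        else:
            diff_dies(a_kids[key], b_kids[key], f"{path}/{key}",
                      diffs, seen)
    return diffs
```

A real tool additionally has to ignore legitimately offset-dependent attributes and follow type references across units, but the core shape is the same: anchor on stable keys, then recurse.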

We prototyped a semantic diffing tool and ran it on clang, comparing the classic and parallel linker output. Out of roughly 5 million DIEs, it identified about 50,000 differences. We haven’t verified all of those results, and the tool itself is far from production-ready, but it was sufficient to give us a concrete picture of where the two linkers diverge.

Determinism

The single biggest blocker to adopting the parallel linker was its non-determinism. Reproducible builds are non-negotiable for any serious build tool. Without them, you lose the ability to cache, bisect, and verify your artifacts.

The non-determinism came from how the parallel linker selects canonical DIEs during ODR uniquing. When multiple compile units define the same type, threads race to claim the canonical copy. Whichever thread gets there first wins. Since thread scheduling varies between runs, different runs pick different canonical DIEs.

The fix assigns each compile unit a priority based on its position in the link order. When a thread wants to register a canonical DIE, it only overwrites the current one if its priority is strictly higher, meaning it appears earlier in the input. This guarantees the same canonical DIE regardless of scheduling, while preserving the parallelism.
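A minimal sketch of that rule, with hypothetical names (the real linker is C++ and presumably uses finer-grained synchronization than one lock, but the invariant is the same): a registration only wins if its compile unit appears strictly earlier in the link order.

```python
import threading

# Deterministic canonical-DIE selection: lower CU index = earlier in
# the link order = higher priority. A later-scheduled thread can never
# displace a canonical DIE from an earlier compile unit.

class CanonicalTypeMap:
    def __init__(self):
        self._lock = threading.Lock()
        self._canonical = {}   # type name -> (cu_index, die)

    def register(self, name, cu_index, die):
        with self._lock:
            current = self._canonical.get(name)
            # Overwrite only if this CU appears earlier in the input.
            if current is None or cu_index < current[0]:
                self._canonical[name] = (cu_index, die)

    def canonical(self, name):
        return self._canonical[name][1]
```

However the threads are scheduled, the type ends up anchored to the copy from the earliest compile unit that defines it, so repeated runs produce the same output.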

Switching the Default

The differ is one piece of the qualification strategy, but not the whole picture. It tells us whether the two linkers agree, not whether either one is correct. It’s also the least automatable part of the process. In order to switch the default, we need a comprehensive qualification strategy to ensure correctness.

The first step is running dsymutil’s existing test suite with --linker=parallel. These tests exercise specific DWARF constructs and edge cases, and any test that passes with the classic linker should also pass with the parallel one. Getting all of these to pass, or understanding why they don’t, is a prerequisite for switching the default.

The second step is running the LLDB test suite against dSYMs produced by the parallel linker. LLDB’s tests cover a wide range of debugging scenarios, and by default each test runs in multiple variants, including once against the DWARF in the object files and once against a dSYM. This lets us catch regressions both between the classic and parallel linkers and between the object-file DWARF and the dSYM the parallel linker generates.

The final step is using the diffing tool to manually inspect the differences between the dSYMs generated by the classic and parallel linker for increasingly large projects.

This work is tracked and ongoing.

Performance

To put all of this in perspective: running dsymutil on clang takes roughly 3 minutes with the classic linker compared to about 40 seconds with the parallel linker. The speedup was always the expected outcome. The parallel linker was designed to be faster. The work described here was about building the confidence to actually use it.


  1. On Apple platforms, clang passes -fno-limit-debug-info, which means every compilation unit gets complete type definitions rather than forward declarations for types defined elsewhere. Having the full type information available means the debugger doesn’t have to parse all of the input DWARF, which improves its performance and reliability, but it increases the amount of DWARF that dsymutil has to process. ↩︎