LLDB comes with built-in syntax highlighting for C-based languages, which it implements using the embedded Clang compiler. Other languages that follow this model of embedding the compiler, such as Swift, can take a similar approach.

Motivation

There are two important use cases where relying on the compiler for syntax highlighting doesn’t work:

  1. Some languages, like Rust and Zig, don’t have their own type system plugin in LLDB and instead reuse the Clang type system.
  2. LLDB can create synthetic stack frames representing any language, including interpreted languages like Python.

In both scenarios, there is no embedded compiler to rely on for syntax highlighting. Creating a language plugin for every language, each with its own syntax highlighting library dependency, doesn’t scale.

What if, instead, LLDB could use a general syntax highlighter that supports almost any programming language? That’s exactly the idea behind using Tree-sitter.

Tree-sitter

Tree-sitter is both a parser generator tool and a language-agnostic parsing library. Its biggest selling points are:

  • The library is general enough to work with any programming language.
  • Incomplete code is handled gracefully.
  • It has no dependencies, is MIT-licensed, and is written in C, so it can be embedded easily.
  • A large collection of grammars exists for various languages.

Tree-sitter consists of two components: the C library and a command-line tool. The library relies on per-language parsers to support different languages, and the command-line tool generates each parser from a grammar.

Grammars

A Tree-sitter grammar is a set of formal parsing rules (grammar.js) that defines the syntax of a language. The command-line tool generates a parser (parser.c) for that language, which the library can then use.

Many languages have some tokens whose structure is hard to describe with a regular expression. Grammars for those languages include an external scanner (scanner.c): hand-written code that recognizes those tokens.
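To see why such tokens need custom code, consider a raw string that opens with r, any number of # characters, and a quote, and that closes only at a quote followed by the same number of # characters (the form Rust uses). Matching those counts is beyond a plain regular expression. The Python sketch below shows the kind of logic an external scanner carries. It is only an illustration: real scanners are written in C against Tree-sitter's scanner interface, and the function name scan_raw_string is hypothetical.

```python
from typing import Optional


def scan_raw_string(source: str, start: int) -> Optional[int]:
    """Return the index just past a raw string that begins at `start`.

    Returns None if no complete raw string starts at that position.
    Hypothetical helper for illustration; not part of Tree-sitter or LLDB.
    """
    pos = start
    if pos >= len(source) or source[pos] != "r":
        return None
    pos += 1

    # Count the opening '#' characters; the closing delimiter must match.
    hashes = 0
    while pos < len(source) and source[pos] == "#":
        hashes += 1
        pos += 1

    # The opening quote must follow the hashes directly.
    if pos >= len(source) or source[pos] != '"':
        return None
    pos += 1

    # The token ends at the first '"' followed by `hashes` '#' characters.
    closing = '"' + "#" * hashes
    end = source.find(closing, pos)
    if end == -1:
        return None  # unterminated raw string
    return end + len(closing)
```

Because the closing delimiter depends on how many # characters were seen at the start, the scanner has to carry state across the token, which is what the custom code is for.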

Syntax Highlighting

Tree-sitter uses queries to pattern-match on its syntax trees. Grammars provide a highlights query (highlights.scm). The highlights query uses captures to assign highlight names to different nodes in the tree. Each highlight name can then be mapped to a color.
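The last step, mapping highlight names to colors, can be sketched without Tree-sitter itself. The Python example below assumes the captures have already been produced as (start, end, name) tuples. The theme, the function names, and the sample query in the comment are all made up for illustration and are not taken from LLDB or from any particular grammar. Highlight names are often dotted (for example string.special), so the sketch falls back to the longest matching prefix, a convention used by Tree-sitter's own highlighting tooling; whether LLDB resolves names the same way is not stated here.

```python
# A highlights query pairs tree patterns with capture names, for example:
#
#   (comment) @comment
#   (string_literal) @string
#   "return" @keyword
#
# The node names above are illustrative only.

RESET = "\x1b[0m"

# Hypothetical theme: highlight name -> ANSI escape sequence.
THEME = {
    "comment": "\x1b[32m",   # green
    "string": "\x1b[31m",    # red
    "keyword": "\x1b[35m",   # magenta
    "function": "\x1b[34m",  # blue
}


def resolve_color(name, theme=THEME):
    """Look up a highlight name, falling back to shorter dotted prefixes."""
    parts = name.split(".")
    while parts:
        color = theme.get(".".join(parts))
        if color is not None:
            return color
        parts.pop()
    return None


def colorize(source, captures, theme=THEME):
    """Wrap each captured range of `source` in its theme color.

    `captures` is a list of (start, end, highlight_name) tuples, sorted by
    start and non-overlapping. Offsets are character indices in this sketch;
    Tree-sitter itself reports byte offsets.
    """
    out = []
    pos = 0
    for start, end, name in captures:
        out.append(source[pos:start])  # uncaptured text stays unstyled
        text = source[start:end]
        color = resolve_color(name, theme)
        out.append(f"{color}{text}{RESET}" if color else text)
        pos = end
    out.append(source[pos:])
    return "".join(out)


source = 'return "hi"  # done'
captures = [(0, 6, "keyword"), (7, 11, "string.special"), (13, 19, "comment")]
print(colorize(source, captures))
```

Keeping the theme separate from the query means the same highlights query can serve terminals with different color settings.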

Tree-sitter in LLDB

The goal was for Tree-sitter to extend, rather than replace, the existing compiler-based syntax highlighting in LLDB. Therefore, the existing highlighter was refactored to become its own plugin instead of being part of the language plugin.

Most of the Tree-sitter highlighter plugin is generic. A language-agnostic base class is shared by all Tree-sitter highlighters.

Each highlighter contains a vendored copy of the grammar, the scanner if necessary, and the highlights query. At build time, the command-line tool generates the parser, and the highlights query is embedded in a generated header so that it doesn’t need to be loaded at runtime. Finally, the parser is statically linked into the plugin.
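Embedding the query in a header can be pictured with a small generator script. The Python sketch below is an assumption about the general technique, not LLDB's actual build logic, which lives in its CMake files. The function names, the variable name kHighlightsQuery, and the TSQUERY delimiter are hypothetical.

```python
from pathlib import Path

# Delimiter for the C++ raw string literal (hypothetical choice).
DELIMITER = "TSQUERY"


def render_query_header(query_text: str, variable_name: str) -> str:
    """Return C++ header text that embeds `query_text` as a raw string."""
    terminator = f"){DELIMITER}\""
    if terminator in query_text:
        # The query would end the raw string literal early.
        raise ValueError("query contains the raw string terminator")
    return (
        "// Generated file, do not edit.\n"
        "#pragma once\n"
        "\n"
        f"static const char {variable_name}[] = R\"{DELIMITER}("
        f"{query_text}"
        f"{terminator};\n"
    )


def write_query_header(query_path: str, header_path: str, variable_name: str) -> None:
    """Read a highlights query file and write the generated header."""
    query_text = Path(query_path).read_text(encoding="utf-8")
    header = render_query_header(query_text, variable_name)
    Path(header_path).write_text(header, encoding="utf-8")
```

A build step could call write_query_header("highlights.scm", "HighlightsQuery.h", "kHighlightsQuery") once per language; the output file name here is also made up. With the query compiled in, the plugin has no query file to locate or read at runtime.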

Supported Languages

  • Swift (under review)
  • Rust (under review)
  • Zig (coming soon)
  • Python (coming soon)

Enabling Tree-sitter

Tree-sitter is an optional dependency in LLDB. By default, the feature is enabled automatically if the library and command-line tool are available. It can be forced on with -DLLDB_ENABLE_TREESITTER=ON, in which case CMake fails to configure if the dependencies cannot be found.