Source-code emission
In plain terms
panproto reads source code in 261 languages by sitting on top of tree-sitter, which has a community-maintained grammar for each one. The reverse direction, writing source code back out from a schema, is harder. tree-sitter grammars are written for parsing; they record the production rules the parser uses but not the spacing, indentation, or alternative-disambiguation choices an emitter needs. panproto’s source-code emitter, emit_pretty, closes that gap by deriving emission behaviour structurally from the same grammar.json the parser already ships.
The output of every emit is a function of three things: the schema, the production grammar, and a small per-language cassette that supplies defaults for the cases grammar.json cannot describe. Everything else (token roles, bracket pairs, spacing rules, indent triggers, CHOICE dispatch) is derived at construction time from the grammar’s own structure. There is no per-language emitter code.
A separate format-preserving codec covers structured-data formats (JSON, YAML, TOML, XML, CSV) and guarantees byte-for-byte round-trip. The two systems are independent; this page is about the tree-sitter source-code path.
The two halves of the gap
grammar.json lists every rule (SEQ, CHOICE, REPEAT, OPTIONAL, FIELD, ALIAS, SYMBOL, STRING, PATTERN, IMMEDIATE_TOKEN, PREC, TOKEN, RESERVED) the parser walks. It does not record:
- Whitespace and layout. Whether
a + bora+b; whetherif (x) { ... }indents its body; whetherdef f(): x = 1is one line or two. Tree-sitter strips whitespace during parse; nothing ingrammar.jsonsays how to put it back. - CHOICE disambiguation for synthesised schemas. When two CHOICE alternatives both admit the same children (
binary_expression CHOICE[+, -, *, /, ...]over the same operand kinds), the parser picks via its parse-table state. A schema built from scratch (no parse history) has no record of which alt was taken.
emit_pretty handles the first half structurally and the second half via a layered fallback hierarchy.
Grammar-derived token roles
Every STRING literal in a rule body is classified by its structural role at Grammar::from_bytes time. There are eight roles, classified per (rule, position) pair, not per token text:
| Role | Identification rule |
|---|---|
BracketOpen | First STRING in a SEQ whose first / last STRINGs satisfy the matched-pair predicate |
BracketClose | Last STRING of the same SEQ |
Separator | First STRING in a REPEAT body’s inner SEQ |
Keyword | STRING matching [a-zA-Z_][a-zA-Z0-9_]* |
Operator | Non-alphanumeric STRING between content members inside a CHOICE alternative |
Connector | Non-alphanumeric STRING between content members in a standalone SEQ (a structural connector such as . or ::, not an algebraic operator); emits no surrounding space |
Terminal | Text from a leaf vertex’s literal-value constraint |
Immediate | A token the grammar wraps in IMMEDIATE_TOKEN, glued to its neighbour with no intervening whitespace |
The matched-pair predicate identifies (), [], {}, <>, begin/end, do/done, |...|, <<>>, ${}, ⟨⟩, and every other bracket-like construct from grammar structure alone. There is no fixed character set: a per-SEQ check that the first and last STRINGs are different (or same-text with at least one wrapped in IMMEDIATE_TOKEN, e.g. regex /.../), with non-STRING content between them, identifies the pair.
The role-pair lookup table (needs_space_by_role) takes two adjacent token roles and returns whether to emit a space. No token text is inspected by the layout pass; spacing decisions are entirely role-driven. The table is small (one fixed function) and the same for every grammar.
Indent triggers
A bracket-open token triggers indentation when the content between open and close contains REPEAT or REPEAT1 (recursively through CHOICE / SEQ wrappers). This is the structural witness of “block-level content the source author would have indented”. Function bodies, statement lists, struct fields, namespace contents all match; inline constructs like type parameter lists, function arguments, and string interpolation do not, even though they may use the same bracket pair characters in other rules.
Word-like bracket pairs (function/end, if/end, module/end, etc., the matched-pair shape with alphanumeric delimiters) always trigger indentation: they only appear in block constructs across the 261 vendored grammars.
CHOICE dispatch
At emit time, pick_choice_with_cursor selects which CHOICE alternative to walk for a given vertex. The pipeline composes several structural filters; each rejects alternatives that cannot be the right pick given the cursor’s unconsumed edges and the schema’s constraint witnesses.
The categorical core is the acceptance predicate, accepts_first_edge(prod, edge_field, target_kind), an inductive function over the production tree:
| Production | Acceptance rule |
|---|---|
STRING / PATTERN / BLANK | reject (consumes no edges) |
SYMBOL X (concrete) | edge_field == "child_of" AND target_kind ⊑ X |
SYMBOL X (hidden / supertype) | accepts(X.rule, edge) |
ALIAS{c, named:true, value:V} | edge_field == "child_of" AND target_kind == V |
FIELD{name, content} | edge_field == name AND content.yield admits target_kind |
SEQ[m1, m2, ...] | accepts(m1, edge) OR (m1 ε-able AND accepts(SEQ[m2..], edge)) |
CHOICE[a1, a2, ...] | any of accepts(ai, edge) |
OPTIONAL / REPEAT / REPEAT1 / wrappers | accepts(inner, edge) |
accepts_first_edge fuses four previously-separate ad-hoc checks (FIELD-name matching, SYMBOL subtype dispatch, ALIAS rewrite, yield-set admission) into one rule.
Three discriminators layer on top when more than one alternative passes accepts_first_edge:
- Token-set restriction. An alt’s FIELD body of shape
ALIAS{CHOICE[STRING_1, STRING_2, ...], value: V}constrains the field child’s literal to that set. If the cursor’s field-named edge target carries a literal outside the set, the alt is rejected. This is what disambiguates Gocall_expression(alt 0 hasfunction: ALIAS{CHOICE["new","make"], value:"identifier"}, valid only when the function name is literallynewormake). - Alias-source filter. An alt’s FIELD body of shape
ALIAS{SYMBOL X, named: true, value: _}requires the cursor’s field-named edge target to carry apre-alias-symbolconstraint equal toX. The walker recordspre-alias-symbolfromtree_sitter::Node::grammar_name()whenever it differs fromkind(); this is the only ALIAS-disambiguation signal tree-sitter 0.25 / 0.26 actually exposes. - Positional interstitial scoring. When the above filters leave more than one candidate, alts are scored against the slice of recorded interstitials from the current cursor position forward. The cursor’s consumed count gives the slice offset, so a trailing CHOICE-with-BLANK at position N sees only the interstitial at gap N, not the comma separators from earlier REPEAT iterations.
On yield-set ties after every filter, tree-sitter precedence (PREC annotations on alternatives) breaks the tie. This honours the grammar author’s explicit disambiguation rule.
The subtype closure
accepts_first_edge consults a precomputed subtype closure K ⊑ Y (“a vertex of kind K can appear where the grammar says SYMBOL Y”). The closure is computed once at Grammar::from_bytes time:
- Walk every hidden rule (
_-prefixed) and supertype rule’s body to identify which concrete kinds satisfy each dispatch point. - Walk every named ALIAS (
ALIAS{c, named:true, value:V}) and record that the valueVis satisfied by every kind reachable fromc. - Close the relation under composition: a Tarjan SCC over the dispatchable-only subgraph (hidden / supertype names form the nodes; satisfying edges are the transitions), producing an exact
O(V + E)transitive closure without any iteration cap. The closure is keyed by concrete kind name, soaccepts_first_edge’s lookup isO(1).
Cassettes
The output of every external scanner token must come from somewhere. tree-sitter’s external scanners are C code that produces tokens whose text varies at runtime; grammar.json declares the token name but not the text. Four resolution tiers cascade:
- Anonymous ALIAS.
ALIAS{SYMBOL ext, named:false, value: V}declares that the external token emits exactlyV. Read directly from grammar structure. - CHOICE equivalence.
CHOICE[SYMBOL ext, STRING s]declares that the external token is equivalent to the literals. Read directly from grammar structure. - CstComplement. For tokens whose text was captured at parse time, the schema’s
literal-valueconstraint on the vertex carries the actual text. Used for heredoc bodies, raw-string content, regex patterns, escape sequences. - Cassette default. When none of the above applies, a per-grammar
GrammarCassettesupplies a default.
The cassette layer itself composes two parts:
- Universal layer (
common_external_default). A closed table of name-pattern → default-text mappings derived from a structural audit of every vendored grammar. Recognises layout markers (_concat,_no_space,_brace_start), immediate-position markers (_immediate_*), error sentinels (_error_*,error_sentinel), automatic-semicolon family (_automatic_semicolon,_optional_semi), generic string delimiters (string_start,string_end), heredoc / raw-string content (returns""; actual text comes fromliteral-value), and the descendant-operator family from CSS-like grammars. - Per-grammar override (
GrammarCassette::external_token_default). A small per-grammar implementation that overrides specific tokens. The lookup composes per-grammar first, universal fallback second, viaresolve_external_token.
The per-grammar overrides exist only where the language needs an override on top of the universal layer. The default empty cassette (used for ~230 of the 261 grammars) delegates entirely to the universal layer.
IMMEDIATE_TOKEN as a layout marker
tree-sitter’s IMMEDIATE_TOKEN is the grammar’s explicit signal that the wrapped token must have no preceding whitespace. emit_pretty lifts this to a single NoSpace marker emitted at the unique structural site where IMMEDIATE_TOKEN is declared:
- When the production walker enters
Production::ImmediateToken, emitNoSpace. - When
emit_vertexenters a vertex whose rule body isIMMEDIATE_TOKEN(...)at the head, emitNoSpacebefore any content the leaf shortcut or rule-body walk produces.
The layout pass reads the marker; downstream code does not need to inspect production shapes to recover the property. This is what makes regex literals like /abc/g round-trip tight on both delimiters: regex_pattern’s rule body starts with IMMEDIATE_TOKEN, and the trailing / is wrapped in IMMEDIATE_TOKEN in the regex SEQ.
The verification tier API
emit_pretty is structurally sound for any grammar with a vendored grammar.json, but structural soundness is not the same as an output that round-trips. The fixed-point law emit_pretty(parse(emit_pretty(s))) == emit_pretty(s) is the actual correctness witness, and it has to be exercised per grammar. ParserRegistry::emit_verification_status reports which tier a protocol falls into:
| Status | Meaning |
|---|---|
Verified | The protocol round-trips under the strict oracle described below; downstream tooling may trust its emit |
Generic | The protocol is registered and the generic dispatch path applies, but no test asserts emit correctness; output is structurally derived and likely correct, yet unverified |
Unsupported | The protocol is not registered, or its grammar lacks the vendored grammar.json that emit_pretty requires |
Downstream tooling (notably quivers and other transpile pipelines) calls this API upfront and refuses emit on any backend returning Generic or Unsupported. There are no runtime warnings; the silence-versus-noise tradeoff falls on the caller.
A protocol earns Verified on one of two bases. The strong basis is corpus verification: every entry in the grammar author’s own test/corpus/ round-trips under the full oracle, which the strict emit_corpus_audit test checks over the whole corpus rather than one hand-written sample. The oracle is three conjuncts on each corpus entry s. Writing e1 for emit_pretty(parse(s)) and e2 for emit_pretty(parse(e1)), it requires e1 == e2 (emit reaches a fixed point), kind_multiset(parse(s)) == kind_multiset(parse(e1)) (no vertex kind is gained or lost), and the edge multisets equal likewise (no edge is gained or lost). Notice what the oracle does not require: e1 == s. A verified protocol may legitimately reformat, as json re-indents arrays and go applies gofmt spacing; what it may not do is alter the abstract syntax tree or fail to reach a fixed point. The second basis is backend verification: a quivers transpile backend (python, stan, bugs, jags, julia, scheme, javascript) covered by dedicated emit regression tests over the construct surface quivers actually emits, with full corpus pass tracked as follow-on work.
The current Verified set covers 255 protocols, kept in sorted order in VERIFIED_EMIT_PROTOCOLS in crates/panproto-parse/src/registry.rs. Adding a protocol requires the corpus audit to pass for it (or a backend regression test to cover it), then appending the name to that array. An earlier expansion to 149 protocols on single hand-written samples was reverted after the corpus audit showed most failed their own grammar’s test corpus: one sample is not a sufficient witness.
Limitations
Three classes of input fall outside what the structural pipeline can recover:
- By-construction CHOICEs with no constraint signal. When a schema is built from scratch (no parse history) and a CHOICE is over multiple alternatives that all admit the same children (
CHOICE[STRING "model", STRING "data"]for a BUGS block keyword), nothing in the abstract content picks. The universal cassette + per-grammar cassette provide deterministic defaults, but the choice is opinionated and may not match the source author’s intent. Decorate an abstract schema is the canonical workflow. - Heredoc / raw-string synthesis. The universal cassette returns
""for heredoc / raw-string content because the actual text is parse-time-dependent. Synthesised vertices with such kinds but no capturedliteral-valuewill emit empty. Quivers and similar synthesis pipelines should populateliteral-valueexplicitly. - Operator precedence in synthesised expressions. No precedence-driven parenthesisation pass exists. Parsed schemas preserve original parens via interstitial constraints, but synthesised
binary_expressionschemas may emit ambiguously and re-parse to a different tree. A precedence pass on the schema before emit would close this gap; deferred.
The deeper limitation, structurally: tree-sitter’s parse-table state (the actual record of which CHOICE alternative the parser took at each step) is not exposed through the C API. The only ALIAS-disambiguation signal tree-sitter 0.25 / 0.26 surfaces is grammar_name() (the pre-alias SYMBOL name), which is consumed via pre-alias-symbol. A future upstream tree-sitter change exposing production / reduce IDs would let the emitter trace the parse exactly, eliminating the heuristic tiers entirely.
See also
- Decorate an abstract schema for the synthesise-then-render workflow.
- Layout enrichment for the categorical framing of the parse / emit relationship.
- Round-trip with format preservation for the structured-data codec (JSON / YAML / TOML / XML / CSV), which is byte-exact and orthogonal to this system.
- Parse full ASTs for the parser side.
- Reference: protocol catalogue for the list of supported languages.