26 Full-AST Parsing: Programs as Schemas
Every programming language is a theory in the GAT system. A program is a model (schema) of that theory. Earlier chapters covered type-level structures: structs, interfaces, enums. This chapter extends that treatment to full programs: functions, statements, expressions, control flow, and module structure. The principle is the same: represent everything as vertices and edges.
26.1 The principle
A TypeScript file is not just type declarations. It contains functions with bodies, if/else branches, for loops, try/catch blocks, function calls, and expressions. All of these are structural elements you can represent as vertices and edges in a schema graph. The only difference from type-level schemas is the number of vertex kinds and edge rules.
26.2 Theory extraction from tree-sitter grammars
Tree-sitter provides mature, widely deployed parsers for most major programming languages. Each grammar ships with a node-types.json file that describes every possible AST node type and its fields. This file is structurally isomorphic to a GAT:
| node-types.json | panproto GAT |
|---|---|
| Named node type (e.g. function_declaration) | Sort (vertex kind) |
| Field with required: true (e.g. body) | Operation (mandatory edge kind) |
| Field with required: false | Partial operation (compose with ThPartial) |
| Field with multiple: true | Ordered operation (compose with ThOrder) |
| Supertype (e.g. _expression) | Abstract sort with subtype inclusions |
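To make the table concrete, here is a minimal sketch of the mapping applied to hand-written entries. The `Field`, `NodeType`, and `Operation` structs and the `derive_theory` function are hypothetical stand-ins invented for illustration; they are not panproto's types, and the sketch skips JSON parsing and supertype handling entirely. It also assumes that a field may be both partial and ordered at once, which the table does not state.

```rust
// Simplified stand-ins for node-types.json entries. These structs are
// hypothetical and are not panproto's or tree-sitter's actual types.
struct Field {
    name: &'static str,
    required: bool,
    multiple: bool,
}

struct NodeType {
    kind: &'static str,
    named: bool,
    fields: Vec<Field>,
}

// One operation (edge kind) of the derived theory.
#[derive(Debug, PartialEq)]
struct Operation {
    source_sort: String,
    name: String,
    partial: bool, // the field had required: false
    ordered: bool, // the field had multiple: true
}

// Apply the table row by row: named node types become sorts, and each
// field becomes an operation flagged as partial and/or ordered.
// Anonymous node types (punctuation, keywords) produce no sort.
fn derive_theory(node_types: &[NodeType]) -> (Vec<String>, Vec<Operation>) {
    let mut sorts = Vec::new();
    let mut operations = Vec::new();
    for node_type in node_types.iter().filter(|nt| nt.named) {
        sorts.push(node_type.kind.to_string());
        for field in &node_type.fields {
            operations.push(Operation {
                source_sort: node_type.kind.to_string(),
                name: field.name.to_string(),
                partial: !field.required,
                ordered: field.multiple,
            });
        }
    }
    (sorts, operations)
}
```

Given an illustrative `function_declaration` entry with a required `body` field and an optional second field, this yields one sort and two operations, the second marked partial.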
The extract_theory_from_node_types function reads this JSON and produces a complete Theory:
use panproto_parse::extract_theory_from_node_types;

let meta = extract_theory_from_node_types(
    "ThTypeScriptFullAST",
    tree_sitter_typescript::TYPESCRIPT_NODE_TYPES.as_bytes(),
)?;
// TypeScript has ~180 named node types and ~60 field names.
println!("{} sorts, {} operations",
    meta.vertex_kinds.len(),
    meta.edge_kinds.len());
26.3 The generic walker
Because the theory is auto-derived from the grammar, the AST walker requires no manual mapping table. One AstWalker implementation handles all 10 supported languages:
- The tree-sitter node’s kind() is the panproto vertex kind
- The tree-sitter field name is the panproto edge kind
- Anonymous tokens (punctuation, keywords) are captured as interstitial text
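The three rules above can be sketched on a toy tree. The `Node` and `Graph` types below are hypothetical stand-ins, not the tree-sitter API or panproto's AstWalker, and the fallback edge kind `"child"` for children without a field name is an assumption made only for this example. Anonymous tokens are simply skipped here; capturing them as interstitial text is a separate step.

```rust
// Toy stand-in for a tree-sitter node; hypothetical, not the tree-sitter API.
struct Node {
    kind: &'static str,
    // Field name under which the parent holds this node, if any.
    field: Option<&'static str>,
    // Anonymous tokens (punctuation, keywords) have named == false.
    named: bool,
    children: Vec<Node>,
}

#[derive(Debug, Default)]
struct Graph {
    vertices: Vec<(usize, String)>,     // (vertex id, vertex kind)
    edges: Vec<(usize, usize, String)>, // (parent id, child id, edge kind)
}

// Depth-first walk: the node kind becomes the vertex kind and the field
// name becomes the edge kind, so no per-language mapping table is needed.
fn walk(node: &Node, parent: Option<usize>, graph: &mut Graph) {
    if !node.named {
        return; // anonymous token: not a vertex in this sketch
    }
    let id = graph.vertices.len();
    graph.vertices.push((id, node.kind.to_string()));
    if let Some(parent_id) = parent {
        // Assumption: children without a field name get the edge kind "child".
        let edge_kind = node.field.unwrap_or("child");
        graph.edges.push((parent_id, id, edge_kind.to_string()));
    }
    for child in &node.children {
        walk(child, Some(id), graph);
    }
}
```

Because the walk reads only `kind`, `field`, and `named`, the same function applies to any grammar. In panproto itself, parsing goes through the registry: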
use panproto_parse::ParserRegistry;

// `source_bytes` holds the contents of the file, read by the caller.
let registry = ParserRegistry::new();
let schema = registry.parse_file(
    std::path::Path::new("src/main.ts"),
    source_bytes,
)?;
26.4 Interstitial text and round-trip emission
The walker captures the text between named children (keywords like function, punctuation like { and }, whitespace, and comments) as interstitial constraints with byte positions. The emitter collects all fragments (interstitials plus leaf literals), sorts by byte position, and concatenates:
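As a minimal sketch of that collect, sort, and concatenate step, the function below rebuilds a source string from two fragment lists. The `Fragment` type and the `emit` function are hypothetical and written only for this example; panproto stores interstitials as constraints on the schema rather than in a standalone list like this.

```rust
// A piece of source text with its starting byte offset in the original file.
// Hypothetical type for illustration; not panproto's representation.
#[derive(Debug)]
struct Fragment {
    start: usize,
    text: String,
}

// Rebuild the source by merging interstitials and leaf literals, ordering
// them by byte position, and concatenating their text.
fn emit(interstitials: &[Fragment], leaves: &[Fragment]) -> String {
    let mut all: Vec<&Fragment> = interstitials.iter().chain(leaves.iter()).collect();
    all.sort_by_key(|fragment| fragment.start);
    all.iter().map(|fragment| fragment.text.as_str()).collect()
}
```

The result is byte-identical to the input only if the fragments tile the source with no gaps and no overlaps, which is why whitespace and comments have to be captured along with keywords and punctuation. With panproto, the same round trip goes through the registry: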
let emitted = registry.emit_with_protocol("typescript", &schema)?;
assert_eq!(emitted, original_source); // exact round-trip
26.5 Multi-file projects
A project with multiple files has a schema that is the coproduct of per-file schemas:
use panproto_project::ProjectBuilder;

let mut builder = ProjectBuilder::new();
builder.add_directory(std::path::Path::new("./src"))?;
let project = builder.build()?;
for (path, protocol) in &project.protocol_map {
    println!("{}: {protocol}", path.display());
}
The ProjectBuilder detects languages by file extension, falls back to raw_file for non-code files, and prefixes vertex IDs with file paths for uniqueness in the coproduct.
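The two mechanisms described above, extension-based detection and path-prefixed IDs, can be sketched with the standard library alone. Both functions below are hypothetical illustrations rather than ProjectBuilder's code: the extension table contains a single entry, and the `::` separator between path and vertex ID is an assumption.

```rust
use std::collections::BTreeMap;
use std::path::Path;

// Pick a protocol name from the file extension, falling back to "raw_file".
// The one-entry extension table is a placeholder for illustration.
fn detect_protocol(path: &Path) -> &'static str {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("ts") => "typescript",
        _ => "raw_file",
    }
}

// Disjoint union of per-file vertex IDs. Prefixing each ID with its file
// path keeps IDs distinct even when two files use the same local ID.
// Assumption: "::" is used as the separator; the real format may differ.
fn coproduct(files: &BTreeMap<String, Vec<String>>) -> Vec<String> {
    files
        .iter()
        .flat_map(|(path, ids)| ids.iter().map(move |id| format!("{path}::{id}")))
        .collect()
}
```

Two files that each contain a local vertex named `root` therefore contribute two distinct vertices to the project schema, one per file path.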
26.6 Git bridge
The git bridge imports entire repository histories into panproto-vcs:
schema git import /path/to/repo HEAD
Each git commit is parsed into a project schema and stored as a panproto-vcs commit, preserving authorship, timestamps, and the parent DAG structure.
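One way to picture how the parent DAG survives the import is the sketch below, which translates commit hashes into new IDs while copying authorship and timestamps. The `GitCommit` and `ImportedCommit` types and the `import_history` function are hypothetical and are not the panproto-vcs data model. The sketch leaves out the per-commit schema parsing and assumes the commits arrive parents-first, as in the output of `git rev-list --reverse --topo-order`.

```rust
use std::collections::HashMap;

// Hypothetical, simplified commit records; not git's or panproto-vcs's types.
struct GitCommit {
    hash: &'static str,
    parents: Vec<&'static str>,
    author: &'static str,
    timestamp: u64,
}

#[derive(Debug, PartialEq)]
struct ImportedCommit {
    id: usize,
    parents: Vec<usize>,
    author: String,
    timestamp: u64,
}

// Import commits in parents-first order, mapping each git hash to a new ID
// so that every imported commit points at the imported form of its parents.
// Returns None if a commit appears before one of its parents.
fn import_history(commits: &[GitCommit]) -> Option<Vec<ImportedCommit>> {
    let mut new_ids: HashMap<&str, usize> = HashMap::new();
    let mut imported = Vec::new();
    for commit in commits {
        let parents = commit
            .parents
            .iter()
            .map(|hash| new_ids.get(*hash).copied())
            .collect::<Option<Vec<usize>>>()?;
        let id = imported.len();
        new_ids.insert(commit.hash, id);
        imported.push(ImportedCommit {
            id,
            parents,
            author: commit.author.to_string(),
            timestamp: commit.timestamp,
        });
    }
    Some(imported)
}
```

A merge commit keeps both of its parents under this translation, so the imported history has the same shape as the original.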
Git operates on text. panproto operates on structure. A git merge is a heuristic three-way text merge that can produce syntactically invalid results. A panproto merge is a categorical pushout on schema graphs that is provably commutative and always produces a valid schema.
26.7 Exercises
Parse a TypeScript file and count the vertex kinds. How many distinct node types does the grammar produce?
Parse the same file with the type-level parser (panproto_protocols::type_system::typescript::parse_ts_types) and compare the vertex counts. What structural information does the full-AST parser capture that the type-level parser misses?
Parse a multi-file project and examine the coproduct schema. How are vertex IDs prefixed? What happens to cross-file import references?
Import a git repository and inspect the panproto-vcs commit log. Does the DAG structure match the git log?
Round-trip a source file: emit(parse(source)). Is the output byte-identical to the input?