26  Full-AST Parsing: Programs as Schemas

Every programming language is a theory in the GAT system, and a program is a model (schema) of that theory. Earlier chapters covered type-level structures: structs, interfaces, and enums. This chapter extends that treatment to full programs: functions, statements, expressions, control flow, and module structure. The principle is the same: represent everything as vertices and edges.

26.1 The principle

A TypeScript file is not just type declarations. It contains functions with bodies, if/else branches, for loops, try/catch blocks, function calls, and expressions. All of these are structural elements you can represent as vertices and edges in a schema graph. The only difference from type-level schemas is the number of vertex kinds and edge rules.
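To make the idea concrete, here is a minimal sketch of an if/else statement as vertices and edges. The types `Vertex`, `Edge`, and `SchemaGraph` are hypothetical stand-ins for illustration, not panproto's actual data model; the kind names follow the naming style of tree-sitter's JavaScript and TypeScript grammars.

```rust
// Hypothetical, simplified types for illustration only;
// they are not panproto's real schema representation.
#[derive(Debug, Clone, PartialEq)]
struct Vertex {
    id: String,
    kind: String, // vertex kind, e.g. "if_statement"
}

#[derive(Debug, Clone, PartialEq)]
struct Edge {
    src: String,
    tgt: String,
    kind: String, // edge kind, e.g. "condition"
}

#[derive(Debug, Default)]
struct SchemaGraph {
    vertices: Vec<Vertex>,
    edges: Vec<Edge>,
}

impl SchemaGraph {
    fn add_vertex(&mut self, id: &str, kind: &str) {
        self.vertices.push(Vertex { id: id.into(), kind: kind.into() });
    }

    fn add_edge(&mut self, src: &str, tgt: &str, kind: &str) {
        self.edges.push(Edge { src: src.into(), tgt: tgt.into(), kind: kind.into() });
    }
}

/// Builds a graph for the shape of `if (x) { f(); } else { g(); }`.
/// Only the top-level structure is modelled; nested nodes are omitted.
fn if_else_example() -> SchemaGraph {
    let mut g = SchemaGraph::default();
    g.add_vertex("n0", "if_statement");
    g.add_vertex("n1", "parenthesized_expression");
    g.add_vertex("n2", "statement_block");
    g.add_vertex("n3", "else_clause");
    g.add_edge("n0", "n1", "condition");
    g.add_edge("n0", "n2", "consequence");
    g.add_edge("n0", "n3", "alternative");
    g
}
```

A type-level schema would use the same two collections; only the set of allowed kinds grows.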

26.2 Theory extraction from tree-sitter grammars

Tree-sitter provides mature, widely deployed parsers for most mainstream programming languages. Each grammar ships with a node-types.json file that describes every possible AST node type and its fields. This file is structurally isomorphic to a GAT, with the following correspondence:

node-types.json                                panproto GAT
---------------------------------------------  ------------------------------------------------
Named node type (e.g. function_declaration)    Sort (vertex kind)
Field with required: true (e.g. body)          Operation (mandatory edge kind)
Field with required: false                     Partial operation (compose with ThPartial)
Field with multiple: true                      Ordered operation (compose with ThOrder)
Supertype (e.g. _expression)                   Abstract sort with subtype inclusions
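The field rows of this correspondence can be sketched as a small classifier. `FieldInfo` and `OperationShape` are hypothetical names introduced here; the sketch assumes the `required` and `multiple` flags have already been read from node-types.json and does not show how extract_theory_from_node_types works internally.

```rust
/// Hypothetical mirror of the two flags on a field entry in node-types.json.
struct FieldInfo {
    required: bool,
    multiple: bool,
}

/// Hypothetical summary of how a field would be treated as an operation.
#[derive(Debug, PartialEq, Eq)]
struct OperationShape {
    /// True when the field may be absent (compose with ThPartial).
    partial: bool,
    /// True when the field holds an ordered sequence (compose with ThOrder).
    ordered: bool,
}

/// Applies the table: `required: false` gives a partial operation,
/// `multiple: true` gives an ordered one. The two flags are independent,
/// so a field can be both partial and ordered.
fn classify(field: &FieldInfo) -> OperationShape {
    OperationShape {
        partial: !field.required,
        ordered: field.multiple,
    }
}
```

Modelling the result as two independent flags, rather than a three-way enum, reflects that a grammar field may be optional and repeated at the same time.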

The extract_theory_from_node_types function reads this JSON and produces a complete Theory:

use panproto_parse::extract_theory_from_node_types;

let meta = extract_theory_from_node_types(
    "ThTypeScriptFullAST",
    tree_sitter_typescript::TYPESCRIPT_NODE_TYPES.as_bytes(),
)?;

// TypeScript has roughly 180 named node types and roughly 60 field names.
println!("{} sorts, {} operations",
    meta.vertex_kinds.len(),
    meta.edge_kinds.len());

26.3 The generic walker

Because the theory is auto-derived from the grammar, the AST walker requires no manual mapping table. One AstWalker implementation handles all 10 supported languages:

  • The tree-sitter node’s kind() is the panproto vertex kind
  • The tree-sitter field name is the panproto edge kind
  • Anonymous tokens (punctuation, keywords) are captured as interstitial text

use panproto_parse::ParserRegistry;

let registry = ParserRegistry::new();
let schema = registry.parse_file(
    std::path::Path::new("src/main.ts"),
    source_bytes,
)?;
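The three rules of the walker can be illustrated on a toy tree. The `Node` type below is a hypothetical stand-in for a tree-sitter node, not the tree-sitter API itself, and the `"child"` fallback for children without a field name is an assumption of this sketch rather than documented AstWalker behaviour.

```rust
/// Hypothetical stand-in for a tree-sitter node.
struct Node {
    kind: &'static str,
    /// False for anonymous tokens such as keywords and punctuation.
    named: bool,
    /// Children paired with their optional field name.
    children: Vec<(Option<&'static str>, Node)>,
}

#[derive(Debug, PartialEq)]
struct Vertex {
    id: usize,
    kind: String,
}

#[derive(Debug, PartialEq)]
struct Edge {
    src: usize,
    tgt: usize,
    kind: String,
}

#[derive(Default)]
struct Output {
    vertices: Vec<Vertex>,
    edges: Vec<Edge>,
}

/// Recursively walks the tree and returns the vertex id assigned to `node`.
fn walk(node: &Node, out: &mut Output) -> usize {
    let id = out.vertices.len();
    // Rule 1: the node kind is used directly as the vertex kind.
    out.vertices.push(Vertex { id, kind: node.kind.to_string() });
    for (field, child) in &node.children {
        // Rule 3: anonymous tokens produce no vertex; a full walker
        // would record them as interstitial text instead.
        if !child.named {
            continue;
        }
        let child_id = walk(child, out);
        // Rule 2: the field name is used directly as the edge kind.
        // The "child" fallback is an assumption of this sketch.
        let kind = match field {
            Some(name) => *name,
            None => "child",
        };
        out.edges.push(Edge { src: id, tgt: child_id, kind: kind.to_string() });
    }
    id
}

fn leaf(kind: &'static str, named: bool) -> Node {
    Node { kind, named, children: Vec::new() }
}

/// Toy tree shaped like `function f() {}`.
fn example_tree() -> Node {
    Node {
        kind: "function_declaration",
        named: true,
        children: vec![
            (None, leaf("function", false)),
            (Some("name"), leaf("identifier", true)),
            (Some("parameters"), leaf("formal_parameters", true)),
            (Some("body"), leaf("statement_block", true)),
        ],
    }
}
```

Nothing in `walk` mentions a specific language, which is why a single implementation can serve every grammar.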

26.4 Interstitial text and round-trip emission

The walker captures the text between named children (keywords like function, punctuation like { and }, whitespace, and comments) as interstitial constraints with byte positions. The emitter collects all fragments (interstitials plus leaf literals), sorts by byte position, and concatenates:

let emitted = registry.emit_with_protocol("typescript", &schema)?;
assert_eq!(emitted, original_source); // exact round-trip
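The emission step described above (collect fragments, sort by byte position, concatenate) can be sketched without any parser. `Fragment`, `frag`, and `emit` are hypothetical names for this illustration, and the byte ranges are written by hand for one fixed source string instead of being produced by a walker.

```rust
/// Hypothetical fragment: a piece of source text and where it started.
struct Fragment {
    start: usize,
    text: String,
}

/// Cuts the byte range `start..end` out of `source`.
fn frag(source: &str, start: usize, end: usize) -> Fragment {
    Fragment { start, text: source[start..end].to_string() }
}

/// Sorts fragments by starting byte and concatenates their text.
fn emit(mut fragments: Vec<Fragment>) -> String {
    fragments.sort_by_key(|f| f.start);
    fragments.into_iter().map(|f| f.text).collect()
}

/// Fragments for the source `function f() { return 1; }`,
/// deliberately listed out of order: leaf literals first,
/// then the interstitial text around them.
fn example_fragments(source: &str) -> Vec<Fragment> {
    vec![
        frag(source, 9, 10),  // leaf: identifier `f`
        frag(source, 22, 23), // leaf: number `1`
        frag(source, 0, 9),   // interstitial: `function `
        frag(source, 10, 22), // interstitial: `() { return `
        frag(source, 23, 26), // interstitial: `; }`
    ]
}
```

The round trip is exact only because the fragments cover every byte of the source exactly once; a gap or an overlap would show up as missing or duplicated text.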

26.5 Multi-file projects

A project with multiple files has a schema that is the coproduct of per-file schemas:

use panproto_project::ProjectBuilder;

let mut builder = ProjectBuilder::new();
builder.add_directory(std::path::Path::new("./src"))?;
let project = builder.build()?;

for (path, protocol) in &project.protocol_map {
    println!("{}: {protocol}", path.display());
}

The ProjectBuilder detects languages by file extension, falls back to raw_file for non-code files, and prefixes vertex IDs with file paths for uniqueness in the coproduct.
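A minimal sketch of these two steps, extension-based detection and ID prefixing, follows. `FileSchema`, `detect_protocol`, and `coproduct` are hypothetical names; the `::` separator and the extension table (apart from the `typescript` and `raw_file` protocol names used in this chapter) are assumptions, not ProjectBuilder's actual behaviour.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Hypothetical per-file schema: vertex ids plus (source, target) edges.
#[derive(Debug, Default, PartialEq)]
struct FileSchema {
    vertices: Vec<String>,
    edges: Vec<(String, String)>,
}

/// Chooses a protocol name from the file extension,
/// falling back to "raw_file" for anything unrecognised.
fn detect_protocol(path: &Path) -> &'static str {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("ts") => "typescript",
        Some("py") => "python", // assumed mapping, for illustration
        _ => "raw_file",
    }
}

/// Disjoint union of per-file schemas. Every id is prefixed with its
/// file path so that equal local ids in different files stay distinct.
fn coproduct(files: &BTreeMap<String, FileSchema>) -> FileSchema {
    let mut out = FileSchema::default();
    for (path, schema) in files {
        let qualify = |id: &String| format!("{path}::{id}");
        out.vertices.extend(schema.vertices.iter().map(|id| qualify(id)));
        out.edges.extend(
            schema.edges.iter().map(|(src, tgt)| (qualify(src), qualify(tgt))),
        );
    }
    out
}

/// Two files that both use the local ids `n0` and `n1`.
fn two_file_example() -> FileSchema {
    let mut files = BTreeMap::new();
    for path in ["src/a.ts", "src/b.ts"] {
        files.insert(
            path.to_string(),
            FileSchema {
                vertices: vec!["n0".into(), "n1".into()],
                edges: vec![("n0".into(), "n1".into())],
            },
        );
    }
    coproduct(&files)
}
```

Because no edge in this sketch ever crosses a file boundary, the result is a plain disjoint union; cross-file references would need extra treatment on top of it.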

26.6 Git bridge

The git bridge imports entire repository histories into panproto-vcs:

schema git import /path/to/repo HEAD

Each git commit is parsed into a project schema and stored as a panproto-vcs commit, preserving authorship, timestamps, and the parent DAG structure.

Note: Why not just use git?

Git operates on text. panproto operates on structure. A git merge is a heuristic three-way text merge that can produce syntactically invalid results. A panproto merge is a categorical pushout on schema graphs that is provably commutative and always produces a valid schema.

26.7 Exercises

  1. Parse a TypeScript file and count the vertex kinds. How many distinct node types does the grammar produce?

  2. Parse the same file with the type-level parser (panproto_protocols::type_system::typescript::parse_ts_types) and compare the vertex counts. What structural information does the full-AST parser capture that the type-level parser misses?

  3. Parse a multi-file project and examine the coproduct schema. How are vertex IDs prefixed? What happens to cross-file import references?

  4. Import a git repository and inspect the panproto-vcs commit log. Does the DAG structure match the git log?

  5. Round-trip a source file: emit(parse(source)). Is the output byte-identical to the input?