graph TB
L0["Level 0: GAT Engine<br/>(hardcoded Rust, ~2000 lines)"]
L1["Level 1: Theory Specifications<br/>(GAT data: ThGraph, ThConstraint, …)"]
L2["Level 2: Concrete Schemas<br/>(models of Level 1 theories)"]
L3["Level 3: Concrete Instances<br/>(models of Level 2 schemas)"]
L0 -->|interprets| L1
L1 -->|parameterizes| L2
L2 -->|governs| L3
14 Self-Description and Building Blocks
Throughout this tutorial, theories, schemas, instances, and migrations have been presented as distinct concepts. This chapter reveals that they are all instances of the same framework, organized into four levels. Each level is defined as a model of the level below it. A concrete schema is a model of a theory specification, which is itself data interpreted by the GAT engine. This self-similarity has a practical payoff: the mechanism for extending the system is the same mechanism the system uses on itself.
Following the GATlab design (Lynch et al. 2024):
14.1 Four levels
14.1.1 Level 0: The GAT engine
Approximately 2,000 lines of Rust implement the core GAT machinery: sorts, operations, equations, theory mappings, composition, and model checking. This is the only part of panproto that isn’t data-driven. It’s the interpreter for everything above.
14.1.2 Level 1: Theory specifications
The theories used throughout this tutorial (ThGraph, ThConstraint, ThConstrainedMultiGraph, and the protocol-specific theories for ATProto, SQL, and so on) are all data at this level. They’re values of type Theory, constructed by the Level 0 engine. Adding a new schema language means adding new Level 1 data; no Rust code changes.
14.1.3 Level 2: Concrete schemas
A schema is a model of a Level 1 theory (Chapter 3 for the definition). Writing an ATProto lexicon means constructing a model of ThConstrainedMultiGraph. Writing SQL DDL means constructing a model of the SQL protocol theory. The schema is governed by the theory it instantiates, just as a value is governed by its type.
14.1.4 Level 3: Concrete instances
The actual data (a Bluesky post, a database row, a Protobuf message) lives at Level 3. An instance is a model of a Level 2 schema. It’s the concrete content that conforms to a specific schema within a specific protocol.
Migrations happen at every level, using the same engine:
- Level 1 to Level 1: A theory mapping transforms one schema language into another
- Level 2 to Level 2: A schema mapping transforms one schema into another within the same protocol
- Level 3 to Level 3: Data migration via
restricttransforms instances to conform to a new schema
The machinery is uniform. The same lift, get, put, and checkExistence operations work at each level.
14.2 ThGAT: panproto can describe its own type system
The theory of GATs is itself a GAT. panproto defines ThGAT: a theory whose models are themselves theories.
ThGAT has:
- Sorts:
Sort,Op,Eq,Theory,Param(s: Sort),Name - Operations:
sort_name,op_name,op_output,op_param,eq_lhs,eq_rhs, and others that specify how sorts, operations, and equations relate within a theory
A model of ThGAT is a concrete theory (ThGraph, ThConstraint, or any protocol theory). This isn’t a metaphor—it’s a literal statement about the GAT formalism. The same Theory type that represents ThGraph also represents ThGAT. The same model-checking machinery that validates a schema against ThGraph validates a theory against ThGAT.
14.2.1 Why self-description matters
Self-description has three practical consequences.
Versioning the theory language. If you need to extend what a “sort” or “operation” can express (say, adding support for dependent sorts with multiple parameters), you can define a new version of ThGAT and construct a theory mapping from the old version to the new one. The migration engine then handles the upgrade of all Level 1 data automatically. Without self-description, extending the theory language would require rewriting the Level 0 engine and manually migrating all existing theory definitions. With it, the extension is a data migration like any other.
Uniform tooling. Because theories are models of ThGAT, all the tooling that operates on models (diff, existence checking, lifting, lenses) can be applied to theories themselves. You can diff two versions of a protocol’s theory, check whether a migration between them exists, and lift schemas through the change. The tooling isn’t specialized per level; it’s generic and uniform.
Formal guarantees at every level. The correctness properties that panproto provides (round-tripping for lenses, completeness of existence checking, correct migration behavior) hold at every level. A theory mapping between Level 1 theories satisfies the same laws as a schema mapping between Level 2 schemas. This uniformity follows directly from the self-description architecture: every level is an instance of the same framework.
ThGAT is a model of itself, which sounds circular. How is the circularity resolved?
The circularity is broken by Level 0: the GAT engine is hardcoded Rust, not data. ThGAT describes what a theory is, but the engine that interprets ThGAT is outside the self-description loop.
14.2.2 The system is closed
There’s an appealing sense in which panproto is closed. The only code that can’t be replaced by data is Level 0 (the GAT engine, which interprets everything above it). Everything at Level 1 and above is defined by the level below and can be extended, versioned, and migrated using the same engine.
This is the sense in which panproto is a general-purpose schematic version control system. It’s not general because it supports every protocol (it doesn’t, yet). It’s general because the mechanism for adding a new protocol is the same mechanism that the existing protocols use, and that mechanism is itself described within the system.
14.3 The building-block catalog
panproto’s 76 protocol definitions are assembled from 27 building-block theories by merging them together (the composition operation from Chapter 9). Each building block captures a structural distinction that affects migration correctness: whether collections are ordered, whether a schema supports sum types (like Protobuf oneof or GraphQL unions), whether references can be recursive, or whether a single vertex can have multiple outgoing edges of the same kind.
The six key building blocks:
| Theory | What it captures |
|---|---|
| ThGraph | Directed binary relationships. The foundation: vertices and edges with src/tgt operations. Almost every protocol starts here. |
| ThMultigraph | Parallel edges between the same vertices. Protocols that allow multiple fields of the same type (JSON Schema, ATProto) need this. |
| ThHypergraph | N-ary relationships. SQL foreign keys spanning multiple columns, or GraphQL arguments, require edges that connect more than two vertices. |
| ThReflexiveGraph | Identity/default edges with equations src(id(v))=v and tgt(id(v))=v. Protocols with default values or self-referential types (recursive JSON, XML) use this. |
| ThCategory | Composable edges with an associativity equation. Protocols with transitive relationships (RDF inference, permission hierarchies) use this. |
| ThConstrainedGraph | Graph + validation constraints. The colimit of ThGraph and ThConstraint. This is the base for most real-world protocols. |
The full catalog of all 27 theories is in Appendix C.
14.3.1 Structure vs. validation
Some schema properties can be expressed directly as sorts, operations, and equations in a theory. Others are inherently computational: you can’t capture “matches this regex” or “is at most 3000 characters” as an equation between algebraic operations.
The distinction matters. Structural properties propagate automatically through migrations and theory composition. Validation constraints must be checked separately.
Structural (captured by theory equations):
- Edge uniqueness: at most one edge between any two vertices (ThSimpleGraph)
- Optionality: a field that can be present or absent, with a round-trip guarantee (ThPartial)
- Recursive types: types that can contain themselves, like a tree node with child nodes (ThRecursion)
- Sum types: a value that is one of several alternatives, like Protobuf
oneof(ThCoproduct) - Reflexivity: every vertex has a self-loop (identity edge), useful for modeling defaults (ThReflexiveGraph)
- Symmetry: if there’s an edge from A to B, there’s also one from B to A (ThSymmetricGraph)
Validation (requires ThConstraint):
maxLength,minLength: arithmetic bounds on string/array sizepattern: regular expression matchingminimum,maximum: numeric range boundsformat(date-time, email, URI): domain-specific format checks
If a property can be expressed as an equation between sorts and operations, should it be structural? What determines when a property belongs in the validation layer instead?
If the property can be expressed as an equation between sorts and operations (e.g., “every vertex has a self-loop,” “foreign key columns belong to foreign key tables”), make it structural. Structural properties propagate through migrations automatically and are checked at compose time. If the property requires arithmetic, string matching, or domain-specific logic (e.g., “length ≤ 3000,” “matches this regex”), it belongs in the validation layer. The test is whether the property can be stated purely in terms of the theory’s algebraic vocabulary.
14.4 Schema-side theories
14.4.1 Graph shapes (6)
| Theory | Sorts | Equations | Captures |
|---|---|---|---|
| ThGraph | Vertex, Edge | Directed binary relationships | |
| ThSimpleGraph | Vertex, Edge(s,t) | Edge uniqueness via dependent sort | |
| ThHypergraph | Vertex, HyperEdge, Label | N-ary relationships | |
| ThReflexiveGraph | Vertex, Edge | src(id(v))=v, tgt(id(v))=v | Identity/default edges |
| ThSymmetricGraph | Vertex, Edge | src(inv(e))=tgt(e), inv(inv(e))=e | Undirected relationships |
| ThPetriNet | Place, Transition, Token | Concurrent structure |
14.4.2 Modifiers (8)
| Theory | Sorts | Equations | Captures |
|---|---|---|---|
| ThConstraint | Vertex, Constraint(v) | Validation constraints | |
| ThMulti | Vertex, Edge, EdgeLabel | Parallel edges | |
| ThInterface | Vertex, Interface | Subtyping/polymorphism | |
| ThOrder | Edge, Position | Ordered collections | |
| ThCoproduct | Vertex, Variant, Tag | variant_of(injection(v))=v | Sum types/unions |
| ThRecursion | Vertex, Mu | unfold(fold(v))=v | Recursive types |
| ThPartial | Vertex, Defined | defined(witness(d))=d | Optionality as structure |
| ThLinear | Edge, Usage | Use-counting (linear/affine) |
14.4.3 Structural (4)
| Theory | Sorts | Captures |
|---|---|---|
| ThNominal | Vertex, Name | Nominal identity |
| ThSpan | Vertex, Span | Diffs/correspondences |
| ThCospan | Vertex, Apex | Merge targets |
| ThOperad | Color, MOperation | Fan-in aggregation |
14.5 Instance-side theories (6)
| Theory | Sorts | Captures |
|---|---|---|
| ThWType | Node, Arc, Value | Tree-shaped data |
| ThFunctor | Table, Row, ForeignKey | Relational/tabular data |
| ThFlat | Node, Field, Value | Key-value data |
| ThGraphInstance | IVertex, IEdge, Value, Vertex, Edge | Graph-shaped data (most general) |
| ThAnnotation | Vertex, Annotation, AnnotationKey, Value | Out-of-band metadata |
| ThCausal | Event, Timestamp, Before(e1,e2) | Causal/temporal ordering |
14.6 How protocols compose building blocks
Protocols compose building blocks via colimit into theory groups:
| Group | Schema theory | Instance theory | Count | Examples |
|---|---|---|---|---|
| A | colimit(Graph, Constraint, Multi) | WType | 30 | ATProto, JSON Schema, OpenAPI |
| B | colimit(Hypergraph, Constraint) | Functor | 11 | SQL, Cassandra, Parquet |
| C | colimit(SimpleGraph, Constraint) | Flat | 9 | Protobuf, Avro, Thrift |
| D | colimit(Graph, Constraint, Multi, Interface) | WType | 16 | GraphQL, TypeScript, Rust |
| E | colimit(Graph, Constraint, Multi) | colimit(WType, Meta) | 9 | XML, HTML, CSS |
| F | colimit(Graph, Constraint, Multi) | GraphInstance | 1 | Neo4j |
Each protocol opts into additional theories (ThOrder, ThCoproduct, etc.) by composing them into its schema theory via colimit.
14.7 Equations as structural invariants
Only 6 of the 27 theories have equations. Each has been type-checked against the theory’s sort declarations. As of v0.6, this type-checking happens automatically at commit time. When you schema commit a theory that includes any of these building blocks, panproto verifies that every equation is well-sorted in the composed theory. If a colimit introduces a sort mismatch (by identifying two sorts that break an equation’s typing), the commit is rejected with a diagnostic pointing to the offending equation.
The equations and their sort-checking:
ThRecursion: unfold(fold(v)) = v. fold(v) has sort Mu; unfold(fold(v)) has sort Vertex. The right side v also has sort Vertex. Both sides match.
ThReflexiveGraph: src(id(v)) = v and tgt(id(v)) = v. id(v) has sort Edge; src(id(v)) has sort Vertex, matching v.
ThSymmetricGraph: src(inv(e)) = tgt(e), tgt(inv(e)) = src(e), inv(inv(e)) = e. All sides have matching sorts (Vertex, Vertex, Edge respectively).
ThCoproduct: variant_of(injection(v)) = v. Both sides have sort Variant.
ThPartial: defined(witness(d)) = d. Both sides have sort Defined.
What happens at commit time when two building blocks are composed via colimit and a sort identification makes an equation ill-typed?
panproto rejects the composition at commit time. The diagnostic names the equation, the sorts involved, and which sort identification caused the conflict. For example, if composing ThReflexiveGraph with a theory that identifies the Edge and Vertex sorts would make src(id(v)) = v ill-typed (because id(v) would need to be both an Edge and a Vertex), the error message pinpoints the identification and the broken equation.
The next chapter covers testing: confirming that compositions, migrations, and lenses behave correctly on concrete data.