14  Self-Description and Building Blocks

Throughout this tutorial, theories, schemas, instances, and migrations have been presented as distinct concepts. This chapter reveals that they are all instances of the same framework, organized into four levels. Each level is defined as a model of the level below it. A concrete schema is a model of a theory specification, which is itself data interpreted by the GAT engine. This self-similarity has a practical payoff: the mechanism for extending the system is the same mechanism the system uses on itself.

Following the GATlab design (Lynch et al. 2024):

14.1 Four levels

graph TB
    L0["Level 0: GAT Engine<br/>(hardcoded Rust, ~2000 lines)"]
    L1["Level 1: Theory Specifications<br/>(GAT data: ThGraph, ThConstraint, …)"]
    L2["Level 2: Concrete Schemas<br/>(models of Level 1 theories)"]
    L3["Level 3: Concrete Instances<br/>(models of Level 2 schemas)"]
    L0 -->|interprets| L1
    L1 -->|parameterizes| L2
    L2 -->|governs| L3
Figure 14.1: The four levels of panproto’s self-description hierarchy.

14.1.1 Level 0: The GAT engine

Approximately 2,000 lines of Rust implement the core GAT machinery: sorts, operations, equations, theory mappings, composition, and model checking. This is the only part of panproto that isn’t data-driven. It’s the interpreter for everything above.

14.1.2 Level 1: Theory specifications

The theories used throughout this tutorial (ThGraph, ThConstraint, ThConstrainedMultiGraph, and the protocol-specific theories for ATProto, SQL, and so on) are all data at this level. They’re values of type Theory, constructed by the Level 0 engine. Adding a new schema language means adding new Level 1 data; no Rust code changes.

14.1.3 Level 2: Concrete schemas

A schema is a model of a Level 1 theory (Chapter 3 for the definition). Writing an ATProto lexicon means constructing a model of ThConstrainedMultiGraph. Writing SQL DDL means constructing a model of the SQL protocol theory. The schema is governed by the theory it instantiates, just as a value is governed by its type.

14.1.4 Level 3: Concrete instances

The actual data (a Bluesky post, a database row, a Protobuf message) lives at Level 3. An instance is a model of a Level 2 schema. It’s the concrete content that conforms to a specific schema within a specific protocol.

Note

Migrations happen at every level, using the same engine:

  • Level 1 to Level 1: A theory mapping transforms one schema language into another
  • Level 2 to Level 2: A schema mapping transforms one schema into another within the same protocol
  • Level 3 to Level 3: Data migration via restrict transforms instances to conform to a new schema

The machinery is uniform. The same lift, get, put, and checkExistence operations work at each level.

14.2 ThGAT: panproto can describe its own type system

The theory of GATs is itself a GAT. panproto defines ThGAT: a theory whose models are themselves theories.

ThGAT has:

  • Sorts: Sort, Op, Eq, Theory, Param(s: Sort), Name
  • Operations: sort_name, op_name, op_output, op_param, eq_lhs, eq_rhs, and others that specify how sorts, operations, and equations relate within a theory

A model of ThGAT is a concrete theory (ThGraph, ThConstraint, or any protocol theory). This isn’t a metaphor—it’s a literal statement about the GAT formalism. The same Theory type that represents ThGraph also represents ThGAT. The same model-checking machinery that validates a schema against ThGraph validates a theory against ThGAT.

14.2.1 Why self-description matters

Self-description has three practical consequences.

Versioning the theory language. If you need to extend what a “sort” or “operation” can express (say, adding support for dependent sorts with multiple parameters), you can define a new version of ThGAT and construct a theory mapping from the old version to the new one. The migration engine then handles the upgrade of all Level 1 data automatically. Without self-description, extending the theory language would require rewriting the Level 0 engine and manually migrating all existing theory definitions. With it, the extension is a data migration like any other.

Uniform tooling. Because theories are models of ThGAT, all the tooling that operates on models (diff, existence checking, lifting, lenses) can be applied to theories themselves. You can diff two versions of a protocol’s theory, check whether a migration between them exists, and lift schemas through the change. The tooling isn’t specialized per level; it’s generic and uniform.

Formal guarantees at every level. The correctness properties that panproto provides (round-tripping for lenses, completeness of existence checking, correct migration behavior) hold at every level. A theory mapping between Level 1 theories satisfies the same laws as a schema mapping between Level 2 schemas. This uniformity follows directly from the self-description architecture: every level is an instance of the same framework.

CautionExercise: The circularity problem

ThGAT is a model of itself, which sounds circular. How is the circularity resolved?

The circularity is broken by Level 0: the GAT engine is hardcoded Rust, not data. ThGAT describes what a theory is, but the engine that interprets ThGAT is outside the self-description loop.

14.2.2 The system is closed

There’s an appealing sense in which panproto is closed. The only code that can’t be replaced by data is Level 0 (the GAT engine, which interprets everything above it). Everything at Level 1 and above is defined by the level below and can be extended, versioned, and migrated using the same engine.

This is the sense in which panproto is a general-purpose schematic version control system. It’s not general because it supports every protocol (it doesn’t, yet). It’s general because the mechanism for adding a new protocol is the same mechanism that the existing protocols use, and that mechanism is itself described within the system.

14.3 The building-block catalog

panproto’s 76 protocol definitions are assembled from 27 building-block theories by merging them together (the composition operation from Chapter 9). Each building block captures a structural distinction that affects migration correctness: whether collections are ordered, whether a schema supports sum types (like Protobuf oneof or GraphQL unions), whether references can be recursive, or whether a single vertex can have multiple outgoing edges of the same kind.

The six key building blocks:

Theory What it captures
ThGraph Directed binary relationships. The foundation: vertices and edges with src/tgt operations. Almost every protocol starts here.
ThMultigraph Parallel edges between the same vertices. Protocols that allow multiple fields of the same type (JSON Schema, ATProto) need this.
ThHypergraph N-ary relationships. SQL foreign keys spanning multiple columns, or GraphQL arguments, require edges that connect more than two vertices.
ThReflexiveGraph Identity/default edges with equations src(id(v))=v and tgt(id(v))=v. Protocols with default values or self-referential types (recursive JSON, XML) use this.
ThCategory Composable edges with an associativity equation. Protocols with transitive relationships (RDF inference, permission hierarchies) use this.
ThConstrainedGraph Graph + validation constraints. The colimit of ThGraph and ThConstraint. This is the base for most real-world protocols.

The full catalog of all 27 theories is in Appendix C.

14.3.1 Structure vs. validation

Some schema properties can be expressed directly as sorts, operations, and equations in a theory. Others are inherently computational: you can’t capture “matches this regex” or “is at most 3000 characters” as an equation between algebraic operations.

The distinction matters. Structural properties propagate automatically through migrations and theory composition. Validation constraints must be checked separately.

Structural (captured by theory equations):

  • Edge uniqueness: at most one edge between any two vertices (ThSimpleGraph)
  • Optionality: a field that can be present or absent, with a round-trip guarantee (ThPartial)
  • Recursive types: types that can contain themselves, like a tree node with child nodes (ThRecursion)
  • Sum types: a value that is one of several alternatives, like Protobuf oneof (ThCoproduct)
  • Reflexivity: every vertex has a self-loop (identity edge), useful for modeling defaults (ThReflexiveGraph)
  • Symmetry: if there’s an edge from A to B, there’s also one from B to A (ThSymmetricGraph)

Validation (requires ThConstraint):

  • maxLength, minLength: arithmetic bounds on string/array size
  • pattern: regular expression matching
  • minimum, maximum: numeric range bounds
  • format (date-time, email, URI): domain-specific format checks
CautionExercise: Deciding what’s structural

If a property can be expressed as an equation between sorts and operations, should it be structural? What determines when a property belongs in the validation layer instead?

If the property can be expressed as an equation between sorts and operations (e.g., “every vertex has a self-loop,” “foreign key columns belong to foreign key tables”), make it structural. Structural properties propagate through migrations automatically and are checked at compose time. If the property requires arithmetic, string matching, or domain-specific logic (e.g., “length ≤ 3000,” “matches this regex”), it belongs in the validation layer. The test is whether the property can be stated purely in terms of the theory’s algebraic vocabulary.

14.4 Schema-side theories

14.4.1 Graph shapes (6)

Theory Sorts Equations Captures
ThGraph Vertex, Edge Directed binary relationships
ThSimpleGraph Vertex, Edge(s,t) Edge uniqueness via dependent sort
ThHypergraph Vertex, HyperEdge, Label N-ary relationships
ThReflexiveGraph Vertex, Edge src(id(v))=v, tgt(id(v))=v Identity/default edges
ThSymmetricGraph Vertex, Edge src(inv(e))=tgt(e), inv(inv(e))=e Undirected relationships
ThPetriNet Place, Transition, Token Concurrent structure

14.4.2 Modifiers (8)

Theory Sorts Equations Captures
ThConstraint Vertex, Constraint(v) Validation constraints
ThMulti Vertex, Edge, EdgeLabel Parallel edges
ThInterface Vertex, Interface Subtyping/polymorphism
ThOrder Edge, Position Ordered collections
ThCoproduct Vertex, Variant, Tag variant_of(injection(v))=v Sum types/unions
ThRecursion Vertex, Mu unfold(fold(v))=v Recursive types
ThPartial Vertex, Defined defined(witness(d))=d Optionality as structure
ThLinear Edge, Usage Use-counting (linear/affine)

14.4.3 Structural (4)

Theory Sorts Captures
ThNominal Vertex, Name Nominal identity
ThSpan Vertex, Span Diffs/correspondences
ThCospan Vertex, Apex Merge targets
ThOperad Color, MOperation Fan-in aggregation

14.5 Instance-side theories (6)

Theory Sorts Captures
ThWType Node, Arc, Value Tree-shaped data
ThFunctor Table, Row, ForeignKey Relational/tabular data
ThFlat Node, Field, Value Key-value data
ThGraphInstance IVertex, IEdge, Value, Vertex, Edge Graph-shaped data (most general)
ThAnnotation Vertex, Annotation, AnnotationKey, Value Out-of-band metadata
ThCausal Event, Timestamp, Before(e1,e2) Causal/temporal ordering

14.6 How protocols compose building blocks

Protocols compose building blocks via colimit into theory groups:

Group Schema theory Instance theory Count Examples
A colimit(Graph, Constraint, Multi) WType 30 ATProto, JSON Schema, OpenAPI
B colimit(Hypergraph, Constraint) Functor 11 SQL, Cassandra, Parquet
C colimit(SimpleGraph, Constraint) Flat 9 Protobuf, Avro, Thrift
D colimit(Graph, Constraint, Multi, Interface) WType 16 GraphQL, TypeScript, Rust
E colimit(Graph, Constraint, Multi) colimit(WType, Meta) 9 XML, HTML, CSS
F colimit(Graph, Constraint, Multi) GraphInstance 1 Neo4j

Each protocol opts into additional theories (ThOrder, ThCoproduct, etc.) by composing them into its schema theory via colimit.

14.7 Equations as structural invariants

Only 6 of the 27 theories have equations. Each has been type-checked against the theory’s sort declarations. As of v0.6, this type-checking happens automatically at commit time. When you schema commit a theory that includes any of these building blocks, panproto verifies that every equation is well-sorted in the composed theory. If a colimit introduces a sort mismatch (by identifying two sorts that break an equation’s typing), the commit is rejected with a diagnostic pointing to the offending equation.

The equations and their sort-checking:

ThRecursion: unfold(fold(v)) = v. fold(v) has sort Mu; unfold(fold(v)) has sort Vertex. The right side v also has sort Vertex. Both sides match.

ThReflexiveGraph: src(id(v)) = v and tgt(id(v)) = v. id(v) has sort Edge; src(id(v)) has sort Vertex, matching v.

ThSymmetricGraph: src(inv(e)) = tgt(e), tgt(inv(e)) = src(e), inv(inv(e)) = e. All sides have matching sorts (Vertex, Vertex, Edge respectively).

ThCoproduct: variant_of(injection(v)) = v. Both sides have sort Variant.

ThPartial: defined(witness(d)) = d. Both sides have sort Defined.

CautionExercise: Composition and type safety

What happens at commit time when two building blocks are composed via colimit and a sort identification makes an equation ill-typed?

panproto rejects the composition at commit time. The diagnostic names the equation, the sorts involved, and which sort identification caused the conflict. For example, if composing ThReflexiveGraph with a theory that identifies the Edge and Vertex sorts would make src(id(v)) = v ill-typed (because id(v) would need to be both an Edge and a Vertex), the error message pinpoints the identification and the broken equation.

The next chapter covers testing: confirming that compositions, migrations, and lenses behave correctly on concrete data.

Lynch, Owen, Kris Brown, James Fairbanks, and Evan Patterson. 2024. GATlab: Modeling and Programming with Generalized Algebraic Theories.” arXiv Preprint arXiv:2404.04837, ahead of print. https://doi.org/10.48550/arXiv.2404.04837.