Algebraic and generalised algebraic theories

Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.

Three identifications carry the technical content of panproto, and the present chapter is where each of them is made precise. A protocol is a generalised algebraic theory. A schema is a model of one. A migration is a morphism of models. The chapter has the job of making those three sentences land as mathematics rather than slogans, so that every later chapter can invoke them without further justification.

The concepts we are going to use are older than panproto by decades, and most of them are older than most working programmers. Algebraic theories in the modern sense are due to Lawvere (1963), whose 1963 thesis reformulated the venerable notion of an algebraic structure — group, ring, vector space — as a small category with finite products, thereby making “model of a theory” mean “product-preserving functor into the category of sets”. Generalised algebraic theories, which extend Lawvere’s framework to accommodate dependent sorts, are due to Cartmell (1978), whose thesis gave them their first formal treatment; the more widely cited journal version is Cartmell (1986). What this book adds is the engineering claim that the schema languages working programmers use every day — JSON Schema, Avro, Protobuf, ATProto lexicons, SQL DDL, tree-sitter grammars — all fit in this framework, and that once they do, the migration and lens machinery of Part II applies to them uniformly. The present chapter states the mathematical side of that claim; Parts II and IV vindicate the engineering side case by case.

We assume the reader has worked through Categories, Functors and natural transformations, Universal properties, and Colimits and pushouts. The present chapter leans on all four.

Algebraic theories

An algebraic theory specifies a kind of mathematical structure by listing its sorts, its operations, and the equations those operations satisfy. Before anyone called it categorical, the idea was standard in universal algebra: a group, say, was a set equipped with a binary operation, an inverse operation, and a distinguished identity element, satisfying the usual axioms; a ring was a set with two binary operations and a few more axioms; a module over a ring was two sets and three operations across them. The common shape — sorts, operations, equations — was the content that universal algebra abstracted and that Lawvere reformulated in categorical terms.

The data of an algebraic theory $T$ are, concretely, a finite list of sorts $s_{1}, \dots, s_{n}$ , a collection of operation symbols each with a declared arity $(s_{i_{1}}, \dots, s_{i_{k}}) \to s_{j}$ , and a collection of equations between terms built from the operations and free variables. The theory of a monoid is small enough to write in one sentence: one sort $M$ , two operations (a constant $e : () \to M$ for the unit and a binary $m : (M, M) \to M$ for the multiplication), and three equations for associativity and two-sided unitality. A monoid in the ordinary sense is any interpretation: a set standing for $M$ , an element for $e$ , a binary function for $m$ , such that the three equations hold in the interpretation. Groups, rings, lattices, and vector spaces are given in the same way, with more sorts or more operations.

The categorical presentation

Lawvere’s insight was that the theory of a monoid — or of any algebraic structure — can be packaged as a small category rather than as a list of symbols. The associated category $Th (T)$ has the finite products of sorts as its objects and equivalence classes of terms (modulo the theory’s equations) as its morphisms. Composition is substitution. The category has finite products by construction, and the operations of $T$ sit inside it as specific morphisms between products of sorts.

The payoff of this repackaging is that a model of $T$ in a category $C$ with finite products is now simply a product-preserving functor $M : Th (T) \to C$ . “Product-preserving” means the functor takes products to products ( $M (A \times B) ≅ M (A) \times M (B)$ ) and the terminal object to the terminal object ( $M (1) ≅ 1$ ). When $C$ is $Set$ , a model is what every undergraduate textbook calls “a monoid” or “a group”: an assignment of a set to each sort and a function to each operation that respects the theory’s equations. The equations are respected automatically, because two terms that are equal under the theory have become the same morphism of $Th (T)$ , and every functor sends equal morphisms to equal morphisms.

The category of monoids, with monoid homomorphisms as morphisms, is the category of product-preserving functors $[Th (Mon), Set]$ ; the category of groups is similarly $[Th (Grp), Set]$ ; and the pattern extends. Every algebraic structure the reader has met in an undergraduate algebra class has a presentation of this form, and the categorical tools from Part I — functors between theories, natural transformations between models, colimits of models — apply uniformly across the family.

For a reader to whom the move from “a list of symbols” to “a small category with finite products” still feels abstract: the motivation is that the categorical presentation packages all the information the theory contains (sorts, operations, equations) into a single object of a kind we already understand. The categorical tools of Part I operate on categories, and a theory reformulated as a category is now something those tools can operate on. The payoff in panproto’s setting is that schema languages with very different surface syntax — JSON Schema, Avro, ATProto lexicons — become specific small categories $Th (T)$ , and migrations between schemas under a protocol become natural transformations in the corresponding functor category. The uniformity is what the engine exploits.

Generalised algebraic theories

Lawvere’s framework is sufficient when every sort is global, meaning that the sort makes sense before any element of it has been chosen. The sort of monoid elements is global in this sense: there is a sort $M$ , a fixed set, and its elements are not indexed by anything else.

Many structures arising in programming, type theory, and databases are not of this form. A sort of “vectors of length $n$ ” depends on a previously chosen natural number $n$ ; one cannot speak of a vector without first committing to its length. A sort of “migrations from schema $S_{1}$ to schema $S_{2}$ ” depends on two previously chosen schemas. Sorts of this kind are called dependent sorts, and they do not fit into Lawvere’s original framework because the framework has no way to say that a sort is indexed by elements of another sort.

A generalised algebraic theory, or GAT, is Cartmell’s extension of Lawvere’s framework to dependent sorts. The data are the same three kinds of thing as before — sort symbols, operation symbols, equations — but each is now declared in a context of free variables whose values may appear in what is being declared. A sort $Vec (n)$ depending on a natural number $n$ is declared in the context $n : N$ . A sort $Hom (A, B)$ depending on two objects of a category is declared in the context $A, B : Ob$ . Operations and equations are contextualised the same way, with their own free variables whose types may themselves be dependent.

The canonical example of a GAT is the theory of a small category. It has two sorts, $Ob$ (global) and $Hom (A, B)$ (dependent on two objects, in the context $A, B : Ob$ ); it has two operations, composition and identity, each contextualised over triples or singletons of objects; and it has the associativity and identity equations we met in Categories, each stated in the appropriate context. A model of this GAT is, unsurprisingly, a small category. The fact that the theory of a category is naturally a GAT rather than a Lawvere theory is a sign that dependent sorts are not an edge case but a routine feature of the theories one actually encounters.

Contextual categories

Cartmell (1986) gives GATs their categorical face. A GAT corresponds to a contextual category, a small category equipped with the additional structure needed to interpret dependently-sorted syntax internally: a distinguished class of context objects, a notion of type in a context, and an extension operation that lets a context be enlarged by a type. The definition is a careful piece of machinery, and we will not repeat it here; Hofmann (1997) is the place to go for the full account.

The move from syntactic GAT to categorical contextual category plays the same role for dependent theories that the Lawvere-theory-as-small-category move plays for algebraic theories: it makes models into functors and migrations into natural transformations, and it brings all the tools of Part I to bear. A model of a GAT $T$ in a contextual category $D$ is a structure-preserving functor $Th (T) \to D$ ; when $D$ is $Set$ with its canonical contextual structure, a model is a family of sets (one for each context-indexed sort, indexed by the interpretations of the context’s variables) together with the functions the operations require.

GATs and type theory

The broader programme GATs belong to is the categorical semantics of type theory, a line of research with its own substantial literature. The simply typed lambda calculus of Church (1940), the polymorphic System F of Girard (1972), the dependently typed framework of Martin-Löf (1984), and the homotopy-type-theoretic extension of The Univalent Foundations Program (2013) can all be cast as generalised algebraic theories, with types, terms, and judgments playing the roles of sorts and operations. Associated categorical work develops the connection in several directions: Hofmann (1997) works out the syntactic side in detail; Jacobs (1999) is the encyclopedic reference; Dybjer (1996) introduces categories-with-families, a variant of contextual categories that many working type theorists find easier to use; Fiore et al. (1999) extends Lawvere’s framework to algebraic theories with variable binding, complementing Cartmell’s direction. Panproto does not use the type-theoretic machinery explicitly, but the engine’s internal representation of theories follows the contextual-category pattern these sources establish.

How panproto-gat represents dependent sorts

The abstract story above has a concrete counterpart in the panproto-gat crate, and the correspondence is worth stating explicitly because every later chapter takes it for granted.

A sort declaration is a Sort value: a name, a list of SortParam entries for its parameters, and a kind (structural, value, coercion, merger). An operation is an Operation: a name, a list of (param_name, SortExpr) inputs, and a SortExpr output. A SortExpr is either a bare name or a head name applied to argument terms (the dependent case); Name(n) and App { name: n, args: vec![] } are two spellings of the same value, enforced by the Eq and Hash impls so that the two spellings never split a hashed carrier.

The parameter names in an operation’s input list are binders: each name is in scope in the sort expressions of later inputs and in the output sort expression. This is the syntactic feature that lets $compose : (x, y, z : Ob, f : Hom (x, y), g : Hom (y, z)) \to Hom (x, z)$ express the middle-object constraint directly: y in the second input position is the same y that appears in the third, tying f and g together.

Typechecking an application f(a_1, \ldots, a_n) walks the declared signature of f while accumulating a substitution $θ$ that maps each input parameter name to the corresponding concrete argument term. The expected sort for argument $a_{i}$ is inputs[i].1.subst(&θ), computed from the running substitution so that earlier arguments flow into later input sorts. The output sort of the whole application is output.subst(&θ). The relevant entry points are typecheck_term and typecheck_theory.

Equations introduce one extra wrinkle. The free variables in an equation are universally quantified, and their sorts are not supplied at the call site; they have to be inferred. panproto-gat does this by collecting sort-expression constraints across both sides of the equation and running Robinson unification on the argument terms. A variable appearing in two uses must have heads that match; the arguments pile up as term equations whose most general unifier fills in the variable sorts. This is why the associativity equation for compose typechecks even though the context ${x, y, z, w : Ob, f, g, h : \dots}$ is never written explicitly.

The free-model generator organises its carrier sets by instantiated sort rather than by head name. A fiber of $Tm (Γ, A)$ is indexed by the pair $(Γ, A)$ , and terms are filed under the fiber determined by their instantiated output sort. The upshot is that a theory with two parallel generators $f, g : Hom (a, b)$ produces ${id (a), id (b), f, g}$ rather than the Cartesian product of candidates a head-indexed generator would produce, because compose(f, g) would require the middle object to be both b and a at once. The fiber matching rules out the non-composite.

Three further features were added to the representation in the 0.37 release. The first is the Implicit tag on each operation input, which marks a parameter as recoverable at the call site by unification rather than supplied explicitly; an operation whose first three inputs are implicit is written app(f, x) at the use site even though its declaration reads app : (G, A, B, f : Tm(G, arrow(A, B)), x : Tm(G, A)) -> Tm(G, B). The second is the SortClosure attribute on each Sort, which marks a sort as closed with a fixed constructor list. The third is Term::Case, the case-analysis term whose coverage the typechecker verifies against the scrutinee’s closure. The three features sit alongside the dependent-sort machinery above and are what make the encodings of Part II read the way a textbook writes them.

Pattern matching on closed sorts

A closed sort names every inhabitant its model can produce: the theory of natural numbers closes $N$ with ${zero, succ}$ ; the theory of booleans closes $Bool$ with ${true, false}$ ; the theory of an inductive list closes $List (A)$ with ${nil, cons}$ . Once a sort is closed, a case analysis on an inhabitant is decidable: every inhabitant is one of finitely many constructors, and a case expression lists one branch per constructor.

panproto-gat represents a case expression as a Term::Case { scrutinee, branches } whose branches field is a Vec<CaseBranch>. Each CaseBranch names the constructor it handles, binds one local name per input of that constructor, and carries a body term. The typechecker checks two conditions. Coverage: the branch list contains exactly the constructors of the scrutinee’s closure, each one exactly once. Uniformity: every branch body typechecks to the same output sort, under a context extended by the branch’s own binders. An incomplete branch list, a duplicated constructor, or two branches with mismatched output sorts are all errors surfaced at the Case site. This is the feature the inductive-type shorthand in panproto-theory-dsl and the inductive! macro in panproto-gat-macros are built on: the shorthand closes a new sort with its constructor list, and every case expression against that sort is now a well-posed question about coverage rather than a pattern-matching guess.

Local let bindings

The third feature added to the term syntax in 0.37 is Term::Let { name, bound, body }, which binds the value of a term to a local name in the body. A let binding is the term-level analogue of the type-theoretic let x = e in b construct: the name x is in scope in b and nowhere else. The typechecker infers the bound term’s sort, records it in a SortScheme that carries both a sort expression and a list of quantified sort metavariables, and extends the local context with the pair $(x, scheme)$ before typechecking the body.

GAT signatures are first-order and their sorts have no free sort-metavariables, so the scheme recorded at a let site in the 0.37 release is always monomorphic: the metavars list is empty and the scheme is a single sort expression wearing polymorphic clothes. The type-level machinery is in place so that a later pass that introduces sort-level metavariables at the signature level can populate the scheme nontrivially without reshaping the term syntax. Until that pass lands, a let binding is a readability aid that saves the user from writing the same term twice; it is not yet a polymorphism primitive.

Directed equations

The engine also carries a third layer of equations, independent of the undirected equation list the theory declares: a list of DirectedEquation values, each an oriented LHS-to-RHS rewrite used for coercion round-trips, for the REPL’s :normalize command, and for the typecheck_equation_modulo_rewrites path that accepts an equation whose two sides share a normal form under the directed system. Directed equations need guarantees an undirected equation list does not: termination (so normalization halts) and confluence (so the normal form is unique). Rewriting, confluence, and termination is the chapter that works out what those guarantees mean and what the engine checks to supply them.

Models and morphisms

With GATs in hand, we can state the categorical facts about schemas and migrations the rest of the book will use.

A model of a GAT $T$ in a contextual category $D$ is a structure-preserving functor $Th (T) \to D$ . The structure preservation is stronger than for ordinary functors, because the functor has to respect the dependent-sort structure as well as composition and identity; in the contextual-category presentation this amounts to respecting the extension operation. In practice $D$ is usually $Set$ with its slice structure, and a model is a family of sets together with the functions the theory’s operations require.

A morphism of models $M \to M^{'}$ is a natural transformation between the functors that preserves the contextual structure. Concretely: for each sort $s$ of the theory, a function from $M$ ’s interpretation of $s$ to $M^{'}$ ’s, compatible with the theory’s operations — which is to say, applying the operation under $M$ and then crossing to $M^{'}$ is the same as crossing to $M^{'}$ first and then applying the operation under $M^{'}$ . The condition is naturality from Functors and natural transformations, generalised to the dependent setting.

Models of $T$ and their morphisms form a category, written $Mod (T)$ . In panproto, the objects of $Mod (T)$ are the schemas under the protocol whose theory is $T$ , and the morphisms are the migrations between schemas. The category we have been calling $Sch_{P}$ is $Mod (T)$ for the GAT $T$ of the protocol $P$ .

Panproto’s equation

The technical content of panproto can now be stated in a single equation, which every chapter of Parts II through V will unpack.

A protocol is a generalised algebraic theory. In Rust, a protocol’s theory is a Theory value from panproto-gat, holding sort declarations, operation declarations with their contexts, and equations, plus a type-checker that verifies well-formedness.

A schema under a protocol $P$ is a model of the GAT corresponding to $P$ . The Rust representation is a Schema value from panproto-schema, built by a SchemaBuilder that supplies interpretations for each sort and operation and validates the result against the protocol’s equations.

A migration from a schema $S_{1}$ to a schema $S_{2}$ (both under the same protocol) is a morphism of models $S_{1} \to S_{2}$ in $Mod (P)$ . The Rust representation is a Migration value from panproto-mig.

Three identifications; three Rust types. That is the equation. The rest of Part II takes the three identifications apart one at a time: Protocols as theories, schemas as instances handles the first two; Theory morphisms and instance migration handles the third.

Why this definition

The claim that a panproto protocol is a GAT is a strong one, and the claim that every schema language panproto supports fits into the framework is stronger still. The framework has to be expressive enough to cover ATProto lexicons, Avro records, Protobuf messages, JSON Schema documents, SQL DDL, FHIR resource profiles, and tree-sitter-derived programming-language grammars, and it has to be restrictive enough to admit the constructions Part II relies on — pushouts in the category of schemas, lifting of instances along morphisms, lenses between schema-indexed families.

Neither requirement is obvious in advance. Cartmell’s 1986 paper establishes the framework and shows that first-order signatures with equational axioms embed into it, which is a start, but the embedding does not automatically extend to the schema languages in production use. What it takes for the framework to cover them, and where the fit is exact versus approximate, is the subject of a dedicated chapter in Part IV; reading that chapter alongside Protocols as theories, schemas as instances is the right way to check the claim against a specific protocol one cares about.

Closing

One foundations chapter remains: Rewriting, confluence, and termination covers the directed-equation machinery that closes out Part I. Part II then opens with protocols as theories, schemas as instances, which translates the equation of the present chapter into working Rust.

Keyboard shortcuts

panproto