Preface
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Data outlives its schemas. Any system that stores records for more than a few months eventually faces a situation in which the records on disk were written against one shape and the code that reads them expects another, and the gap between the two has to be closed by somebody working by hand, under time pressure, without a mechanical check that the thing being done is correct. The tools most developers reach for on either side of this problem — source control on one side, serialisation formats on the other — were designed for something narrower. Git versions byte sequences. Protobuf, Avro, and JSON Schema describe the shape of those sequences. Neither addresses the relationship between the two, which is where things go wrong.
This book presents a different way of arranging those tools. Its central claim is that the schema, the data under it, and the transformations between schema versions can all be treated as members of the same kind of mathematical object. Once the treatment is carried out, the operations one ordinarily does by hand — merging a schema change across two branches, migrating data across a schema version boundary — follow from constructions worked out carefully in a part of mathematics most working developers have not yet seen. The mathematics is the categorical treatment of generalised algebraic theories and their morphisms, due largely to Eilenberg & Mac Lane (1945), Lawvere (1963), and Cartmell (1986), applied to databases by Spivak (2012) and Spivak & Wisnesky (2015). The software discussed in the book, panproto, is an implementation of the arrangement.
Who this book is for
The book is written for developers comfortable reading code in at least one statically typed language, who have not previously picked up a category theory textbook. Every concept is introduced from first principles rather than by reference to a prerequisite course. A reader who follows the foundations chapters closely will be in a position to read papers on functorial data migration, bidirectional lenses, and generalised algebraic theories without further help.
Researchers in formal methods, database theory, type theory, and programming languages will recognise the underlying literature: functorial semantics, bidirectional transformations, generalised algebraic theories. Citations point to the papers whose constructions panproto adopts, and departures from those designs are flagged at the point they occur.
How this book is organised
The book has seven parts plus appendices. Two opening chapters orient the reader and position panproto against the tools a working developer has already used. Part I develops the mathematical foundations — categories, functors and natural transformations, universal properties, colimits, and generalised algebraic theories, in that order. Each concept is introduced through running examples from code before being stated abstractly.
Part II develops panproto’s core constructions. It opens with the identification of a protocol with a generalised algebraic theory and a schema with a model of one, and closes with the corresponding identification of migrations with bidirectional lenses. The intermediate chapters cover the restrict-and-lift pipeline panproto uses to carry data along a migration, the round-trip laws that make a migration safe to run in either direction, and the dependent families panproto calls protolenses.
Part III documents the small pure expression language the engine uses when a migration needs to compute a value that depends on a record. Part IV covers the protocols panproto already knows about, through four case studies — ATProto, Avro, a relational case, and FHIR as a document case — plus the tree-sitter-based auto-derivation that produces protocols for 248 programming languages.
Part V turns to version control. It opens with a careful account of what git actually versions (byte sequences in a Merkle DAG) and what it does not (everything that gives those byte sequences meaning), and uses the gap to motivate the object model panproto-vcs puts in its place. From there the part develops the merge algorithm as a pushout in the category of schemas, the automatic data migration that follows from the inferred schema diff, and the bidirectional bridge between a panproto repository and an ordinary git remote.
Part VI is operational: the WebAssembly boundary, the Rust, TypeScript, and Python SDKs built on top of it, and the command line. Part VII is for contributors: the workspace layout, the CI pipeline, how to extend panproto with a new protocol or lens combinator, and the experimental subsystems. Three appendices close the book: a notation reference, a glossary, and an open-problems list.
How to read this book
A reader with prior category-theory experience can start at Part II and treat Part I as reference. Without that background, the first two parts repay reading in order; the protocol, VCS, and SDK parts that follow can be read in whatever order matches the practical problem at hand. Contributors to panproto’s code should also read Part VII before the others. The book assumes no previous encounter with panproto.
What the software can do today
Panproto is pre-release software. Several of its subsystems are complete enough to support the worked examples of the book but would not yet stand up to production use in isolation. Feature-gated subsystems still under development are catalogued in the chapter on experimental and feature-gated subsystems. The open-problems appendix lists the places where the software is ahead of the theory we have a citation for, and the places where the theory is ahead of the software.
A note on notation
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Categories and their inhabitants are written in a standard way across the mathematical literature, and this book follows the standard without deviation.
A category is written with a script capital letter: $\mathcal{C}$, $\mathcal{D}$, $\mathcal{E}$. Two specific categories come up often enough to earn shorter names. $\mathbf{Set}$ is the category whose objects are sets and whose morphisms are functions. $\mathbf{Hask}$ is the category whose objects are Haskell types and whose morphisms are Haskell functions. Specific categories of panproto objects carry subscripts: $\mathbf{Sch}_P$ is the category of schemas under a protocol $P$.
Objects of a category are written with capital letters: $A$, $B$, $C$. Morphisms are written with lowercase letters and an arrow: $f \colon A \to B$ reads “$f$ is a morphism from $A$ to $B$”. Composition is written right-to-left with a small circle: $g \circ f$ is “$g$ after $f$”, and applied to an argument it computes $g(f(x))$. The right-to-left order matches ordinary function application but disagrees with the left-to-right convention of Unix pipes and F#’s >> operator, which do not appear in this book.
The identity morphism on an object $A$ is $\mathrm{id}_A$. The set of morphisms from $A$ to $B$ is the hom-set $\mathcal{C}(A, B)$; some authors write this $\mathrm{Hom}_{\mathcal{C}}(A, B)$. Two objects $A$ and $B$ are isomorphic, written $A \cong B$, when an invertible morphism between them exists.
Functors are written with capital Latin letters: $F$, $G$, $H$. A functor from $\mathcal{C}$ to $\mathcal{D}$ is written $F \colon \mathcal{C} \to \mathcal{D}$, and its action on a morphism $f$ of $\mathcal{C}$ is $F f$. A natural transformation between two functors is written with a Greek letter and a double arrow: $\alpha \colon F \Rightarrow G$. Its component at an object $A$ is $\alpha_A$.
Displayed mathematics sits on its own line between the usual delimiters, like this:

$$g \circ f \colon A \to C.$$
Commutative diagrams are rendered as squares and triangles in the pattern

$$\begin{array}{ccc}
A & \xrightarrow{\ f\ } & B \\
{\scriptstyle h}\downarrow & & \downarrow{\scriptstyle g} \\
C & \xrightarrow{\ k\ } & D
\end{array}$$

with the understanding that the diagram commutes: the composite $g \circ f$ equals the composite $k \circ h$. Identity morphisms are elided in diagrams by the universal convention, and composites are drawn only where they bear on the argument.
Code appears in fenced blocks. Haskell is used for the mathematical examples of Part I because its type syntax is the closest syntax to the mathematics. Rust is used for the panproto implementation. A language tag on every block identifies which is which:
```haskell
id :: a -> a
id x = x
```
```rust
fn identity<T>(x: T) -> T { x }
```
Short code examples with no caption need no further comment. Where a code example earns a reference back from later prose, it carries a caption beneath it, numbered within the chapter.
Panproto-specific symbols accumulate as the book proceeds. The notation reference appendix collects them in one table with page references.
What panproto is
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A small service stores user records as JSON and keeps the schema for those records in a file called user.schema.json, checked into the same repository as the code. The schema declares a name field and an email field, both required. After a few months, a new requirement arrives: users should have a display name that may differ from the account name. The schema grows a displayName field. Old records in the database do not have a displayName; new records do. A developer writes a small migration script to populate the new field on the old records, and commits the script alongside the schema change.
This is not unusual. Every non-trivial production system runs some version of the same sequence. The specifics vary between ATProto, Apache Avro, FHIR, Postgres DDL, and the two dozen other schema languages in wide use, but the shape is the same: a schema file is edited, a migration runs against existing data, and the system is now operating against a new schema.
The ordinary tools handle the parts of this sequence they were designed for. Git versions the schema file and the migration script as text. Protobuf and JSON Schema validate records against their schemas. Database migration frameworks — Flyway, Alembic, Rails ActiveRecord migrations, the schema.rb pattern, the Cassandra ALTER TABLE DDL — sequence the script’s application against the data store.
What none of these tools do is verify that the migration script is correct. Git sees a text diff to a schema file plus a text diff to a script. Protobuf sees the new schema as a set of new constraints to enforce on incoming records. A migration framework sees the script as a procedure to run. Nothing in the stack verifies that the script produces well-formed records under the new schema, that the script undoes itself when reversed, or that the script’s effect on the records is the same whether it runs once or twice.
The gap is where the trouble lives. The script can go wrong in several small ways that are easy to miss. The displayName default may violate a uniqueness constraint on the new schema, or the migration may forget one of the three tables that contain user records. Alternatively, the migration’s inverse may differ subtly from its forward direction, so that a rollback introduces data loss the original step did not announce. Each of these is a bug whose manifestation is a production incident.
Panproto’s claim is that these are not separate problems, to be solved separately by a different tool for each case. They are all instances of a single operation: the lift of a morphism between two schemas to a function on the data each schema holds. If schemas are mathematical objects — specifically, models of generalised algebraic theories — and if migrations are morphisms of those models, then the correctness properties above are specific, mechanical equations. A library that represents schemas and migrations this way can verify the equations at build time, before a migration touches production data.
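One illustrative rendering, with names ($\mathrm{lift}$, $\mathrm{lower}$) that are placeholders for the vocabulary Part II fixes rather than panproto’s API: a migration $m$ from schema $S$ to schema $T$ determines a forward function $\mathrm{lift}_m$ on records and, when it is reversible, a backward function $\mathrm{lower}_m$, and the correctness properties above become equations of the form

$$\mathrm{lower}_m(\mathrm{lift}_m(x)) = x \qquad \mathrm{lift}_m(\mathrm{lower}_m(y)) = y$$

for every $S$-record $x$ and $T$-record $y$: the first says a rollback loses nothing, the second that a rollback followed by the migration restores what was there. “Verify at build time” means checking equations of this shape mechanically.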
Panproto is an implementation of that representation. It exposes:
- schemas as models of a GAT registered as a protocol, with the model’s well-formedness mechanically checked;
- migrations as morphisms between models, with the migration engine producing actionable diagnostics at each stage where a migration can fail;
- bidirectional transformations (lenses) between schemas, whose round-trip laws the library verifies automatically;
- version control for schemas as first-class objects, so that a schema’s history is trackable independently of the byte history of its file;
- cross-protocol translation, so that a schema or a record written against one protocol can be mapped mechanically to another.
None of these is exotic. Each is a specific construction in category theory applied to a specific problem in working data engineering, and each has a chapter of its own later in the book.
What this book is not
The book is not a category theory textbook. It introduces the mathematical machinery panproto rests on, but it does so by showing how the machinery applies to schemas, migrations, and version control. A reader who finishes the book has seen category theory at the level panproto’s design needs, and no more. For deeper reading, the references that close each chapter point the way.
The book is not a reference manual either. Specific parts of the Rust code are linked to docs.rs where relevant, and the API documentation lives there. The focus of the book is the mathematical and engineering design behind that API.
Nothing in the book is meant as a polemic against existing tools. Git works; Protobuf works; Flyway works. Panproto is what arises when the job is enlarged to include the relationship between schemas and data, and a reader who finishes the book will have a picture of how that enlargement is carried out.
The note on notation above collects the symbols the book uses; Categories now begins the mathematical development.
Categories
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Most of the rest of this book can be read as commentary on a single definition, and the present chapter gives it. The definition is short enough to write down in half a page. Becoming comfortable enough with it to reason by it takes longer, and the chapter is paced for that second requirement rather than the first: the formal statement arrives after the middle, once three worked settings have built up enough familiarity that the definition lands as the name for something already seen.
The concept originates with Eilenberg & Mac Lane (1945), who introduced it to organise the comparison of algebraic structures in topology. It has since become the standard vocabulary for any situation where things and directed operations between things are studied together, and has been picked up in turn by logic, algebra, programming language theory, and the theory of databases. In this book the things are schemas and the operations are the migrations that carry data from one schema to another. By the end of the chapter the words thing, directed operation, and structure will have been pinned down precisely enough to be usable in every later chapter.
A reader who has met categories before can skim the next two sections and pick up at the category of panproto schemas. A reader who has not can expect the chapter to take about half an hour of careful reading; it is the one chapter in the book whose contents everything else depends on, and it repays a slow first pass.
Composition, already familiar
The primitive operation of a category is composition: given two operations whose ends fit, a third operation is obtained that does the work of both in succession. Before writing this down as a definition we will look at three settings in which the operation is already familiar, so that the eventual definition lands as a name for something we can already do rather than as a new imposition.
A single running example will carry the chapter. A team maintains a dataset of people, each person carrying a name and an email address. After a time, the team decides to add a phone field. Later they rename email to contact_email to match a sibling field contact_phone. Each of the two decisions is a migration against the previous schema, and the effect of running them in order is itself a migration, one that transforms the original records in a single step. We will keep this picture in view as we move between settings.
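For concreteness, here is the running example rendered as plain Haskell. The record types and the Maybe-valued phone field are our own modelling choices for illustration, not panproto’s representation:

```haskell
-- Three versions of the people schema, as record types.
data PersonV0 = PersonV0 { v0Name :: String, v0Email :: String }
data PersonV1 = PersonV1 { v1Name :: String, v1Email :: String
                         , v1Phone :: Maybe String }
data PersonV2 = PersonV2 { v2Name :: String, v2ContactEmail :: String
                         , v2ContactPhone :: Maybe String }

-- First migration: add the phone field. Old records get no phone.
addPhone :: PersonV0 -> PersonV1
addPhone (PersonV0 n e) = PersonV1 n e Nothing

-- Second migration: rename email to contact_email.
renameEmail :: PersonV1 -> PersonV2
renameEmail (PersonV1 n e p) = PersonV2 n e p

-- The composite migration, carrying an original record through
-- both transformations in a single step.
migrate :: PersonV0 -> PersonV2
migrate = renameEmail . addPhone
```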
Function composition
The first place a programmer has already met composition is in the wiring together of functions. Given a function that trims whitespace from a string, and a second function that parses a trimmed string into a date, the operation of trimming a raw string and then parsing the result is itself a function from strings to dates. Haskell writes its two components as
```haskell
trim  :: String -> String
parse :: String -> Date
```
and writes the composite directly, using a period:
```haskell
parse . trim :: String -> Date
```
The period is pronounced “after”: parse . trim is parse after trim, and applied to a string s it computes parse (trim s). Rust supplies no operator of this shape, and the composite has to be written out as a new function by hand:
```rust
fn parse_trimmed(s: String) -> Date {
    parse(trim(s))
}
```
Whether we write parse . trim, parse(trim(s)), or parse_trimmed, we are pointing at the same object: a function whose input is a string and whose output is a date. The operator is a matter of convenience; the object it builds is the point.
One subtlety is worth flagging in advance. The order of composition runs right to left: the rightmost function is the one that runs first, which agrees with ordinary function application (parse(trim(s))) and with the standard mathematical notation for composition of morphisms. It disagrees with the left-to-right convention of Unix pipes and of F#’s >> operator, where the leftmost operation runs first. A reader calibrated to pipes will want to recalibrate here; we will use the right-to-left convention throughout and flag it on its first few uses.
Command composition in a shell
The second setting is one any Unix user already has in hand. A shell pipeline chains commands, each consuming the output of the previous one:
```sh
lsof | grep Chrome | wc -l
```
Each command is a process that reads from standard input and writes to standard output. The vertical bar connects the output of one to the input of the next, and the pipeline as a whole is a process that consumes whatever the leftmost command consumes and produces whatever the rightmost command produces. Chain three commands and the result is a single “command” you could bind to a new name; chain four and the same holds. The bar is doing what the period in Haskell does: given two operations whose ends fit, it produces one new operation of the same kind.
The pipeline example is the one this chapter will have to give up on. It motivates the definition well enough, but processes have side effects and buffering behaviour that do not match the rigidity of a mathematical morphism, and pipelines do not in fact form a category in the strict sense. We will come back to this and admit it at the end of the chapter. Recognising the pattern without believing the example supports it fully is a useful habit: category theory tolerates motivating analogies and rejects them as instances in the same breath, and the reader will encounter the move often.
Migration composition
The third setting is the subject of the book. A panproto schema is an object of a specific category, and a migration is a morphism between two schemas of that category. Given the two migrations of the running example — a first migration adding the phone field, a second renaming the email field — their composition is itself a migration, carrying a record of the original schema through both transformations in a single step.
Write the three schemas as $S_0$, $S_1$, and $S_2$ and the two migrations as $m_{01} \colon S_0 \to S_1$ and $m_{12} \colon S_1 \to S_2$. The composition $m_{12} \circ m_{01} \colon S_0 \to S_2$ is the migration that does both things at once, and the engine in panproto-mig treats it as a first-class value of the same Rust type as its two components. What panproto_mig::compose returns is not a script that runs the two in sequence but a single migration, compiled once, which can be applied to an instance of $S_0$ to yield an instance of $S_2$ directly.
The pattern, named
All three settings present a class of objects (types, streams, schemas) together with a class of directed operations (functions, pipes, migrations) between them. Any two operations whose ends meet compose to yield a third in the same class. A category is the mathematical structure that captures exactly this pattern, and the next section writes down what that means.
A reader new to the idiom may reasonably ask why the pattern deserves a name of its own when function composition is already familiar to every programmer. The answer takes more than this chapter to spell out in full, but the short version is that once the structure is named, every result in the category-theoretic literature becomes available to every instance of the structure. The identity law we are about to meet is one any programmer would verify unconsciously for their favourite kind of arrow; what the category definition buys us is that verifying it once, at the level of the abstract structure, hands us a tool kit — functors, natural transformations, colimits, adjunctions — that applies the moment the structure is recognised. The rest of Part I is that tool kit.
The definition
A category consists of four pieces of data subject to two axioms.
The data: a class of objects; for each ordered pair of objects $(A, B)$ a set $\mathcal{C}(A, B)$ of morphisms from $A$ to $B$, with an individual morphism written $f \colon A \to B$; for each triple of objects $(A, B, C)$ a composition operation sending a pair $(g, f)$ with $f \colon A \to B$ and $g \colon B \to C$ to a morphism $g \circ f \colon A \to C$; and for each object $A$ a distinguished identity morphism $\mathrm{id}_A \colon A \to A$.
The axioms require composition to be associative,

$$h \circ (g \circ f) = (h \circ g) \circ f$$

for any three composable morphisms, and the identity morphism to behave as a two-sided unit for composition,

$$\mathrm{id}_B \circ f = f = f \circ \mathrm{id}_A$$

for every $f \colon A \to B$.
The definition will appear in every chapter that follows. It rewards a second reading now.
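For a reader who prefers code to prose, the same four pieces of data transcribe directly into a Haskell typeclass; base ships an equivalent class as Control.Category. A minimal sketch:

```haskell
import Prelude hiding (id, (.))

-- The definition of a category, as a typeclass: `cat a b` is the
-- hom-set from a to b; `id` and `(.)` are the identity and
-- composition data. The two axioms are laws the instance author
-- promises; the compiler does not check them.
class Category cat where
  id  :: cat a a                        -- identity morphism
  (.) :: cat b c -> cat a b -> cat a c  -- composition, right to left

-- Ordinary functions form a category: objects are types,
-- morphisms are functions between them.
instance Category (->) where
  id x      = x
  (g . f) x = g (f x)
```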
Reading the definition
What is notable about the four pieces of data is what they do not say. Nothing in the definition fixes what an object is. An object of a category is whatever the category has chosen to call one, and in different categories an object may be a Haskell type, a panproto schema, a topological space, or a formal proposition; the definition does not pretend to decide between these. The same looseness applies to morphisms: the definition requires only that between any two objects a set of morphisms exist, not that they be functions or continuous maps or any other specific kind of arrow. (The word for a category that respects the set-not-class restriction is locally small. Every category in this book is locally small, and we will not say so again.)
A morphism from $A$ to $B$ is also called an arrow, and the two words are used interchangeably; one says “arrow” while drawing a picture and “morphism” while writing an equation. The set $\mathcal{C}(A, B)$ is called the hom-set from $A$ to $B$, a piece of vocabulary inherited from categories whose morphisms preserve algebraic structure and were therefore called homomorphisms. Some authors write $\mathrm{Hom}_{\mathcal{C}}(A, B)$ or $\mathrm{Hom}(A, B)$ when the ambient category is obvious. We use $\mathcal{C}(A, B)$: it is less to write and no less precise.
Associativity
Associativity says that a chain of three or more composable morphisms has a well-defined composite that does not depend on how we group adjacent pairs. Given

$$A \xrightarrow{\ f\ } B \xrightarrow{\ g\ } C \xrightarrow{\ h\ } D$$

there are two ways to collapse the chain into a single morphism from $A$ to $D$: compose the first two first and get $h \circ (g \circ f)$, or compose the last two first and get $(h \circ g) \circ f$. The axiom says the two morphisms are equal, which is what lets us write $h \circ g \circ f$ without parentheses and without committing to an associativity convention.
Any programmer who has ever chained three functions together has relied on associativity without thinking about it: parse . trim . normalise is unambiguous because function composition is associative, and if it were not, Haskell would have to choose an associativity and the programmer would have to remember the choice. In every category we work with, composition is associative for a reason particular to the category — for functions, two functions agreeing on every input are equal, and both groupings produce the same value on every input — and once we know this is so, the reasoning from that point on is independent of why: we say “composition is associative” and treat the fact as available.
The identity law
The identity law says that composing a morphism with an identity on either side leaves it alone:

$$\mathrm{id}_B \circ f = f = f \circ \mathrm{id}_A.$$

As a picture, the law is the commutativity of the square

$$\begin{array}{ccc}
A & \xrightarrow{\ \mathrm{id}_A\ } & A \\
{\scriptstyle f}\downarrow & & \downarrow{\scriptstyle f} \\
B & \xrightarrow{\ \mathrm{id}_B\ } & B
\end{array}$$

Figure 1.1: the identity law. The two paths from $A$ in the upper-left to $B$ in the lower-right yield $f \circ \mathrm{id}_A$ across the top and $\mathrm{id}_B \circ f$ down the left; the axiom requires both to equal $f$.
The identity morphism plays the role a zero plays in addition or a one in multiplication: the thing a composition leaves untouched. The identity on a type is the function that returns its argument; the identity on a panproto schema is the migration that leaves every record exactly where it was; the identity on a topological space is the map sending every point to itself. A fair question is why the law needs to be stated at all, when every example one meets has an obviously neutral identity. The answer is that the definition of a category does not hand us the identity morphism: we have to supply one when constructing a category, and the law is the check that what we have supplied actually plays the role.
Isomorphisms
A morphism $f \colon A \to B$ is an isomorphism when there is a morphism $g \colon B \to A$ satisfying both

$$g \circ f = \mathrm{id}_A \qquad \text{and} \qquad f \circ g = \mathrm{id}_B.$$

The morphism $g$, when it exists, is uniquely determined by $f$; we call it the inverse of $f$ and write $f^{-1}$. Two objects $A$ and $B$ are isomorphic, written $A \cong B$, when some isomorphism between them exists.
Isomorphism is the category’s own notion of sameness. Two isomorphic objects are indistinguishable from within the category: any construction expressible in its morphisms will treat them identically. In the category of sets, isomorphism coincides with the existence of a bijection. In the category of Haskell types, two types are isomorphic if they have the same inhabitants up to relabelling. In the category of panproto schemas, two schemas are isomorphic when a migration goes each way and the two composites are identities — they are different names for the same structural content. Recognising which morphisms of a category are isomorphisms, and which are not, is usually the single most informative thing one can say about what the category is. The isomorphism question will reappear in every subsequent chapter, and it is the starting point of Bidirectional lenses, where the failure of most schema migrations to be invertible forces a weaker and more useful notion.
The three examples, checked
We now have a definition and three informal examples. Whether the examples really are instances of the definition is a question the definition was written to make answerable, and we now answer it for each.
Haskell types and functions
The category $\mathbf{Hask}$ has Haskell types as its objects and Haskell functions as its morphisms. Composition is the period from the standard library:

```haskell
(.) :: (b -> c) -> (a -> b) -> (a -> c)
(f . g) x = f (g x)
```
The identity on any type a is the polymorphic function

```haskell
id :: a -> a
id x = x
```

which simply returns its argument.
Associativity holds because Haskell functions are equal when they agree on every argument, and both groupings of a three-function chain agree on every argument: for every $x$,

$$((h \circ g) \circ f)\,x \;=\; h\,(g\,(f\,x)) \;=\; (h \circ (g \circ f))\,x.$$

The identity law holds by the same reasoning, since $\mathrm{id}\,(f\,x) = f\,x = f\,(\mathrm{id}\,x)$.
A parenthetical for readers who know some Haskell: $\mathbf{Hask}$ as stated here is an idealisation. Real Haskell admits non-termination, exceptions, and the polymorphic bottom value undefined; taking these into account gives a more delicate structure sometimes written $\mathbf{Hask}_\bot$. The chapters that use $\mathbf{Hask}$ use the idealised form, and the subtlety does not bear on any argument we will make. Barr & Wells (1990) and Riehl (2016) both discuss the fuller structure for a reader who wants it.
Panproto schemas and migrations
Fix a panproto protocol $P$. The category $\mathbf{Sch}_P$ has the schemas under $P$ as its objects and migrations between them as its morphisms. The Rust representation of an object is a value of the Schema type from panproto-schema; a morphism is a Migration value from panproto-mig. Composition is panproto_mig::compose, and the identity migration on a schema is the migration whose lift function returns its input untouched.
The claim that $\mathbf{Sch}_P$ is a category is therefore a claim about how the crate is arranged. Associativity is checked by the test suite: compose three migrations, compute the composite either way, and compare the lifted records on every input in a sampled state space. The identity law is checked similarly. Both properties hold by construction — compose was written so that they would — but the tests are what catch a bug in compose before it propagates.
Back to the running example. The team’s composition $m_{12} \circ m_{01}$ is a morphism in $\mathbf{Sch}_P$ whose lift carries an $S_0$-record into the corresponding $S_2$-record. In Rust:

```rust
let m_01 = /* add-phone migration */;
let m_12 = /* rename-email migration */;
let m_02 = panproto_mig::compose(&m_12, &m_01)?;
```
The arguments are in right-to-left order, matching the mathematical convention. What compose returns is a single Migration value whose lift, applied to any $S_0$-record, yields the corresponding $S_2$-record; running the two migrations in sequence by hand would produce the same records at twice the cost. Every chapter of Part II is in one way or another about operations on $\mathbf{Sch}_P$.
Unix pipelines, and why the example gives up
The pipeline example of §1 is not, strictly speaking, a category. Three obstacles stand in the way.
A process has side effects. make clean | grep foo does not merely transform bytes: it can delete files. Two pipelines that produce identical byte sequences can differ in what they write to disk, what signals they send, what they log, so there is no equivalence on processes under which composition is well-defined and identity neutral.
A process has buffering behaviour. cat file | head -n 1 may kill cat with SIGPIPE before it has read the file in full, depending on buffer sizes. Reordering commutative-looking operations can change what runs and what does not, so the composition operation is not cleanly a function of the bytes in and bytes out.
A process imitates the identity only up to an equivalence that ignores timing and buffer boundaries. The idealised cat of the motivating example is the identity on byte streams; the actual executable introduces a process that reads, buffers, and writes, none of which a mathematical identity does.
The pipeline example therefore motivates the definition as illustration rather than as instance. Composition-as-primitive is already present in the working vocabulary of anyone who has used a shell, which is what we wanted from the example; that the analogy does not survive close inspection is a pattern we will meet often in category theory and is a habit worth cultivating. When we are told that something forms a category, the first response is to check; the second is to say, even when it does not, what structure it does have and what the failure tells us.
The category of schemas
The category that will carry the rest of the book is $\mathbf{Sch}_P$, the schemas under a fixed protocol $P$.
A schema, in this book, means a model of a generalised algebraic theory in the sense of Algebraic and generalised algebraic theories. The class this covers is much wider than the word “schema” might suggest to a database engineer. A JSON Schema document is a schema; an ATProto lexicon is a schema; an Apache Avro record definition is a schema; a SQL DDL file is a schema; a Rust source file, parsed through a tree-sitter grammar, is a schema; a FHIR resource profile is a schema. The claim developed in GATs and vindicated case by case in Part IV is that all of them are instances of the same mathematical object, and that migrations between them can therefore be handled uniformly by a single engine rather than per-format by many.
A migration from $S$ to $T$ is a structure-preserving map between the two models, in a sense Theory morphisms and instance migration will make precise. For the present chapter we need only the categorical shape: whatever a migration is, it is the kind of operation the definition of $\mathbf{Sch}_P$’s morphisms demands.
Every construction of Part II takes place inside $\mathbf{Sch}_P$. Protocol composition is a colimit in a related category, handled in Colimits and pushouts. Functorial data migration, covered in Morphisms and migration, is a triple of functors between $\mathbf{Sch}_P$ and $\mathbf{Sch}_Q$ induced by a protocol morphism $F \colon P \to Q$. Bidirectional lenses enrich the morphisms of $\mathbf{Sch}_P$ with a controlled kind of inverse, and are developed in Lenses. Schematic version control uses pushouts in $\mathbf{Sch}_P$ to merge branches, which is the subject of Merge as pushout. None of these constructions makes sense unless $\mathbf{Sch}_P$ is, in fact, a category; the claim that it is was the subject of this chapter.
Further reading
The categorical literature is large and uneven in register. The reading list below is short and ordered from most accessible to most demanding.
Lawvere & Schanuel (2009), Conceptual Mathematics, is the gentlest book-length introduction, written for a reader who has not studied university mathematics. Awodey (2010), Category Theory, is a careful graduate introduction example-rich enough to serve alongside this book’s foundations chapters. Leinster (2014), Basic Category Theory, is shorter, more opinionated, and freely available on arXiv; it is our recommendation to a reader who finds the register of this chapter too slow. Riehl (2016), Category Theory in Context, is the best single reference for anyone continuing past the basics. Mac Lane (1998), Categories for the Working Mathematician, is the canonical reference text, demanding but authoritative.
For category theory with programming examples as motivation, Barr & Wells (1990), Category Theory for Computing Science, and Rydeheard & Burstall (1988), Computational Category Theory, are the two classical sources; the latter gives the same material as executable ML code. Milewski (2014), the blog series underlying most of this book’s pedagogical moves, is the closest contemporary work in register and is recommended for a reader who wants an alternative pass through the same ideas.
Closing
The next chapter introduces functors, the structure-preserving maps between categories and the setting in which most of Part I’s later constructions are stated.
Functors and natural transformations
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The previous chapter left us with a collection of categories — $\mathbf{Set}$, $\mathbf{Hask}$, the category $\mathbf{Sch}_P$ of panproto schemas under a fixed protocol — and no vocabulary for comparing them. The present chapter supplies that vocabulary. A functor is the right shape of map between two categories, one that moves objects to objects and morphisms to morphisms in a way that preserves composition and identity. A natural transformation is the right shape of map between two functors that share a source and a target, one that is compatible with whatever morphisms the source category provides.
Both concepts are due to Eilenberg & Mac Lane (1945), who introduced natural transformations first and had to introduce functors and categories along the way to make the definition of naturality go through. For a working developer the payoff in this book is specific. Panproto’s migration engine is built around a functor sending each schema to the set of its instances and each migration to the function that lifts those instances. Bidirectional lenses are natural transformations between two such functors. Every claim the engine makes about data being preserved across a migration is a functoriality claim stated in the language we now introduce.
Functors
A functor $F$ from a category $\mathcal{C}$ to a category $\mathcal{D}$ is a pair of assignments subject to two laws.
The object part of $F$ sends each object $A$ of $\mathcal{C}$ to an object $F A$ of $\mathcal{D}$. The morphism part sends each morphism $f \colon A \to B$ of $\mathcal{C}$ to a morphism $F f \colon F A \to F B$ of $\mathcal{D}$. For every composable pair $f \colon A \to B$ and $g \colon B \to C$ we require the image of the composite to equal the composite of the images,

$$F(g \circ f) = F g \circ F f,$$

and for every object $A$ we require the image of its identity to be the identity on its image,

$$F(\mathrm{id}_A) = \mathrm{id}_{F A}.$$

Both laws are equations in $\mathcal{D}$: the composition and identity appearing inside $F$ on the left-hand sides belong to $\mathcal{C}$, those on the right to $\mathcal{D}$, and the functor is what lets us put the two sides into the same equation at all.
A reader approaching the definition for the first time may wonder why the laws pin the functor down where they do. The mechanical answer is that the object part and the typing of the morphism part already determine a great deal: a morphism $f \colon A \to B$ in $\mathcal{C}$ is forced to go to a morphism $F f \colon F A \to F B$ in $\mathcal{D}$, and that constraint by itself is strong. What the laws add is that the structure within each hom-set — how morphisms compose, which morphism is the neutral one — must also travel across. Omit the composition law and a functor could reassign composites independently of their components; omit the identity law and a functor could renominate which morphism plays the neutral role. Neither kind of map would be useful.
The composition law has a pleasant pictorial reading. It says the square

$$\begin{array}{ccc}
F A & \xrightarrow{\ F f\ } & F B \\
\Big\| & & \downarrow{\scriptstyle F g} \\
F A & \xrightarrow{\ F(g \circ f)\ } & F C
\end{array}$$

Figure 2.1: the composition law as a commuting triangle (here drawn as a degenerate square). The top-then-right path composes $F g$ after $F f$; the bottom path is the single morphism $F(g \circ f)$; the law equates them.

commutes in $\mathcal{D}$, and the identity law can be drawn similarly. Diagrams of this shape are how category-theoretic arguments are most often written down, and reading off the two paths of a square as the two sides of an equation is a habit worth acquiring.
The identity functor and functor composition
For every category $\mathcal{C}$ there is a functor $\mathrm{Id}_{\mathcal{C}} \colon \mathcal{C} \to \mathcal{C}$ whose object and morphism parts are both the identity. Its functor laws reduce to $g \circ f = g \circ f$ and $\mathrm{id}_A = \mathrm{id}_A$ in $\mathcal{C}$, each of which is immediate. The identity functor is the neutral element we will need when functors themselves compose, which they do: given $F \colon \mathcal{C} \to \mathcal{D}$ and $G \colon \mathcal{D} \to \mathcal{E}$, the composite $G \circ F \colon \mathcal{C} \to \mathcal{E}$ acts as $A \mapsto G(F A)$ on objects and $f \mapsto G(F f)$ on morphisms. Both laws for the composite follow by chaining the laws for $F$ and $G$.
A reader who has been tracking the categorical theme may notice that functors between categories themselves form the morphisms of a category whose objects are categories. That category is usually called $\mathbf{Cat}$, and it will reappear in Colimits and pushouts when protocols get composed by pushout. For now we need only that functor composition is associative and unit-respecting — which it is, by pointwise application of the facts about composition in the categories involved.
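Jumping briefly ahead to the next section’s setting: in Haskell the composite of two functors is again a functor, a fact base packages as Data.Functor.Compose. A small illustration (the name deepen and the choice of [] and Maybe are ours):

```haskell
import Data.Functor.Compose (Compose(..))

-- Compose [] Maybe is the composite functor: a list of optional
-- values. Its fmap is fmap-of-fmap, exactly the composite functor's
-- morphism part acting through both layers.
deepen :: (a -> b) -> Compose [] Maybe a -> Compose [] Maybe b
deepen = fmap

-- ghci> getCompose (deepen (+1) (Compose [Just 1, Nothing, Just 3]))
-- [Just 2,Nothing,Just 4]
```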
Functors in Haskell
Haskell makes the definition of a functor into an ordinary piece of its typeclass system:

```haskell
class Functor f where
  fmap :: (a -> b) -> f a -> f b
```
A type constructor f becomes an instance of Functor when a polymorphic function fmap is supplied sending each a -> b to a corresponding f a -> f b. The object part of the functor is f itself, viewed as an operation on types; the morphism part is fmap, viewed as an operation on functions.
The list type constructor is the archetypal example. It sends a type a to the type of lists [a], and sends a function g to the function that maps g over every element of an input list:
```haskell
instance Functor [] where
  fmap g []     = []
  fmap g (x:xs) = g x : fmap g xs
```
The identity law, fmap id xs = xs, holds by induction on the list: the empty list stays empty, and consing id x onto a recursive call gives back x consed onto the original tail. The composition law, fmap (g . f) xs = fmap g (fmap f xs), holds by an analogous induction. Haskell programmers rely on both equations so often that they usually go unwritten, but writing them down is just making the functor laws explicit.
A second example runs along the same lines:
```haskell
instance Functor Maybe where
  fmap g Nothing  = Nothing
  fmap g (Just x) = Just (g x)
```
The Maybe constructor sends a type a to the type of possibly an a, and fmap threads a function through the Just case while leaving Nothing fixed. Every typeclass instance that claims the name Functor in Haskell is asserting the two laws, and a library that supplies an fmap that violates either of them is considered buggy. The convention is strong enough that the Haskell community refers to them simply as “the functor laws”, without having to say which functor.
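The two laws also state themselves naturally as executable properties. A sketch using QuickCheck; the property names are ours, and Fun is QuickCheck’s generator of inspectable functions:

```haskell
import Test.QuickCheck

-- The identity law: fmap id leaves any Maybe value unchanged.
prop_fmapId :: Maybe Int -> Bool
prop_fmapId m = fmap id m == m

-- The composition law: mapping a composite function equals
-- mapping its two components in sequence.
prop_fmapCompose :: Fun Int Int -> Fun Int Int -> Maybe Int -> Bool
prop_fmapCompose (Fn g) (Fn f) m =
  fmap (g . f) m == fmap g (fmap f m)
```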
The instance functor in panproto
Haskell’s examples are functors from to itself. The more interesting case, and the one that matters most for this book, is a functor between two genuinely different categories. Panproto’s engine is built around one.
Fix a protocol $P$. The category $\mathbf{Sch}_P$ has panproto schemas as its objects and migrations as its morphisms. Let $\mathbf{Set}$ be the category of sets and functions. Consider the assignment

$$\mathrm{Inst} \colon \mathbf{Sch}_P \to \mathbf{Set}$$

whose object part sends a schema $S$ to the set $\mathrm{Inst}(S)$ of its instances, and whose morphism part sends a migration $m \colon S \to T$ to the function $\mathrm{Inst}(m) \colon \mathrm{Inst}(S) \to \mathrm{Inst}(T)$ that lifts instances of $S$ into instances of $T$. The Rust representation of the object part is panproto_inst::Instance parameterised by a schema; the morphism part is the lift function in panproto_mig::lift.
The claim is that $\mathrm{Inst}$ is a functor, and we owe the reader an unpacking of what the two laws say in this setting.
The composition law demands that, given two composable migrations $m_{01} \colon S_0 \to S_1$ and $m_{12} \colon S_1 \to S_2$, lifting along the composite equals lifting first along $m_{01}$ and then along $m_{12}$. Back to the running example from the previous chapter: if we add phone to an $S_0$-record and then rename email to contact_email, we end up with the same record we would get by applying the composed “add-phone-and-rename-email” migration in one step. In Rust this holds by construction, because panproto_mig::compose is written so that the lift of a composite is the composition of the lifts; were it not so written, a careful reader running the two migrations in different orders would get different records, which would be worse than useless.
The identity law is the simpler of the two. Lifting along an identity migration returns its input, because the identity migration was constructed to make this hold: its lift is, concretely, the Instance-typed identity function.
Treated together, the two laws are not decorative but load-bearing. The engine uses functoriality to prove that different orderings of the same sequence of migrations produce the same output, and the proof is exactly the equation $\mathrm{Inst}(m_{12} \circ m_{01}) = \mathrm{Inst}(m_{12}) \circ \mathrm{Inst}(m_{01})$ applied pointwise. Without it, “apply this migration, then that one” would not reliably agree with “apply the composite”, and a user chaining migrations would be at the mercy of the engine’s choice of evaluation order.
At this point an experienced reader might object that we invoked migration composition in the previous chapter without mentioning functoriality. The gap was pedagogical: composition of migrations is easy to state at the level of $\mathbf{Sch}_P$ alone, while the claim that lifting is functorial requires a second category ($\mathbf{Set}$) in which to land the lifted data and a comparison between the two. We needed the language of this chapter before the claim could be stated.
Natural transformations
A functor is a map between categories. The next question, once we have two functors sharing a source and a target, is how to compare them.
A natural transformation from a functor $F$ to a functor $G$, both $\mathcal{C} \to \mathcal{D}$, is an assignment sending each object $A$ of $\mathcal{C}$ to a morphism

$$\alpha_A \colon F A \to G A$$

in $\mathcal{D}$. The morphism $\alpha_A$ is called the component of $\alpha$ at $A$. The assignment is required to satisfy one compatibility condition, called naturality: for every morphism $f \colon A \to B$ of $\mathcal{C}$, the square

$$\begin{array}{ccc}
F A & \xrightarrow{\ \alpha_A\ } & G A \\
{\scriptstyle F f}\downarrow & & \downarrow{\scriptstyle G f} \\
F B & \xrightarrow{\ \alpha_B\ } & G B
\end{array}$$

Figure 2.2: the naturality square. The two paths from $F A$ to $G B$ are $G f \circ \alpha_A$ (across the top, down the right) and $\alpha_B \circ F f$ (down the left, across the bottom), and the condition requires them to agree.

commutes in $\mathcal{D}$. The notation for “$\alpha$ is a natural transformation from $F$ to $G$” is $\alpha \colon F \Rightarrow G$, with a double arrow.
The adjective natural is worth pausing on, because it is not a vague gesture of approval. A family of morphisms $(\alpha_A)_{A}$ is called a natural transformation only when the square above commutes for every morphism of $\mathcal{C}$. A family that fails this condition is not a natural transformation and has no standard name, because the ambient theory does not support the families that fail it. The word earns its keep.
A first-time reader may feel the naturality square is opaque. The intuition to take from it is that the square forces the transformation to be compatible with the morphisms of the source category: we can apply the $\mathcal{C}$-morphism’s image under $F$ first and then cross to the $G$ side, or cross to the $G$ side first and then apply the $\mathcal{C}$-morphism’s image under $G$ there, and in either case land at the same point of $\mathcal{D}$. A transformation that fails naturality privileges one side of the square over the other and loses information about the morphism structure of $\mathcal{C}$. Concrete examples below make the content of the condition easier to see.
Naturality in Haskell
The Haskell function
```haskell
safeHead :: [a] -> Maybe a
safeHead []    = Nothing
safeHead (x:_) = Just x
```
is a natural transformation from the list functor to the Maybe functor. Its component at a type a is the function safeHead :: [a] -> Maybe a. The naturality square, specialised to this pair of functors and this family of components, demands that for every function g :: a -> b,

$$\mathit{fmap}\ g \circ \mathit{safeHead} \;=\; \mathit{safeHead} \circ \mathit{fmap}\ g.$$
In words: taking the head of a list and then applying g inside the resulting Maybe is the same as applying g to every element of the list first and then taking the head. Both sides yield Just (g x) when the list starts with x and Nothing when the list is empty. The equation is not an accident and it is not special to safeHead: the parametricity theorem of Reynolds (1983), and the readable consequence developed by Wadler (1989), imply that every polymorphic function of the right type shape is a natural transformation. A Haskell programmer writing swap, fst, reverse, or concat gets naturality for free from the type system, without having to prove anything. The naturality condition, for a category-theoretically-inclined programmer, is both a correctness property and a consequence of the type alone.
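Parametricity guarantees the equation, but nothing stops us checking it anyway. A QuickCheck rendering of the square (the property name is ours):

```haskell
import Test.QuickCheck

safeHead :: [a] -> Maybe a
safeHead []    = Nothing
safeHead (x:_) = Just x

-- The naturality square for safeHead, as an executable property:
-- crossing to Maybe and then mapping agrees with mapping over the
-- list and then crossing to Maybe.
prop_safeHeadNatural :: Fun Int Int -> [Int] -> Bool
prop_safeHeadNatural (Fn g) xs =
  fmap g (safeHead xs) == safeHead (fmap g xs)
```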
Naturality in panproto
A bidirectional lens between two schemas is, in categorical terms, a natural transformation between two functors from to . The development will come in Bidirectional lenses; the preview to take from this chapter is that the forward component get of a lens is the natural-transformation side of the story, and the round-trip law put(get(a), a) = a is what the naturality square commutes to in the lens setting. The automation Haskell’s type system provides for polymorphic functions does not apply here, because the categories of schemas involved are not polymorphically typed in the same way is; panproto’s lens-law checker therefore verifies the naturality condition explicitly, by property-based testing in panproto_lens::laws.
This is one of the places where the book’s claim that programming-language theory and data-migration theory are two presentations of the same subject becomes concrete. The naturality equation a Haskell library documents in its README, and the naturality equation panproto’s crate verifies at build time, are the same equation applied in different categories.
The functor category
For any two categories $\mathcal{C}$ and $\mathcal{D}$, the functors $\mathcal{C} \to \mathcal{D}$ are the objects of a new category, and natural transformations between them are its morphisms.
Composition of natural transformations is componentwise. Given $\alpha \colon F \Rightarrow G$ and $\beta \colon G \Rightarrow H$, the composite $\beta \circ \alpha \colon F \Rightarrow H$ has component

$$(\beta \circ \alpha)_A = \beta_A \circ \alpha_A$$

at each object $A$. Naturality of the composite follows from pasting the two naturality squares together along their shared $G$-edge; the outer rectangle commutes because both squares do.
This category is written $[\mathcal{C}, \mathcal{D}]$, or sometimes $\mathcal{D}^{\mathcal{C}}$. Its identity morphisms are the natural transformations whose components are all identities. Its two axioms are inherited from $\mathcal{D}$: composition of components is associative and identity-unital in $\mathcal{D}$, so the same holds componentwise.
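In Haskell, where a natural transformation is a polymorphic function, componentwise composition is ordinary function composition at every type. A sketch (the Nat alias and vcomp are ours, not a library API):

```haskell
{-# LANGUAGE RankNTypes #-}

-- A natural transformation between functors f and g: one component
-- per type, packaged as a single polymorphic function.
type Nat f g = forall a. f a -> g a

safeHead :: Nat [] Maybe
safeHead []    = Nothing
safeHead (x:_) = Just x

maybeToList :: Nat Maybe []
maybeToList Nothing  = []
maybeToList (Just x) = [x]

-- Vertical composition is componentwise: at every type, compose
-- the two components as ordinary functions.
vcomp :: Nat g h -> Nat f g -> Nat f h
vcomp beta alpha = beta . alpha

-- A composite transformation from lists to lists, keeping at most
-- the head element.
takeHead :: Nat [] []
takeHead = vcomp maybeToList safeHead
```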
Functor categories are the setting in which most category-theoretic constructions are compared in the rest of Part I. Universal properties, the subject of the next chapter, are phrased as existence and uniqueness conditions on natural transformations into or out of specific functors. Limits and colimits will be objects defined up to isomorphism by the naturality of certain comparison maps. The whole language of “characterising a construction up to isomorphism by its universal property” lives in functor categories.
Further reading
Mac Lane (1998), chapter I (“Categories, Functors, and Natural Transformations”), is the canonical treatment. Awodey (2010) introduces functors in chapter 1 and dedicates chapter 7 (“Naturality”) to natural transformations, functor categories, and the Yoneda lemma. Riehl (2016), chapter 1, covers the same material in a modern register and is our recommendation for a reader wanting a single reference. Leinster (2014), chapter 1, is the shortest of the four and the one to try first if the others feel ceremonial.
For the parametricity-as-naturality story specifically, Reynolds (1983) is the technical root and Wadler (1989) the readable consequence. Both repay the detour.
Closing
The next chapter introduces universal properties, the pattern by which a categorical construction is pinned down up to isomorphism by the morphisms into or out of it.
Universal properties
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Any construction in a category can be specified in two ways: by saying what it is (the Cartesian product of two sets is a set of ordered pairs; the disjoint union is a set of tagged alternatives), or by saying how morphisms into or out of it look. The first way is concrete; the second is abstract. The abstract way turns out to be more useful, because it pins the construction down up to isomorphism without committing to any particular representation, and because a single abstract specification can be instantiated in many different categories at once. The abstract way is what we mean by a universal property.
The present chapter develops the two smallest examples: products and coproducts. Generalisation to limits and colimits, which the next chapter covers, is immediate once the pattern of argument is clear in the small case; and essentially every construction in Parts II and V of this book will turn out to be universal.
Products
We start with the example every reader already knows: the Cartesian product of two sets.
Ordinarily one defines $A \times B$ by declaring its elements to be the ordered pairs $(a, b)$ with $a \in A$ and $b \in B$, and one observes that two projection functions come along for free: $\pi_1 \colon A \times B \to A$ and $\pi_2 \colon A \times B \to B$. That definition is perfectly adequate for computations in set theory, and one can hardly fault it. A category theorist would nonetheless ask a different question: what role does $A \times B$ play among all sets that come equipped with a pair of functions to $A$ and $B$?
Once the question is phrased this way, the answer writes itself. Take any set $C$ together with a function $f \colon C \to A$ and a function $g \colon C \to B$. Then there is one and only one function $C \to A \times B$ compatible with the projections, namely the function sending $c$ to the pair $(f(c), g(c))$. Every other function that claims to respect the projections — that is, whose composition with $\pi_1$ returns $f$ and whose composition with $\pi_2$ returns $g$ — has to agree with this one at every $c$, because the two components $f(c)$ and $g(c)$ determine the pair they produce. The Cartesian product is therefore not just some set with projections but the universal such thing: every competing candidate factors through it, uniquely.
This is a universal property. It does not describe $A \times B$ by saying what its elements are. It describes it by saying what morphisms into it look like, and by asserting that those morphisms are uniquely determined by their projections. The shift of perspective — from “what is this thing made of” to “how does this thing relate to everything else” — is what categorical thinking does, and it turns out to generalise well beyond the setting of sets.
Stating the definition
Let $\mathcal{C}$ be a category and let $A$ and $B$ be two of its objects. A product of $A$ and $B$ is an object $A \times B$ together with two morphisms

$$\pi_1 \colon A \times B \to A \qquad \pi_2 \colon A \times B \to B$$

(the projections) such that for every object $C$ of $\mathcal{C}$ and every pair of morphisms $f \colon C \to A$, $g \colon C \to B$, there exists a unique morphism $\langle f, g \rangle \colon C \to A \times B$ making both

$$\pi_1 \circ \langle f, g \rangle = f \qquad \text{and} \qquad \pi_2 \circ \langle f, g \rangle = g$$

hold. The standard picture of the situation is

$$\begin{array}{ccc}
 & C & \\
 & \downarrow{\scriptstyle \langle f, g \rangle} & \\
A & \xleftarrow{\ \pi_1\ } \ A \times B \ \xrightarrow{\ \pi_2\ } & B
\end{array}$$

Figure 3.1: the universal property of the product, drawn with the cone object $C$ at the top, the two arbitrary morphisms $f$ and $g$ fanning downward (not drawn), and the factoring morphism $\langle f, g \rangle$ into $A \times B$ in the middle column. Any such cone factors through $A \times B$ by exactly one morphism.
The definition is doing two things that a reader should notice separately. Existence says some factoring morphism is always available; uniqueness says at most one is. Drop existence and some cones over $A$ and $B$ will have nowhere to go inside the category. Drop uniqueness and calling $A \times B$ “the product” becomes ambiguous, because different choices of factoring morphism give different squares, and the commutativity of the square no longer pins $\langle f, g \rangle$ down.
Three examples
Sets. In the category $\mathbf{Set}$, the product of two sets is the Cartesian product $A \times B$ with the standard projections, and the factoring morphism is $c \mapsto (f(c), g(c))$. The categorical definition has recovered the set-theoretic one. As it should: the reason to prefer the categorical version is not that it produces different answers in familiar cases but that it generalises to categories where “element of a set” is not the right notion.
Haskell types. In $\mathbf{Hask}$, the product of two types a and b is the pair type (a, b), projected out by fst and snd:

```haskell
fst :: (a, b) -> a
fst (x, _) = x

snd :: (a, b) -> b
snd (_, y) = y
```
The factoring morphism is the two-function paired application:
```haskell
pair :: (c -> a) -> (c -> b) -> c -> (a, b)
pair f g z = (f z, g z)
```
Uniqueness is forced by pointwise reasoning. Any function h :: c -> (a, b) that agrees with f through fst and with g through snd has to satisfy h z = (f z, g z) for every z, and two functions equal on every input are equal; so h = pair f g. The universal property is not magic — it is a constraint provable in one line, and its work is to rule out the alternatives a less careful definition might allow.
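The two commuting conditions behind that argument can also be checked mechanically. A QuickCheck sketch (the property name is ours):

```haskell
import Test.QuickCheck

pair :: (c -> a) -> (c -> b) -> c -> (a, b)
pair f g z = (f z, g z)

-- The two commuting triangles of the universal property, checked
-- pointwise: fst recovers f and snd recovers g through pair f g.
prop_factoring :: Fun Int Int -> Fun Int Bool -> Int -> Bool
prop_factoring (Fn f) (Fn g) z =
  fst (pair f g z) == f z && snd (pair f g z) == g z
```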
Panproto schemas. In $\mathbf{Sch}_P$, the category of schemas under a fixed protocol, the product of two schemas $S$ and $T$ is a schema $S \times T$ whose instances are pairs: one instance of $S$ alongside one instance of $T$, with the two projection migrations extracting the two halves. Using the running example from the previous two chapters, the product of the name-and-email schema with the name-and-email-and-phone schema has instances that are pairs — one record of each kind — and a migration into the product from some other schema is determined by a migration into each factor. The universal property holds by construction in panproto_schema::colimit, which houses both products and coproducts, limits and colimits being dual.
Products of schemas are not the most common operation panproto performs on its category of schemas: pushouts, covered in the next chapter, are more frequent. But they are available, and their universal property behaves exactly as the set-theoretic version prepared us to expect.
Uniqueness up to isomorphism
A universal property does not specify a unique object on the nose. It specifies an object up to a canonical isomorphism, which is almost as good, and the argument establishing this is worth walking through once because we will use the template of the argument again and again.
Suppose $X$ and $X'$ are both products of $A$ and $B$, with projections $\pi_1, \pi_2$ and $\pi_1', \pi_2'$ respectively. The universal property of $X$, applied to the cone $(X', \pi_1', \pi_2')$, produces a unique morphism $u \colon X' \to X$ satisfying $\pi_1 \circ u = \pi_1'$ and $\pi_2 \circ u = \pi_2'$. The universal property of $X'$, applied the other way, produces a unique morphism $v \colon X \to X'$. Compose the two in either order: $u \circ v$ and $v \circ u$ are each morphisms that factor their own target through itself in a way that respects the projections. Uniqueness forces each to be the identity, and therefore $u$ and $v$ are mutually inverse.
Two constructions claiming to be “the product”, therefore, are canonically identified by a unique isomorphism. We speak of the product of $A$ and $B$, knowing different representatives yield isomorphic objects, and the isomorphism between them is itself forced. This is the template for every universal-property argument in the book: two candidates, each universal; apply each candidate’s property to the other; derive a unique iso.
Coproducts
The coproduct is the dual of the product. Everything we have just said about products applies with all the arrows reversed: injections replace projections, factoring out replaces factoring in, and the universal property characterises the smallest object through which pairs of morphisms out of $A$ and $B$ must factor.
A coproduct of $A$ and $B$ in a category $\mathcal{C}$ is an object $A + B$ together with two morphisms

$$i_A : A \to A + B, \qquad i_B : B \to A + B$$

(the injections) such that for every object $C$ and every pair of morphisms $f : A \to C$, $g : B \to C$, there exists a unique morphism $[f, g] : A + B \to C$ with $[f, g] \circ i_A = f$ and $[f, g] \circ i_B = g$.
Three examples again
Sets. The coproduct of two sets in $\mathbf{Set}$ is the disjoint union $A \sqcup B$, the set whose elements are tagged versions of the elements of $A$ and $B$, arranged so that a shared element of $A$ and $B$ becomes two distinct elements of the union. The injections are the two tagging inclusions. The factoring morphism out of the coproduct is case analysis: apply $f$ to a tagged-$A$ element and $g$ to a tagged-$B$ one.
Haskell types. In $\mathbf{Hask}$, the coproduct of a and b is the sum type Either a b:
data Either a b = Left a | Right b
The two constructors Left and Right are the injections. The factoring morphism is the library function either:
either :: (a -> c) -> (b -> c) -> Either a b -> c
either f _ (Left x) = f x
either _ g (Right y) = g y
Uniqueness here reads: any function out of Either a b that agrees with f on Left values and with g on Right values has to be either f g. The argument is again pointwise, and again the universal property is not magic but a constraint forced on implementations.
Panproto schemas. In $\mathbf{Sch}_P$, the coproduct of two schemas is the disjoint-union schema: a schema whose instances are tagged, either an instance of the first summand or an instance of the second. A migration out of a coproduct schema is forced by a migration out of each summand, as the universal property demands. Coproducts of schemas, like products, live in panproto_schema::colimit; they will appear again in Merge as pushout as one ingredient of the more general pushout construction.
Initial and terminal objects
Two further universal constructions round out the picture: the universal cocone under an empty diagram (an initial object) and the universal cone over an empty diagram (a terminal object).
A terminal object in a category is one into which every other object has exactly one morphism. An initial object is one out of which every other object has exactly one morphism. Both are defined up to unique isomorphism by their universal properties, with the same template of argument as the one we gave for products.
Concretely: in $\mathbf{Set}$, any singleton is terminal (every set has exactly one function into it, the constant function) and the empty set is initial (every set has exactly one function out of it, the vacuous function). In $\mathbf{Hask}$, the unit type () is terminal and Void is initial. In $\mathbf{Sch}_P$, the schema with no fields is terminal and the schema with no instances is initial. Neither is a very exciting object on its own, but both become essential when we turn to diagrams larger than two objects, which is the next chapter.
The general pattern
Products, coproducts, terminal objects, and initial objects are four instances of a single pattern. Each fixes a shape of diagram (two objects, two objects, the empty diagram, the empty diagram) and declares the universal object that receives or produces the diagram’s data. Generalised to arbitrary shapes of diagram, the universal-cone construction is called a limit and the universal-cocone construction is called a colimit. A product is a limit (of a two-object discrete diagram); a terminal object is a limit (of the empty diagram); a coproduct and an initial object are colimits of the same two shapes. The next chapter develops the general theory, and the case that will matter most for the rest of the book — the pushout, colimit of a three-object span — gets a section of its own.
What the present chapter buys is that every construction in later parts of the book pinned down by a universal property enters with its uniqueness built in. The migration functors of Theory morphisms and instance migration are defined by universal properties ($\Sigma$ as a left adjoint, $\Pi$ as a right adjoint, each forced up to isomorphism by what morphisms do to cones). The pushout of Merge as pushout is a colimit. The cross-protocol translations of Protocol colimits are computed by colimits in the category of protocols. Each of these is an instance of the pattern of this chapter, and each inherits its uniqueness from the universal-property machinery we have just written down.
Further reading
Universal properties are the central theme of Leinster (2014), whose chapter 5 (“Limits”) and chapter 6 (“Adjoints, representables and limits”) are the most direct treatment in print. Awodey (2010) introduces products in chapter 2 (“Abstract Structures”) and coproducts and duality in chapter 3 (“Duality”). Mac Lane (1998) develops universals in chapter III (“Universals and Limits”) and general colimits in chapter V (“Limits”). Kelly (1982) develops the enriched generalisation, which panproto does not use directly but which sits behind the profunctor-optics literature cited in Bidirectional lenses.
For the calculational use of universal properties as equations available for rewriting programs, Bird & de Moor (1997) is the book-length treatment and is worth the detour for a reader with a functional-programming background.
Closing
The next chapter develops colimits and pushouts, generalising the universal-cocone construction of this chapter to arbitrary diagrams.
Colimits and pushouts
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The coproduct of the previous chapter combined two objects without asking anything about how they might already be related. In the setting of panproto, and indeed in most settings where structured things have to be combined, that is not what one usually wants. Two schemas that both descend from a shared ancestor should combine in a way that identifies the shared piece; two protocols with a common sub-vocabulary should combine in a way that identifies the common vocabulary. The construction that handles this is the pushout, and the general theory it sits inside is that of colimits.
The chapter has three tasks. The first is to define a diagram — the mathematical gadget for naming a shape of combination — and the general notion of colimit over a diagram of any shape. The second is to specialise to the three-object diagram called a span, whose colimit is the pushout. The third is to show the pushout at work in panproto, where it governs both protocol composition and the three-way merge operation at the heart of schematic version control.
Diagrams
Before defining the general colimit we need a vocabulary for “shape of combination”. The relevant gadget is straightforward once one has functors in hand: a shape is itself a small category, and a diagram of that shape in the target is a functor from the shape into the target.
A diagram in a category $\mathcal{C}$ is a functor $D : \mathcal{J} \to \mathcal{C}$ from some small category $\mathcal{J}$, called the shape of the diagram, into $\mathcal{C}$. The small category encodes the pattern of things to be combined and the relations among them; the functor picks out specific objects and morphisms of $\mathcal{C}$ that realise the pattern.
A reader who has survived the last two chapters may feel the definition is understated — a diagram is just a functor, which is already a familiar object. That is the force of the definition. Everything we know about functors applies to diagrams, including the whole apparatus of natural transformations and morphisms between diagrams, and the theory of colimits inherits the universal-property style of argument from the previous chapter without new machinery.
Two examples of shape categories will carry most of the weight.
The discrete shape on two objects, call it $\mathbf{2}$, has two objects and no non-identity morphisms. A diagram of this shape in $\mathcal{C}$ picks out two objects of $\mathcal{C}$ and says nothing about how they relate. Colimits over diagrams of this shape are coproducts, which we have met already; $n$-object discrete shapes give $n$-fold coproducts; and the empty shape gives an initial object.
The span shape, call it $\mathcal{J}_{\mathrm{span}}$, has three objects $s$, $a$, $b$ and two non-identity morphisms, one from $s$ to $a$ and one from $s$ to $b$, with no composites beyond identities. A diagram of this shape in $\mathcal{C}$ picks out three objects $S$, $A$, $B$ and two morphisms $f : S \to A$ and $g : S \to B$. Colimits over diagrams of this shape are pushouts, and they are the construction that does most of the book’s work.
Other shapes give other well-known colimits — a parallel-pair shape gives a coequaliser, an $\omega$-chain shape gives a sequential colimit — but we will not need those here. The span is the one to keep in hand.
Cocones and colimits
The universal property of the coproduct, read carefully, generalises to any diagram. A coproduct of $A$ and $B$ was an object equipped with two morphisms out of $A$ and $B$, universal among such equipments. In the general case the equipment is a family of morphisms indexed by the objects of the diagram’s shape.
A cocone under a diagram $D : \mathcal{J} \to \mathcal{C}$ with apex $C$ is a natural transformation from $D$ to the constant functor that sends every object of $\mathcal{J}$ to $C$ and every morphism to $\mathrm{id}_C$. Less economically: a cocone is a family of morphisms $c_j : D(j) \to C$, one for each object $j$ of $\mathcal{J}$, subject to the requirement that for every morphism $m : j \to k$ in $\mathcal{J}$ we have

$$c_k \circ D(m) = c_j.$$

The $c_j$ are the cocone’s leg morphisms, and the equation says the legs are compatible with the shape’s arrows: if $\mathcal{J}$ prescribes an arrow from $j$ to $k$, then the leg at $j$ must factor through the leg at $k$ via the image of that arrow.
A cocone under a span diagram in $\mathcal{C}$ has three legs, $c_S$, $c_A$, $c_B$, and two compatibility conditions, $c_A \circ f = c_S$ and $c_B \circ g = c_S$. The two equations force $c_S$ to be determined by the other two legs, so a span cocone is effectively a pair $(c_A, c_B)$ satisfying $c_A \circ f = c_B \circ g$. Most practical specifications of a pushout skip straight to this pair-with-equation formulation and omit the explicit third leg.
A colimit of a diagram $D$ is a universal cocone: an apex $\operatorname{colim} D$ with leg morphisms $\iota_j$ such that every other cocone $(C, c_j)$ factors through $\operatorname{colim} D$ by a unique morphism $u : \operatorname{colim} D \to C$ with $u \circ \iota_j = c_j$ for every $j$. The uniqueness-up-to-isomorphism template of the previous chapter applies verbatim: two cocones both satisfying the universal property are canonically identified, and we speak of the colimit.
Coproducts drop out of this as the case when the shape is discrete: the universal factorisation from the previous chapter is exactly the unique morphism the general definition produces. Initial objects drop out as the case when the shape is empty. The general pattern includes them as the smallest cases, and the next section specialises to the three-object case that will get most of our attention.
Pushouts
A pushout is the colimit of a span.
Given a span $A \xleftarrow{f} S \xrightarrow{g} B$, a pushout is an object $A +_S B$ together with morphisms $i_A : A \to A +_S B$ and $i_B : B \to A +_S B$ satisfying

$$i_A \circ f = i_B \circ g,$$

and universal among such: every other pair $j_A : A \to C$, $j_B : B \to C$ into some object $C$ that satisfies the analogous equation factors through $A +_S B$ by a unique morphism $u : A +_S B \to C$. The picture is

$$\begin{array}{ccc} S & \xrightarrow{\;g\;} & B \\ {\scriptstyle f}\downarrow\; & & \;\downarrow{\scriptstyle i_B} \\ A & \xrightarrow{\;i_A\;} & A +_S B \end{array}$$
Figure 4.1: the pushout square. The equation says the square commutes; universality says every other commuting square factors through this one by a unique $u$.
One sentence of intuition. The pushout glues $A$ and $B$ together along their shared image of $S$. If $f$ and $g$ are absent — if the span is really a discrete pair — the pushout collapses to the coproduct $A + B$, which does no gluing. What $f$ and $g$ add is the requirement that the two images of $S$, one in $A$ and one in $B$, be identified in the pushout. The pushout $A +_S B$ is the smallest object in which they are, which is what one usually means by “gluing two things together along a shared part”.
Pushout in $\mathbf{Set}$
In the category of sets, the pushout has an explicit construction. Take the disjoint union $A \sqcup B$. For each $s \in S$, identify the element $f(s) \in A$ with the element $g(s) \in B$. The quotient of the disjoint union by the equivalence relation generated by these identifications is the pushout, and the two injections $i_A$ and $i_B$ send an element to its equivalence class.
A minimal example: $S = \{*\}$, $A = \{a\}$, $B = \{b\}$, $f(*) = a$, $g(*) = b$. The disjoint union has two elements, $a$ and $b$; the identification equates them; the pushout is a one-element set. Both injections land at the single element, and the original elements of $A$ and $B$ have been glued together along the common image from $S$.
A slightly larger example makes the pattern visible. Let $S = \{s_1, s_2\}$, $A = \{a_1, a_2, a^*\}$, $B = \{b_1, b_2, b^*\}$, with $f(s_i) = a_i$ and $g(s_i) = b_i$. The disjoint union has six elements. The identifications equate $a_1$ with $b_1$ and $a_2$ with $b_2$, but the starred elements of $A$ and $B$ are unaffected, since nothing in $S$ maps to them. The pushout has four elements: $[a_1 = b_1]$, $[a_2 = b_2]$, $[a^*]$, $[b^*]$.
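The construction is mechanical enough to run. Here is a small Rust sketch — illustrative only, since panproto takes its pushouts in categories of schemas rather than of raw finite sets — that computes the quotient with a union–find over the tagged disjoint union:

```rust
// Pushout of finite sets A <- S -> B: disjoint union of A and B,
// quotiented by f(s) ~ g(s) for every s in S.

fn find(parent: &mut Vec<usize>, x: usize) -> usize {
    let p = parent[x];
    if p == x {
        return x;
    }
    let root = find(parent, p);
    parent[x] = root; // path compression
    root
}

// a, b are the sizes of A and B; f and g are indexed by the same S.
fn pushout(a: usize, b: usize, f: &[usize], g: &[usize]) -> Vec<usize> {
    assert_eq!(f.len(), g.len());
    // Elements 0..a are the tagged copy of A; a..a+b the tagged copy of B.
    let mut parent: Vec<usize> = (0..a + b).collect();
    for s in 0..f.len() {
        // identify f(s) in A with g(s) in B
        let x = find(&mut parent, f[s]);
        let y = find(&mut parent, a + g[s]);
        parent[x] = y;
    }
    // each element's equivalence-class representative
    (0..a + b).map(|x| find(&mut parent, x)).collect()
}

fn main() {
    // The larger example above: f(si) = ai, g(si) = bi, with
    // indices a1 = 0, a2 = 1, a* = 2, and likewise for B.
    let classes = pushout(3, 3, &[0, 1], &[0, 1]);
    let mut reps = classes.clone();
    reps.sort();
    reps.dedup();
    assert_eq!(reps.len(), 4); // [a1 = b1], [a2 = b2], [a*], [b*]
}
```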
This is what “gluing along” means concretely in $\mathbf{Set}$: two copies of something, sharing a subset, welded at the subset. The same picture will return in Merge as pushout, where the “subset” is a common-ancestor schema and the “two copies” are the two branches’ schemas. There is no new construction there; there is only a different choice of category in which to take the same pushout.
When pushouts do not exist
Not every category has every pushout. $\mathbf{Set}$ does, and so do the categories of groups, topological spaces, vector spaces, and panproto schemas. $\mathbf{Hask}$ does not in general, because the sum type that a pushout would demand might carry an equational constraint Haskell’s type system does not allow the programmer to express. A category has all pushouts if every span in it admits a pushout; the word for a category with all (small) colimits is cocomplete. Most of the categories we work with are cocomplete for the same reason $\mathbf{Set}$ is: their definitions allow the explicit constructions the abstract machinery asks for.
Pushout of theories
In the category of generalised algebraic theories — the subject of the next chapter — the pushout glues two theories along a common sub-theory.
Given a span of theories $T_A \xleftarrow{u} T_S \xrightarrow{v} T_B$, the pushout is a theory whose sorts and operations are the disjoint union of those of $T_A$ and $T_B$, with the sorts and operations in the image of $T_S$ identified along $u$ and $v$. The construction parallels the set-theoretic one; the extra work is that the theories’ equations must be inherited, which functoriality of the translations takes care of.
This is the setting worked out at length in the institutions framework of Goguen & Burstall (1992), developed precisely to handle parametric combinations of logical and algebraic theories. An institution is, roughly, a category of theories together with a functor to its category of models, set up so that colimits of theories lift to operations on models in a controlled way. Panproto’s treatment of protocol composition is institutional in this sense, though we will not develop the full institutional machinery here; a reader who wants it can find it in Goguen and Burstall.
Pushouts in panproto
The two places where pushouts dominate panproto’s engineering are protocol composition and three-way merge.
Protocol composition
A protocol is a generalised algebraic theory together with a parser, an emitter, and a registration, as Protocols as theories, schemas as instances develops in detail. Two protocols that share a common sub-protocol combine by taking the pushout of the theories along the shared sub-theory.
The concrete case that makes this useful is when two protocols need to agree on some shared vocabulary. Panproto ships a protocol for ATProto lexicons and a separate one for Apache Avro schemas. Both represent records with named fields of declared types; both share a common sub-vocabulary of primitive types (strings, integers, booleans). A combined protocol accepting both formats, and translating between them at the boundaries, is the pushout of the two along the shared primitives sub-theory. The pushout is the place where the two vocabularies agree; the rest of each protocol sits above that shared substrate.
Panproto’s implementation lives in panproto_gat::colimit and panproto_schema::colimit. The pushout is verified by the usual property-based test: construct a cocone out of two arbitrary morphisms agreeing on the shared subprotocol, check that it factors through the computed pushout by a unique morphism, repeat across a sampled state space. Protocol colimits develops the machinery in full.
Three-way merge
The second and more visible application is the three-way merge operation at the heart of schematic version control.
Here the running example from Part I reappears. Two developers branch from a common-ancestor schema $S_0$ — our name-and-email address-record schema — and each evolves it independently. Developer A adds a phone field, producing $S_A$. Developer B, working in parallel, renames email to contact_email, producing $S_B$. The two branches need to merge.
Spelling out the span: $S_A \xleftarrow{m_A} S_0 \xrightarrow{m_B} S_B$, with $m_A$ the add-phone migration and $m_B$ the rename-email migration. The pushout is a schema containing every field of $S_A$ and every field of $S_B$, with the fields inherited from $S_0$ (the original name, and email under its new name) appearing once. The injections $i_A$ and $i_B$ are the migrations from each branch into the merged schema; applied to each branch’s data, they yield the same records in the merged form.
Figure 4.2: three-way merge as a pushout. The two branch migrations $m_A$ and $m_B$ fan out from the common ancestor; the pushout glues them together along $S_0$’s shared content, and a record on either branch is carried into the merged schema by the appropriate injection.
Not every merge has a pushout. If A renames email to contact_email while B renames the same field to email_addr, the pushout would have to agree with both renamings on $S_0$’s email, which it cannot. Panproto’s merge algorithm detects this at the level of the universal property: the two branches present a cocone that does not factor uniquely through any candidate pushout, which means the pushout does not exist as a schema under the protocol. The engineering reading is that the merge has a genuine conflict, and the algorithm reports the exact disagreement — the two migrations cannot both commute with a third — rather than producing conflict markers in a text file.
The same pushout construction appears in the patch-theoretic treatment of textual merges due to Mimram & Di Giusto (2013), which informs the Darcs and Pijul systems (Roundy 2005). Panproto’s contribution is to apply the construction at the schema level rather than the byte level, which produces merge results that survive changes to the textual representation that have nothing to do with schema content. Merge as pushout specialises the present chapter’s construction to the implementation, including the handling of data stored under each branch.
Further reading
Mac Lane (1998), chapter III (“Universals and Limits”), is the canonical source for limits and colimits, with pushouts treated as a specific case. Chapter V (“Limits”) develops the general theory of diagrams. Awodey (2010), chapter 5 (“Limits and Colimits”), is the same material in a more undergraduate-facing register. Riehl (2016), chapter 3 (“Limits and Colimits”), is the modern reference. Leinster (2014) develops colimits in chapter 5 (“Limits”).
For the institutional perspective on pushouts of theories, Goguen & Burstall (1992) is foundational and Goguen (1991) its accompanying exposition. For the patch-theoretic view of merge, Mimram & Di Giusto (2013) is the nearest neighbour. The textbook-length account of algebraic specifications and theories combined by colimit is Sannella & Tarlecki (2012).
Closing
The next chapter introduces algebraic and generalised algebraic theories, the mathematical language in which panproto writes down what a protocol is.
Algebraic and generalised algebraic theories
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Three identifications carry the technical content of panproto, and the present chapter is where each of them is made precise. A protocol is a generalised algebraic theory. A schema is a model of one. A migration is a morphism of models. The chapter has the job of making those three sentences land as mathematics rather than slogans, so that every later chapter can invoke them without further justification.
The concepts we are going to use are older than panproto by decades, and most of them are older than most working programmers. Algebraic theories in the modern sense are due to Lawvere (1963), whose 1963 thesis reformulated the venerable notion of an algebraic structure — group, ring, vector space — as a small category with finite products, thereby making “model of a theory” mean “product-preserving functor into the category of sets”. Generalised algebraic theories, which extend Lawvere’s framework to accommodate dependent sorts, are due to Cartmell (1978), whose thesis gave them their first formal treatment; the more widely cited journal version is Cartmell (1986). What this book adds is the engineering claim that the schema languages working programmers use every day — JSON Schema, Avro, Protobuf, ATProto lexicons, SQL DDL, tree-sitter grammars — all fit in this framework, and that once they do, the migration and lens machinery of Part II applies to them uniformly. The present chapter states the mathematical side of that claim; Parts II and IV vindicate the engineering side case by case.
We assume the reader has worked through Categories, Functors and natural transformations, Universal properties, and Colimits and pushouts. The present chapter leans on all four.
Algebraic theories
An algebraic theory specifies a kind of mathematical structure by listing its sorts, its operations, and the equations those operations satisfy. Before anyone called it categorical, the idea was standard in universal algebra: a group, say, was a set equipped with a binary operation, an inverse operation, and a distinguished identity element, satisfying the usual axioms; a ring was a set with two binary operations and a few more axioms; a module over a ring was two sets and three operations across them. The common shape — sorts, operations, equations — was the content that universal algebra abstracted and that Lawvere reformulated in categorical terms.
The data of an algebraic theory are, concretely, a finite list of sorts, a collection of operation symbols each with a declared arity (which sorts the arguments come from and which sort the result lands in), and a collection of equations between terms built from the operations and free variables. The theory of a monoid is small enough to write in one sentence: one sort $M$, two operations (a constant $e$ for the unit and a binary $\cdot$ for the multiplication), and three equations for associativity and two-sided unitality. A monoid in the ordinary sense is any interpretation: a set standing for $M$, an element for $e$, a binary function for $\cdot$, such that the three equations hold in the interpretation. Groups, rings, lattices, and vector spaces are given in the same way, with more sorts or more operations.
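To see “interpretation” as something executable, here is a minimal Rust sketch — the mathematical definition written out as data, not panproto API — of a model of the monoid theory, with a finite spot check of the three equations:

```rust
// A model of the one-sort monoid theory: a carrier type T, an element
// interpreting the unit, a function interpreting the multiplication.
struct MonoidModel<T> {
    unit: T,
    mul: fn(&T, &T) -> T,
}

// Spot-check the three monoid equations on a finite sample of elements.
fn equations_hold<T: PartialEq>(m: &MonoidModel<T>, sample: &[T]) -> bool {
    let unit_laws = sample
        .iter()
        .all(|x| (m.mul)(&m.unit, x) == *x && (m.mul)(x, &m.unit) == *x);
    let assoc = sample.iter().all(|x| {
        sample.iter().all(|y| {
            sample.iter().all(|z| {
                (m.mul)(&(m.mul)(x, y), z) == (m.mul)(x, &(m.mul)(y, z))
            })
        })
    });
    unit_laws && assoc
}

fn main() {
    // Two very different models of the same theory.
    let additive = MonoidModel { unit: 0i64, mul: |a, b| a + b };
    let concat = MonoidModel {
        unit: String::new(),
        mul: |a: &String, b: &String| format!("{a}{b}"),
    };
    assert!(equations_hold(&additive, &[1, 2, 3]));
    assert!(equations_hold(&concat, &["ab".to_string(), "c".to_string()]));
}
```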
The categorical presentation
Lawvere’s insight was that the theory of a monoid — or of any algebraic structure — can be packaged as a small category rather than as a list of symbols. The associated category $\mathcal{L}_T$ of a theory $T$ has the finite products of sorts as its objects and equivalence classes of terms (modulo the theory’s equations) as its morphisms. Composition is substitution. The category has finite products by construction, and the operations of $T$ sit inside it as specific morphisms between products of sorts.
The payoff of this repackaging is that a model of $T$ in a category $\mathcal{C}$ with finite products is now simply a product-preserving functor $M : \mathcal{L}_T \to \mathcal{C}$. “Product-preserving” means the functor takes products to products ($M(X \times Y) \cong M(X) \times M(Y)$) and the terminal object to the terminal object ($M(1) \cong 1$). When $\mathcal{C}$ is $\mathbf{Set}$, a model is what every undergraduate textbook calls “a monoid” or “a group”: an assignment of a set to each sort and a function to each operation that respects the theory’s equations. The equations are respected automatically, because two terms that are equal under the theory have become the same morphism of $\mathcal{L}_T$, and every functor sends equal morphisms to equal morphisms.
The category of monoids, with monoid homomorphisms as morphisms, is the category of product-preserving functors $\mathcal{L}_{\mathrm{Mon}} \to \mathbf{Set}$; the category of groups is similarly the product-preserving functors $\mathcal{L}_{\mathrm{Grp}} \to \mathbf{Set}$; and the pattern extends. Every algebraic structure the reader has met in an undergraduate algebra class has a presentation of this form, and the categorical tools from Part I — functors between theories, natural transformations between models, colimits of models — apply uniformly across the family.
For a reader to whom the move from “a list of symbols” to “a small category with finite products” still feels abstract: the motivation is that the categorical presentation packages all the information the theory contains (sorts, operations, equations) into a single object of a kind we already understand. The categorical tools of Part I operate on categories, and a theory reformulated as a category is now something those tools can operate on. The payoff in panproto’s setting is that schema languages with very different surface syntax — JSON Schema, Avro, ATProto lexicons — become specific small categories, and migrations between schemas under a protocol become natural transformations in the corresponding functor category. The uniformity is what the engine exploits.
Generalised algebraic theories
Lawvere’s framework is sufficient when every sort is global, meaning that the sort makes sense before any element of it has been chosen. The sort of monoid elements is global in this sense: there is a single sort $M$, standing for one fixed set in any model, and its elements are not indexed by anything else.
Many structures arising in programming, type theory, and databases are not of this form. A sort of “vectors of length $n$” depends on a previously chosen natural number $n$; one cannot speak of a vector without first committing to its length. A sort of “migrations from schema $A$ to schema $B$” depends on two previously chosen schemas. Sorts of this kind are called dependent sorts, and they do not fit into Lawvere’s original framework because the framework has no way to say that a sort is indexed by elements of another sort.
A generalised algebraic theory, or GAT, is Cartmell’s extension of Lawvere’s framework to dependent sorts. The data are the same three kinds of thing as before — sort symbols, operation symbols, equations — but each is now declared in a context of free variables whose values may appear in what is being declared. A sort depending on a natural number is declared in the context $(n : \mathbb{N})$. A sort depending on two objects of a category is declared in the context $(a : \mathrm{Ob}, b : \mathrm{Ob})$. Operations and equations are contextualised the same way, with their own free variables whose types may themselves be dependent.
The canonical example of a GAT is the theory of a small category. It has two sorts, $\mathrm{Ob}$ (global) and $\mathrm{Hom}(a, b)$ (dependent on two objects, in the context $(a : \mathrm{Ob}, b : \mathrm{Ob})$); it has two operations, composition and identity, each contextualised over triples or singletons of objects; and it has the associativity and identity equations we met in Categories, each stated in the appropriate context. A model of this GAT is, unsurprisingly, a small category. The fact that the theory of a category is naturally a GAT rather than a Lawvere theory is a sign that dependent sorts are not an edge case but a routine feature of the theories one actually encounters.
Contextual categories
Cartmell (1986) gives GATs their categorical face. A GAT corresponds to a contextual category, a small category equipped with the additional structure needed to interpret dependently-sorted syntax internally: a distinguished class of context objects, a notion of type in a context, and an extension operation that lets a context be enlarged by a type. The definition is a careful piece of machinery, and we will not repeat it here; Hofmann (1997) is the place to go for the full account.
The move from syntactic GAT to categorical contextual category plays the same role for dependent theories that the Lawvere-theory-as-small-category move plays for algebraic theories: it makes models into functors and migrations into natural transformations, and it brings all the tools of Part I to bear. A model of a GAT $T$ in a contextual category $\mathcal{C}$ is a structure-preserving functor $\mathcal{C}_T \to \mathcal{C}$, where $\mathcal{C}_T$ is the contextual category built from $T$’s syntax; when $\mathcal{C}$ is $\mathbf{Set}$ with its canonical contextual structure, a model is a family of sets (one for each context-indexed sort, indexed by the interpretations of the context’s variables) together with the functions the operations require.
GATs and type theory
The broader programme GATs belong to is the categorical semantics of type theory, a line of research with its own substantial literature. The simply typed lambda calculus of Church (1940), the polymorphic System F of Girard (1972), the dependently typed framework of Martin-Löf (1984), and the homotopy-type-theoretic extension of The Univalent Foundations Program (2013) can all be cast as generalised algebraic theories, with types, terms, and judgments playing the roles of sorts and operations. Associated categorical work develops the connection in several directions: Hofmann (1997) works out the syntactic side in detail; Jacobs (1999) is the encyclopedic reference; Dybjer (1996) introduces categories-with-families, a variant of contextual categories that many working type theorists find easier to use; Fiore et al. (1999) extend Lawvere’s framework to algebraic theories with variable binding, complementing Cartmell’s direction. Panproto does not use the type-theoretic machinery explicitly, but the engine’s internal representation of theories follows the contextual-category pattern these sources establish.
How panproto-gat represents dependent sorts
The abstract story above has a concrete counterpart in the panproto-gat crate, and the correspondence is worth stating explicitly because every later chapter takes it for granted.
A sort declaration is a Sort value: a name, a list of SortParam entries for its parameters, and a kind (structural, value, coercion, merger). An operation is an Operation: a name, a list of (param_name, SortExpr) inputs, and a SortExpr output. A SortExpr is either a bare name or a head name applied to argument terms (the dependent case); Name(n) and App { name: n, args: vec![] } are two spellings of the same value, enforced by the Eq and Hash impls so that the two spellings never split a hashed carrier.
The parameter names in an operation’s input list are binders: each name is in scope in the sort expressions of later inputs and in the output sort expression. This is the syntactic feature that lets compose express the middle-object constraint directly: the y bound in an early input position is the same y that appears in the sort expressions of f and g, tying the two together.
Typechecking an application $f(a_1, \ldots, a_n)$ walks the declared signature of f while accumulating a substitution $\theta$ that maps each input parameter name to the corresponding concrete argument term. The expected sort for argument $i$ is inputs[i].1.subst(&θ), computed from the running substitution so that earlier arguments flow into later input sorts. The output sort of the whole application is output.subst(&θ). The relevant entry points are typecheck_term and typecheck_theory.
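The walk is short enough to sketch. The following is a simplified rendition of the application rule just described, with a single Expr type standing in for both terms and sort expressions (argument terms appear inside dependent sorts); the real panproto-gat types carry more structure:

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Debug)]
enum Expr {
    Var(String),
    App(String, Vec<Expr>),
}

type Subst = HashMap<String, Expr>;

// Replace every variable that theta maps, leaving the rest untouched.
fn apply(e: &Expr, theta: &Subst) -> Expr {
    match e {
        Expr::Var(n) => theta.get(n).cloned().unwrap_or_else(|| e.clone()),
        Expr::App(head, args) => {
            Expr::App(head.clone(), args.iter().map(|a| apply(a, theta)).collect())
        }
    }
}

struct Operation {
    inputs: Vec<(String, Expr)>, // (parameter name, declared input sort)
    output: Expr,                // declared output sort
}

// Each argument arrives as (term, its already-inferred sort).
fn typecheck_app(op: &Operation, args: &[(Expr, Expr)]) -> Result<Expr, String> {
    assert_eq!(op.inputs.len(), args.len(), "arity mismatch");
    let mut theta: Subst = HashMap::new();
    for ((param, declared), (arg, inferred)) in op.inputs.iter().zip(args) {
        // Earlier arguments flow into later expected sorts via theta.
        let expected = apply(declared, &theta);
        if *inferred != expected {
            return Err(format!("expected {expected:?}, found {inferred:?}"));
        }
        theta.insert(param.clone(), arg.clone());
    }
    // The application's sort is the declared output under the final theta.
    Ok(apply(&op.output, &theta))
}

fn main() {
    // compose : (x : Ob, y : Ob, z : Ob, f : Hom(x, y), g : Hom(y, z)) -> Hom(x, z)
    let ob = |n: &str| Expr::Var(n.to_string());
    let hom = |a: Expr, b: Expr| Expr::App("Hom".into(), vec![a, b]);
    let compose = Operation {
        inputs: vec![
            ("x".into(), Expr::App("Ob".into(), vec![])),
            ("y".into(), Expr::App("Ob".into(), vec![])),
            ("z".into(), Expr::App("Ob".into(), vec![])),
            ("f".into(), hom(ob("x"), ob("y"))),
            ("g".into(), hom(ob("y"), ob("z"))),
        ],
        output: hom(ob("x"), ob("z")),
    };
    // Apply at objects a, b, c and arrows p : Hom(a, b), q : Hom(b, c).
    let args = vec![
        (ob("a"), Expr::App("Ob".into(), vec![])),
        (ob("b"), Expr::App("Ob".into(), vec![])),
        (ob("c"), Expr::App("Ob".into(), vec![])),
        (ob("p"), hom(ob("a"), ob("b"))),
        (ob("q"), hom(ob("b"), ob("c"))),
    ];
    let out = typecheck_app(&compose, &args).unwrap();
    assert_eq!(out, hom(ob("a"), ob("c"))); // Hom(a, c)
}
```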
Equations introduce one extra wrinkle. The free variables in an equation are universally quantified, and their sorts are not supplied at the call site; they have to be inferred. panproto-gat does this by collecting sort-expression constraints across both sides of the equation and running Robinson unification on the argument terms. Two uses of the same variable must have sort expressions whose heads match; the arguments pile up as term equations whose most general unifier fills in the variable sorts. This is why the associativity equation for compose typechecks even though the context is never written explicitly.
The free-model generator organises its carrier sets by instantiated sort rather than by head name. A fiber of $\mathrm{Hom}$ is indexed by the pair $(a, b)$ of concrete objects, and terms are filed under the fiber determined by their instantiated output sort. The upshot is that a theory with two parallel generators $f, g : \mathrm{Hom}(a, b)$ produces no composite rather than the Cartesian product of candidates a head-indexed generator would produce, because compose(f, g) would require the middle object to be both b and a at once. The fiber matching rules out the non-composite.
Three further features were added to the representation in the 0.37 release. The first is the Implicit tag on each operation input, which marks a parameter as recoverable at the call site by unification rather than supplied explicitly; an operation whose first three inputs are implicit is written app(f, x) at the use site even though its declaration reads app : (G, A, B, f : Tm(G, arrow(A, B)), x : Tm(G, A)) -> Tm(G, B). The second is the SortClosure attribute on each Sort, which marks a sort as closed with a fixed constructor list. The third is Term::Case, the case-analysis term whose coverage the typechecker verifies against the scrutinee’s closure. The three features sit alongside the dependent-sort machinery above and are what make the encodings of Part II read the way a textbook writes them.
Pattern matching on closed sorts
A closed sort names every inhabitant its model can produce: the theory of natural numbers closes with zero and succ; the theory of booleans closes with true and false; the theory of an inductive list closes with nil and cons. Once a sort is closed, a case analysis on an inhabitant is decidable: every inhabitant is one of finitely many constructors, and a case expression lists one branch per constructor.
panproto-gat represents a case expression as a Term::Case { scrutinee, branches } whose branches field is a Vec<CaseBranch>. Each CaseBranch names the constructor it handles, binds one local name per input of that constructor, and carries a body term. The typechecker checks two conditions. Coverage: the branch list contains exactly the constructors of the scrutinee’s closure, each one exactly once. Uniformity: every branch body typechecks to the same output sort, under a context extended by the branch’s own binders. An incomplete branch list, a duplicated constructor, or two branches with mismatched output sorts are all errors surfaced at the Case site. This is the feature the inductive-type shorthand in panproto-theory-dsl and the inductive! macro in panproto-gat-macros are built on: the shorthand closes a new sort with its constructor list, and every case expression against that sort is now a well-posed question about coverage rather than a pattern-matching guess.
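The coverage half of the check reduces to bookkeeping over constructor names. A simplified sketch, with stand-in types rather than the panproto-gat definitions (the uniformity half, which typechecks each branch body under its binders, is omitted):

```rust
use std::collections::HashSet;

// The real CaseBranch also carries binders and a body term,
// which coverage checking ignores.
struct CaseBranch {
    constructor: String,
}

// Coverage: the branch list contains exactly the constructors of the
// scrutinee sort's closure, each exactly once.
fn check_coverage(closure: &[&str], branches: &[CaseBranch]) -> Result<(), String> {
    let mut seen: HashSet<&str> = HashSet::new();
    for branch in branches {
        let c = branch.constructor.as_str();
        if !closure.contains(&c) {
            return Err(format!("`{c}` is not a constructor of this sort"));
        }
        if !seen.insert(c) {
            return Err(format!("duplicate branch for `{c}`"));
        }
    }
    let missing: Vec<_> = closure.iter().filter(|c| !seen.contains(**c)).collect();
    if !missing.is_empty() {
        return Err(format!("missing branches: {missing:?}"));
    }
    Ok(())
}

fn main() {
    let bool_closure = ["true", "false"];
    let branches = vec![
        CaseBranch { constructor: "true".into() },
        CaseBranch { constructor: "false".into() },
    ];
    assert!(check_coverage(&bool_closure, &branches).is_ok());
    // An incomplete branch list is an error surfaced at the Case site.
    assert!(check_coverage(&bool_closure, &branches[..1]).is_err());
}
```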
Local let bindings
A further term-syntax addition in 0.37 is Term::Let { name, bound, body }, which binds the value of a term to a local name in the body. A let binding is the term-level analogue of the type-theoretic let x = e in b construct: the name x is in scope in b and nowhere else. The typechecker infers the bound term’s sort, records it in a SortScheme that carries both a sort expression and a list of quantified sort metavariables, and extends the local context with the pair before typechecking the body.
GAT signatures are first-order and their sorts have no free sort-metavariables, so the scheme recorded at a let site in the 0.37 release is always monomorphic: the metavars list is empty and the scheme is a single sort expression wearing polymorphic clothes. The type-level machinery is in place so that a later pass that introduces sort-level metavariables at the signature level can populate the scheme nontrivially without reshaping the term syntax. Until that pass lands, a let binding is a readability aid that saves the user from writing the same term twice; it is not yet a polymorphism primitive.
Directed equations
The engine also carries a third layer of equations, independent of the undirected equation list the theory declares: a list of DirectedEquation values, each an oriented LHS-to-RHS rewrite used for coercion round-trips, for the REPL’s :normalize command, and for the typecheck_equation_modulo_rewrites path that accepts an equation whose two sides share a normal form under the directed system. Directed equations need guarantees an undirected equation list does not: termination (so normalization halts) and confluence (so the normal form is unique). Rewriting, confluence, and termination is the chapter that works out what those guarantees mean and what the engine checks to supply them.
Models and morphisms
With GATs in hand, we can state the categorical facts about schemas and migrations the rest of the book will use.
A model of a GAT $T$ in a contextual category $\mathcal{C}$ is a structure-preserving functor $\mathcal{C}_T \to \mathcal{C}$. The structure preservation is stronger than for ordinary functors, because the functor has to respect the dependent-sort structure as well as composition and identity; in the contextual-category presentation this amounts to respecting the extension operation. In practice $\mathcal{C}$ is usually $\mathbf{Set}$ with its slice structure, and a model is a family of sets together with the functions the theory’s operations require.
A morphism of models $M \to N$ is a natural transformation between the functors that preserves the contextual structure. Concretely: for each sort $S$ of the theory, a function from $M$’s interpretation of $S$ to $N$’s, compatible with the theory’s operations — which is to say, applying an operation under $M$ and then crossing to $N$ is the same as crossing to $N$ first and then applying the operation under $N$. The condition is naturality from Functors and natural transformations, generalised to the dependent setting.
Models of $T$ and their morphisms form a category, written $\mathbf{Mod}(T)$. In panproto, the objects of $\mathbf{Mod}(T)$ are the schemas under the protocol whose theory is $T$, and the morphisms are the migrations between schemas. The category we have been calling $\mathbf{Sch}_P$ is $\mathbf{Mod}(T_P)$ for the GAT $T_P$ of the protocol $P$.
Panproto’s equation
The technical content of panproto can now be stated in a single equation, which every chapter of Parts II through V will unpack.
A protocol is a generalised algebraic theory. In Rust, a protocol’s theory is a Theory value from panproto-gat, holding sort declarations, operation declarations with their contexts, and equations, plus a type-checker that verifies well-formedness.
A schema under a protocol $P$ is a model of the GAT $T_P$ corresponding to $P$. The Rust representation is a Schema value from panproto-schema, built by a SchemaBuilder that supplies interpretations for each sort and operation and validates the result against the protocol’s equations.
A migration from a schema $A$ to a schema $B$ (both under the same protocol $P$) is a morphism of models in $\mathbf{Mod}(T_P)$. The Rust representation is a Migration value from panproto-mig.
Three identifications; three Rust types. That is the equation. The rest of Part II takes the three identifications apart one at a time: Protocols as theories, schemas as instances handles the first two; Theory morphisms and instance migration handles the third.
Why this definition
The claim that a panproto protocol is a GAT is a strong one, and the claim that every schema language panproto supports fits into the framework is stronger still. The framework has to be expressive enough to cover ATProto lexicons, Avro records, Protobuf messages, JSON Schema documents, SQL DDL, FHIR resource profiles, and tree-sitter-derived programming-language grammars, and it has to be restrictive enough to admit the constructions Part II relies on — pushouts in the category of schemas, lifting of instances along morphisms, lenses between schema-indexed families.
Neither requirement is obvious in advance. Cartmell’s 1986 paper establishes the framework and shows that first-order signatures with equational axioms embed into it, which is a start, but the embedding does not automatically extend to the schema languages in production use. What it takes for the framework to cover them, and where the fit is exact versus approximate, is the subject of a dedicated chapter in Part IV; reading that chapter alongside Protocols as theories, schemas as instances is the right way to check the claim against a specific protocol one cares about.
Further reading
For algebraic theories themselves, Lawvere (1963) is the original source, republished in the TAC Reprints series and freely available there. Lambek & Scott (1986) and Jacobs (1999) are two book-length treatments of the categorical view of logic and type theory in which algebraic theories sit; both are demanding but rewarding.
For GATs specifically, Cartmell (1986) is the foundational journal paper. Dybjer (1996) develops the categories-with-families variant of contextual categories that many working type theorists prefer. Hofmann (1997) gives the syntactic–categorical correspondence in full. The type-theoretic programme GATs serve is covered in depth by Martin-Löf (1984) (the original dependently-typed setting), The Univalent Foundations Program (2013) (the homotopy-theoretic extension), and Harper (2016) (the programming-languages angle). Sannella & Tarlecki (2012) is the encyclopedic treatment of the algebraic-specification tradition that GATs sit inside.
Closing
One foundations chapter remains: Rewriting, confluence, and termination covers the directed-equation machinery that closes out Part I. Part II then opens with protocols as theories, schemas as instances, which translates the equation of the present chapter into working Rust.
Rewriting, confluence, and termination
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Equations in a GAT are undirected. An equation says nothing about which side is the normal form; it says only that the theory’s models identify the two terms. The engine needs more. When a REPL user asks to normalize a term, or when the typechecker wants to decide whether two sides of a proposed equation are equal up to the existing rules, a symmetric relation is the wrong thing to reason with. The engine needs a direction, and it needs guarantees that the directed relation is well-behaved. This chapter is about the two guarantees that matter, and the two reports in panproto-gat::rewriting that supply them.
A DirectedEquation is an equation with an oriented LHS and RHS: rewriting proceeds from LHS to RHS, and the inverse direction is optional. The engine uses directed equations for three distinct purposes. Coercion round-trips were the original motivating case: a coercion and its inverse form a directed pair, and the engine rewrites away the compose-with-inverse on normalization. The REPL’s :normalize command is the second case. The third, and the one the 0.37 release adds, is the new typecheck_equation_modulo_rewrites path: a proposed equation is accepted if its two sides share a normal form under the theory’s directed system, not only if they are already alpha-equivalent. Each of the three uses depends on the directed system behaving well; the rewriting module is where well-behaved is made precise.
Termination
A rewrite system terminates if no term admits an infinite reduction sequence. A non-terminating system is one that can rewrite forever, and the engine calling normalize on a term under a non-terminating system will loop. The textbook check is the lexicographic path order (LPO), the simplification-order family introduced by Dershowitz in the early 1980s: pick a total order on the operation symbols (the precedence), then extend it to a term order by comparing heads first and, on tied heads, comparing argument lists lexicographically. A rewrite rule $\mathit{lhs} \to \mathit{rhs}$ is LPO-compatible if $\mathit{lhs}\,\sigma >_{\mathrm{lpo}} \mathit{rhs}\,\sigma$ for every ground substitution $\sigma$. A system whose rules are all LPO-compatible for a single precedence is guaranteed to terminate.
check_termination_via_lpo takes a list of directed equations and an OpPrecedence and returns a TerminationReport. The report lists any rule whose LPO comparison fails, with a RuleViolation identifying the offending rule and the specific LHS-versus-RHS comparison that did not go the right way. The raw comparator lpo_greater is exposed for callers who want to order terms directly.
The precedence is the user’s design choice: a rule like $\mathrm{compose}(f, \mathrm{inv}(f)) \to \mathrm{id}$ needs $\mathrm{compose} > \mathrm{id}$ in the precedence, which matches the intuition that composition is the heavier symbol and identity is lighter. A precedence that puts the two in the other order will fail the LPO check, and the user is expected to notice and fix their rule direction. The engine does not invent the precedence; it checks the one the user supplies.
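The comparison itself is compact enough to sketch in full. The following is a self-contained rendition of the textbook LPO — not the crate’s lpo_greater, though the two compute the same order — exercised on the coercion-style rule just discussed:

```rust
// Lexicographic path order. Precedence maps each operation symbol
// to a rank; higher rank means heavier.
use std::collections::HashMap;

#[derive(Clone, PartialEq)]
enum Term {
    Var(String),
    App(String, Vec<Term>),
}

fn occurs(x: &str, t: &Term) -> bool {
    match t {
        Term::Var(y) => x == y,
        Term::App(_, args) => args.iter().any(|a| occurs(x, a)),
    }
}

fn lpo_gt(prec: &HashMap<&str, u32>, s: &Term, t: &Term) -> bool {
    match (s, t) {
        // A term is greater than any variable it properly contains.
        (_, Term::Var(x)) => s != t && occurs(x, s),
        // A variable is never greater than a non-variable term.
        (Term::Var(_), _) => false,
        (Term::App(f, ss), Term::App(g, ts)) => {
            // (a) some immediate subterm of s already dominates t
            if ss.iter().any(|si| si == t || lpo_gt(prec, si, t)) {
                return true;
            }
            // for (b) and (c), s must dominate every immediate subterm of t
            if !ts.iter().all(|tj| lpo_gt(prec, s, tj)) {
                return false;
            }
            if prec[f.as_str()] > prec[g.as_str()] {
                true // (b) heavier head wins
            } else if f == g {
                // (c) same head, hence same arity: the first differing
                // pair of arguments decides, lexicographically
                match ss.iter().zip(ts).find(|(a, b)| a != b) {
                    Some((a, b)) => lpo_gt(prec, a, b),
                    None => false, // identical terms are not strictly greater
                }
            } else {
                false
            }
        }
    }
}

fn main() {
    // compose(f, inv(f)) -> id is LPO-compatible when compose outranks id.
    let prec = HashMap::from([("compose", 2u32), ("inv", 1), ("id", 0)]);
    let f = Term::Var("f".into());
    let lhs = Term::App(
        "compose".into(),
        vec![f.clone(), Term::App("inv".into(), vec![f.clone()])],
    );
    let rhs = Term::App("id".into(), vec![]);
    assert!(lpo_gt(&prec, &lhs, &rhs));
    assert!(!lpo_gt(&prec, &rhs, &lhs)); // the flipped orientation fails
}
```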
Confluence
A rewrite system is confluent if every term that can reduce by two paths can eventually rejoin at a common reduct. A non-confluent system is one where the normal form depends on the order in which rules fire, which is a semantic bug: asking whether two terms are equal under the theory has multiple answers depending on how the engine chose to reduce them.
For directed equational systems without unrestricted left-hand sides, local confluence is decidable by the Knuth-Bendix critical-pair construction. Every pair of rules is examined at every overlap of their left-hand sides; the two reducts of the overlapped redex form a critical pair; and the system is locally confluent iff every critical pair joins, meaning that the two reducts reach a common normal form under the rest of the system. Newman’s lemma then promotes local confluence to full confluence in the presence of termination; together with the LPO check above, local confluence is what the engine needs.
check_local_confluence computes the critical pairs of a directed system and returns a ConfluenceReport listing every critical pair that does not join. Each CriticalPair carries the two rules that overlapped, the unified LHS they share, the two reducts, and a flag indicating whether the pair joined. A theory author reading a non-empty critical-pair list has concrete evidence of which pair of rules competes: the remedy is to orient one of them the other way, add an equation that forces the two reducts equal, or rewrite one rule’s LHS so the overlap disappears.
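A standard textbook example of a pair that fails to join — not drawn from panproto’s own theories — is the two-rule system

$$f(g(x)) \to x, \qquad g(h(y)) \to y.$$

The term $f(g(h(y)))$ overlaps both left-hand sides: the first rule reduces it to $h(y)$, the second (applied inside) to $f(y)$, and neither reduct reduces further, so the critical pair $(h(y), f(y))$ does not join. Reorienting a rule, or adding a rule that joins the two reducts, is exactly the kind of remedy the report points at.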
Why both together
A system that is confluent but not terminating can still reduce forever; a system that terminates but is not confluent produces different normal forms depending on reduction order. The engine needs both. The two reports are independent, and the typical workflow is to run check_termination_via_lpo first (a termination failure often masks a confluence problem by making the critical-pair completion diverge), then check_local_confluence, then iterate until both reports are clean. Once they are, normalize is a well-defined operation, typecheck_equation_modulo_rewrites has a terminating algorithm, and the REPL’s :normalize command produces the unique answer the theory’s equations demand.
Further reading
The standard book-length treatment of term rewriting, confluence, and termination orderings is Baader and Nipkow’s Term Rewriting and All That (1998), which covers every construction used in this chapter at the level of a working proof. The path-ordering family that LPO belongs to was introduced by Dershowitz in the early 1980s; the critical-pair / completion procedure is Knuth and Bendix’s Simple Word Problems in Universal Algebras (1970). Bachmair and Ganzinger’s work in the late 1980s gives the modern ordered-completion refinement that underlies most production implementations.
Closing
With a well-behaved directed system in hand, the engine can normalize terms and check equations modulo rewriting. The next chapter opens Part II by translating this chapter’s mathematics into the language of the schemas working programmers use every day.
Protocols as theories, schemas as instances
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Part II opens here, and its job is to take the single equation stated at the end of Part I — protocol = GAT, schema = model, migration = morphism of models — and work each side of it out in enough detail that a reader can find their way from a Rust value in panproto-schema to the mathematical object it represents, and back again.
The present chapter handles the first two identifications. A reader comfortable with the mathematics of Part I already has most of what is needed; the new material is the mapping between the mathematical vocabulary and the names of Rust types in the crate, together with a worked small protocol concrete enough to touch.
Protocols as GATs
A panproto protocol is a GAT together with the auxiliary data a working system needs: a parser that reads the protocol’s native surface syntax, an emitter that writes it back out, and an entry in the registry that lets other parts of the engine look the protocol up by name. The mathematical heart of a protocol is the GAT; everything else is engineering.
The Rust representation of the GAT is a Theory value from panproto-gat. A Theory carries three pieces of data that correspond exactly to the three pieces of a GAT from Algebraic and generalised algebraic theories: a list of sort declarations, each with its context of dependent variables; a list of operation declarations, each with its context and a dependent arity specifying argument sorts and result sort; and a list of equations between terms, each stated in its own context. The type-checker in panproto_gat::typecheck verifies that the declared data form a well-formed theory: every sort’s context is a well-formed sequence of dependencies, every operation’s arity is coherent, every equation is well-typed in its context. A Theory value whose type-check fails is rejected with a diagnostic pointing at the specific declaration at fault.
The engineering pieces sit around the Theory. A parser takes a byte slice in the protocol’s native surface syntax — ATProto Lexicon JSON, Avro IDL, Protobuf .proto, Rust source, SQL DDL — and returns a schema; an emitter does the reverse. A registration binds the theory, parser, and emitter together under a protocol identifier and installs them in panproto-protocols. The specific protocols — atproto, avro, relational, document, tree-sitter-derived — each live in their own module there; the case studies in Part IV walk through several in detail.
The relation of a protocol to its schemas is worth spelling out, because a reader coming from a database or serialisation-format background may have the words “schema” and “protocol” wired in the opposite direction. A protocol fixes the framework — what sorts a schema has available, what operations those sorts carry, what equations the operations satisfy. A schema fixes the choices inside that framework. Every ATProto lexicon ever written is a schema under the same protocol; every Avro record definition is a schema under a different protocol; the parser produces the Schema, the theory validates it.
Schemas as models
A schema under a protocol $P$ is a model of $P$’s GAT, in the sense of Algebraic and generalised algebraic theories. A model of a GAT in a contextual category $\mathcal{C}$ is a structure-preserving functor from the theory’s syntactic category into $\mathcal{C}$; for panproto, $\mathcal{C}$ is $\mathbf{Set}$ with its canonical contextual structure. Concretely, then, a schema under $P$ is an assignment of a set to each sort of $P$’s theory (or, for dependent sorts, a family of sets indexed by the interpretations of the sort’s context), and an assignment of a function to each operation, such that the theory’s equations hold in the interpretation. The equations are respected because they are baked into the morphisms of the syntactic category, and every functor sends equal morphisms to equal morphisms — a property we wrote down once in Functors and natural transformations and do not have to re-derive.
The Rust representation is a Schema value from panproto-schema. A Schema is constructed by a SchemaBuilder in two phases: sort and operation interpretations are supplied one at a time, and when the builder finishes, the validator in panproto_schema::validate checks the theory’s equations against the interpretation the builder has assembled. A schema that fails validation is not a model — something in its interpretation violates the protocol’s axioms — and the validator’s error message names the specific equation violated.
The two-phase construction is worth a moment. A reader might wonder why validation runs at the end rather than as each interpretation is added. The answer is that the equations a GAT imposes often refer to multiple operations, and an equation involving two operations cannot be validated until both are declared. The builder’s job is to accumulate the data; the validator’s job is to check it once the data is complete.
The category $\mathbf{Sch}_P$ of schemas under a protocol $P$ is, in this setup, the category $\mathbf{Mod}(T_P)$ of models of $P$’s GAT. Its morphisms are migrations, which the next chapter takes apart. The relational antecedent of the whole picture is Codd (1970), whose relational model is the special case in which the protocol is the theory of a relational schema; the same machinery handles the non-relational protocols of Part IV without modification.
A small worked protocol
An explicit small protocol will make both halves of the identification touchable. Consider a minimal address-book format, a protocol whose theory has two sorts and four operations.
The two sorts are Person and Address, both in the empty context (both global). The four operations are name : Person -> String, email : Person -> String, street : Address -> String, and lives_at : Person -> Address. No equations are imposed beyond the well-typedness conditions every theory gets for free.
The theory says what an address book consists of; it does not commit to a particular population of people. Two very different populations are schemas under the same protocol. A tiny population — two people sharing an address — is one:
- Person = {alice, bob}, Address = {home};
- name(alice) = "Alice", name(bob) = "Bob";
- email(alice) = "alice@example.com", email(bob) = "bob@example.com";
- street(home) = "34 Elm St";
- lives_at(alice) = lives_at(bob) = home.
A database of ten million subscribers, assigned to streets across the country, is also a schema under the addr protocol, differing from the tiny one only in the size of the sets and the domains of the operations. Both are Schema values of the same Rust type; what they have in common is the Theory reference they share.
In Rust, the theory is constructed once and registered:
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let theory = Theory::builder()
        .sort("Person")
        .sort("Address")
        .op("name", "Person -> String")
        .op("email", "Person -> String")
        .op("street", "Address -> String")
        .op("lives_at", "Person -> Address")
        .build()?;
    let protocol = Protocol::new("addr", theory, parser, emitter);
    panproto_protocols::register(protocol);
    Ok(())
}
A schema under the registered protocol is constructed by a separate builder call that refers back to the protocol:
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let protocol = panproto_protocols::get("addr")?;
    let schema = SchemaBuilder::new(&protocol)
        .interpret_sort("Person", vec!["alice", "bob"])
        .interpret_sort("Address", vec!["home"])
        .interpret_op("name", /* function */)
        // ...
        .build()?;
    Ok(())
}
Two schemas under the same protocol share the same Theory by reference, and are therefore directly comparable as models. Two schemas under different protocols carry different theories and are not; comparing them requires a theory morphism between the protocols, which is the subject of Protocol colimits.
The instance functor
A schema’s sort interpretations are sets, and the elements of those sets are the records a working developer calls “the data”. For the tiny address-book schema above, the records are alice and bob in Person — together with the values of the four operations applied to them — and home in Address together with its street value. Panproto calls a record-set-under-a-schema an Instance and implements it in the panproto-inst crate.
The assignment sending each schema $S$ to its set of instances, and each migration $m : S_1 \to S_2$ to the function that lifts $S_1$-instances to $S_2$-instances, is a functor

$$\mathrm{Inst} : \mathbf{Sch}_P \to \mathbf{Set}.$$
This is the instance functor of Functors and natural transformations. Its object part is what Instance represents; its morphism part is the lift function in panproto_mig::lift. The functoriality laws — composition of migrations corresponds to composition of lifts, identity migrations correspond to identity lifts — hold by construction, because panproto_mig::compose is written so that they do, and are verified by the crate’s test suite on a sampled state space.
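In equations, with $m_1 : S_1 \to S_2$ and $m_2 : S_2 \to S_3$ migrations under the same protocol:

$$\mathrm{Inst}(m_2 \circ m_1) = \mathrm{Inst}(m_2) \circ \mathrm{Inst}(m_1), \qquad \mathrm{Inst}(\mathrm{id}_S) = \mathrm{id}_{\mathrm{Inst}(S)}.$$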
Every construction in the rest of Part II either acts on $\mathbf{Sch}_P$ directly or acts on instances through $\mathrm{Inst}$, and no third mechanism is needed. Lenses, protolenses, colimits of schemas, merge-as-pushout — all of them are built on top of the two identifications of this chapter and the functor connecting them.
Further reading
For the general framework of categories of models of a theory, Lambek & Scott (1986) develops the setting of higher-order categorical logic and Jacobs (1999) is the encyclopedic reference. Sannella & Tarlecki (2012) covers the algebraic-specification tradition in which schemas-as-models sits natively.
For the specifically panproto-relevant lineage, the functorial-data-migration programme of Spivak (2012) and Spivak & Wisnesky (2015) develops the category-of-schemas setting in the relational case and proves the functoriality results the engine relies on. Wisnesky (2013) is the companion thesis that works out the implementation in CQL. Panproto differs in handling GAT-based rather than only Lawvere-theoretic protocols, and in integrating with version control, but the category-theoretic backbone is the same.
For the relational antecedent, Codd (1970) is the founding paper of the relational model, whose vocabulary most subsequent treatments still inherit.
Closing
The next chapter introduces theory morphisms and instance migration, developing the third of the three identifications: migrations as morphisms of models, and the triple of functors on categories of models that a morphism of theories induces.
Dependent sorts in practice
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The GATs chapter said, abstractly, that a sort may be indexed by an earlier sort’s inhabitants. This chapter pins down what that buys the working engineer, using the simply-typed lambda calculus as a worked example, and shows how panproto-gat typechecks a theory that exercises every dependent-sort feature the engine supports.
The point of the example is not that panproto is a type-theory engine; it is that the same dependent-sort machinery the engine uses to keep track of domain and codomain in a category protocol, or source and target vertex kinds in a graph schema, is expressive enough to encode a system programmers already recognise. If the machinery handles STLC, it handles the smaller dependencies that arise in ordinary protocols without strain.
STLC as a GAT
The theory has three sorts. The context sort Ctx and the type sort Ty are global. The term sort Tm is parameterised by a context and a type: Tm(G, A) stands for the collection of well-typed terms in context G at type A. A term in the empty context at type A is a different inhabitant from a term in a one-binding context at the same type, and the indexing records the difference.
The operations are familiar:
arrow : (A : Ty, B : Ty) -> Ty
extend : (G : Ctx, A : Ty) -> Ctx
emptyCtx : () -> Ctx
var_zero : (G : Ctx, A : Ty) -> Tm(extend(G, A), A)
lam : (G, A, B, body : Tm(extend(G, A), B)) -> Tm(G, arrow(A, B))
app : (G, A, B, f : Tm(G, arrow(A, B)), x : Tm(G, A)) -> Tm(G, B)
subst : (G, A, B, body : Tm(extend(G, A), B), x : Tm(G, A)) -> Tm(G, B)
Every output sort references the input parameters explicitly: var_zero lives in the extended context with the fresh variable’s type on the right; lam lives in the original context at the arrow type; app lives in the original context at the range. The meta-theoretic invariants that would require a side condition in an extrinsic presentation (e.g., “the source of the substitution matches the context of the body”) are structural in this presentation, because the sort itself names the context.
There is one equation, the β-law stated as a rewrite between two already-well-typed terms:
app(G, A, B, lam(G, A, B, body), x) = subst(G, A, B, body, x)
The equation typechecks because both sides have the same output sort, Tm(G, B), under a single inferred variable context. There is no metavariable capture to worry about, because the substitution is itself a named operation in the signature rather than a primitive of the meta-language. A type-theorist will recognise this as a “formal” or “explicit” substitution presentation, due ultimately to Martin-Löf (1984) and elaborated categorically by Dybjer (1996); the categorical story of the same encoding is worked out in depth by Hofmann (1997).
What panproto-gat does with this
The typechecker in panproto_gat::typecheck accepts the theory as stated. For each operation, it walks the signature accumulating a substitution from input parameter names to concrete argument terms; the expected sort for each argument is the declared input sort under that substitution, and the inferred output sort of the application is the declared output sort under the final substitution. For the equation specifically, it runs Robinson unification over the two sides’ sort expressions, yielding a single context assignment that types both halves to the same Tm(G, B).
Concretely, applying app to a well-chosen argument tuple reduces to a substitution computation. Take f with context-declared sort Tm(Γ, arrow(A, B)) and x with sort Tm(Γ, A): the typechecker instantiates G := Γ, A := A, B := B, f := f, x := x, and the output sort is Tm(Γ, B). Give x the wrong sort — say Tm(Γ, C) when app expects Tm(Γ, A) — and the typechecker rejects the application with an ArgTypeMismatch error that pins the failure to the specific argument position.
The integration test at tests/integration/tests/stlc_gat.rs exercises exactly this flow, end to end, including the equation and a deliberate ill-typed rejection.
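For readers who want the walk spelled out, here is a toy version of the substitution-accumulating check — self-contained, and not the real panproto_gat::typecheck. Sorts and terms are conflated into one first-order term type, which is all the explicit five-argument spelling needs:

use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Term {
    Var(String),            // a signature parameter such as G, A, B
    App(String, Vec<Term>), // a former such as Tm(extend(G, A), B)
}

fn apply(t: &Term, env: &HashMap<String, Term>) -> Term {
    match t {
        Term::Var(v) => env.get(v).cloned().unwrap_or_else(|| t.clone()),
        Term::App(f, args) => Term::App(f.clone(), args.iter().map(|a| apply(a, env)).collect()),
    }
}

/// Walk a declared signature left to right, binding each parameter to its
/// argument and checking the argument's sort against the declared input
/// sort under the bindings accumulated so far.
fn check_app(
    params: &[(&str, Term)], // declared (parameter name, input sort)
    out: &Term,              // declared output sort
    args: &[(Term, Term)],   // supplied (argument, argument's sort)
) -> Result<Term, String> {
    let mut env: HashMap<String, Term> = HashMap::new();
    for ((name, declared), (arg, sort)) in params.iter().zip(args) {
        let expected = apply(declared, &env);
        if *sort != expected {
            return Err(format!("ArgTypeMismatch at {name}: expected {expected:?}, got {sort:?}"));
        }
        env.insert(name.to_string(), arg.clone());
    }
    Ok(apply(out, &env)) // the inferred output sort of the application
}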
Implicit type arguments
The signature listed above is the canonical one, and it is the signature the typechecker accepts. It is also awkward to write at a use site. A working developer wants app(f, x), not app(G, A, B, f, x): the three context-and-type arguments are recoverable from the sorts of f and x by unification, and asking the user to spell them out is redundant.
The 0.37 release added an Implicit::Yes tag on each operation input, which marks a parameter as recoverable at the call site rather than supplied explicitly. The STLC fixture marks G, A, B implicit on lam, app, and subst, and the signatures now read
lam : (G, A, B {implicit}, body : Tm(extend(G, A), B)) -> Tm(G, arrow(A, B))
app : (G, A, B {implicit}, f : Tm(G, arrow(A, B)), x : Tm(G, A)) -> Tm(G, B)
subst : (G, A, B {implicit}, body : Tm(extend(G, A), B), x : Tm(G, A)) -> Tm(G, B)
with an associated convention that the three implicit positions are filled by unifying the explicit argument sorts against the declared input sorts. A call app(f, x) typechecks by solving for G, A, B given f’s sort and for the same three given x’s sort; the two solutions agree, the implicit inputs are bound, and the output sort is returned. The original five-argument spelling is still legal, which matters for backward compatibility: every theory that passed 0.36 continues to pass 0.37 without edits.
The mechanism is not specific to STLC. Any operation whose early inputs are determined by the sorts of its later inputs is a candidate for implicit tagging, and the typical signature in a protocol theory has at least one such input. The reward is that the worked examples of Part IV now look the way the working developer would write them rather than the way the theory demands them.
Closed sorts and total case analysis
STLC has no closed sort in the 0.37 sense, because its type sort is open: a theory author can always extend the set of types by declaring a new ty_ operation. Protocol theories of practical interest, however, often do pick a fixed constructor list up front. A protocol that emits Stan programs from a statistical schema, for instance, carries a sort Distr whose inhabitants are exactly the Stan distribution primitives (normal, beta, gamma, and the remaining few dozen), and that list does not grow.
Before 0.37 the emitter for such a protocol had to be a partial function: given a Distr it pattern-matched on the head name and fell through to an unreachable! clause the engine could not prove was never taken. With closed sorts and Term::Case, the same emitter is a total case expression. The sort declaration carries a SortClosure::Closed(vec!["normal", "beta", "gamma", ...]), the emitter is a Case whose branches cover exactly that list, and the typechecker verifies coverage at the declaration site rather than deferring the obligation to runtime. The unreachable! is gone and the engine can prove it was never needed.
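A plain-Rust analogy makes the shape of the guarantee visible. The real mechanism goes through SortClosure::Closed and Term::Case rather than a Rust enum; the analogy only shows what totality buys:

// Before: an open, string-keyed dispatch. The compiler cannot see that the
// fallthrough arm is dead, so the obligation lives at runtime.
fn emit_open(head: &str) -> String {
    match head {
        "normal" => "normal_lpdf".to_string(),
        "beta" => "beta_lpdf".to_string(),
        "gamma" => "gamma_lpdf".to_string(),
        _ => unreachable!("unknown distribution"),
    }
}

// After: a closed sort is the moral analogue of an enum, and the dispatch is
// a total match whose coverage is checked where the type is declared.
enum Distr { Normal, Beta, Gamma }

fn emit_closed(d: &Distr) -> String {
    match d {
        Distr::Normal => "normal_lpdf".to_string(),
        Distr::Beta => "beta_lpdf".to_string(),
        Distr::Gamma => "gamma_lpdf".to_string(),
    }
}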
The feature generalises. Any protocol whose theory picks a fixed constructor list for some sort — the SQL type system’s primitives, the Avro primitive-type set, the ATProto lexicon primitive list — is now expressible as a closed sort, and any transform that dispatches on that sort is now a total Case rather than a string-keyed lookup table. How panproto-gat represents dependent sorts states the typechecker conditions; this chapter’s purpose is to say that the feature applies to the protocols in Part IV and not only to the small inductive types.
Why this encoding sidesteps capture-avoiding substitution
The textbook concern with encoding λ-calculus as an algebraic theory is that naive substitution captures free variables: substituting a term with a free x into a context that already binds x silently changes the term’s meaning. The usual remedies are de Bruijn indices, nominal sets, higher-order abstract syntax, or ad-hoc freshness side conditions. Each remedy has its own costs and complexities.
The encoding above pays a different price and sidesteps the problem entirely. There is no binder in the meta-language: lam takes a body whose sort names the extended context directly, and var_zero is an explicit variable-projection operation rather than a name that has to be compared structurally. Because the meta-language’s only operation on variables is the one the theory declares, capture is impossible: the engine never has the opportunity to alpha-rename anything. The β-rule’s right-hand side calls the declared subst operation, whose semantics are whatever the theory’s equations fix; no appeal to a meta-level substitution function is required.
The cost is that the theory has to carry an explicit extend / var_zero / subst triple, and the bookkeeping that a structural substitution would provide for free in a traditional presentation has to be stated as equations. For a language as small as STLC, this is a fair trade. For a full dependent type theory the same style scales up, at the price of more equations, as Dybjer (1996) and the subsequent categories-with-families literature work out in detail.
What this unlocks for protocols
The machinery that makes STLC work is not STLC-specific. Any protocol whose sorts carry indexing — a relational schema where a column’s sort depends on its table, a graph schema where an edge’s sort depends on its source and target vertex kinds, a message schema where a field’s sort depends on a discriminator — uses the same mechanism. The GATs chapter (How panproto-gat represents dependent sorts) states the mechanism abstractly; this chapter is the worked case that makes the abstraction concrete.
The typical indexing depth for the protocols covered in Part IV is shallow (one or two parameters), far short of what STLC demands. Reading the STLC example is therefore a way of confirming that the engine is not near its limits when it handles the simpler cases; it is doing, in production, a much weaker version of what the worked example exercises.
Equality of signatures across theories
Parameter names inside an operation’s signature are local binders, not names a morphism is expected to preserve. Two theories can present the category’s identity morphism as id : (x : \mathrm{Ob}) \to \mathrm{Hom}(x, x) and id : (y : \mathrm{Ob}) \to \mathrm{Hom}(y, y) respectively and still be canonically isomorphic; a morphism between them that maps id to id is categorically well-formed, and check_morphism accepts it.
The mechanism is a positional alpha-rename. When comparing a domain operation f : (p_1, \ldots, p_n) \to T against its image f' : (q_1, \ldots, q_n) \to T' in the codomain, the checker builds the substitution \sigma = [p_1 \mapsto q_1, \ldots, p_n \mapsto q_n] and compares T.subst(σ) against T' rather than T against T' directly. The same substitution is applied to each input sort and to any dependent sort-parameter blocks. Reordering parameter positions (say, swapping (x, y) for (y, x)) is not an alpha-rename and is correctly rejected; the invariance is purely in the choice of binder names, not in their arity or order. The helpers positional_param_rename and signatures_equivalent_modulo_param_rename expose the same comparison to callers that want to perform signature-equality checks outside of check_morphism; the colimit machinery uses them when merging two theories that share an operation name.
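A toy version of the comparison, over first-order sort expressions, shows the mechanism; the real helpers operate on panproto’s own sort representation, but the positional rename is the same idea:

use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Sort {
    Var(String),
    App(String, Vec<Sort>),
}

fn rename(s: &Sort, sigma: &HashMap<String, String>) -> Sort {
    match s {
        Sort::Var(v) => Sort::Var(sigma.get(v).cloned().unwrap_or_else(|| v.clone())),
        Sort::App(f, args) => Sort::App(f.clone(), args.iter().map(|a| rename(a, sigma)).collect()),
    }
}

/// Positional alpha-rename: map the i-th domain parameter name to the i-th
/// codomain parameter name, then compare the renamed sorts directly.
fn equivalent_modulo_param_rename(
    dom_params: &[(String, Sort)],
    dom_out: &Sort,
    cod_params: &[(String, Sort)],
    cod_out: &Sort,
) -> bool {
    if dom_params.len() != cod_params.len() {
        return false; // arity mismatch is a genuine difference, not a rename
    }
    let sigma: HashMap<String, String> = dom_params
        .iter()
        .zip(cod_params)
        .map(|((p, _), (q, _))| (p.clone(), q.clone()))
        .collect();
    // Input sorts must agree under the substitution, position by position,
    // and so must the output sort.
    dom_params
        .iter()
        .zip(cod_params)
        .all(|((_, ds), (_, cs))| rename(ds, &sigma) == *cs)
        && rename(dom_out, &sigma) == *cod_out
}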
Further reading
Cartmell (1986) gives the framework. Martin-Löf (1984) and Dybjer (1996) develop the type-theoretic program GATs serve. Hofmann (1997) works out the syntactic–categorical correspondence in the detail a reader implementing a dependent-type engine would want. The integration tests at tests/integration/tests/stlc_gat.rs and tests/integration/tests/dependent_sorts.rs are the runnable counterparts of the examples in this chapter.
Typeclasses as theory morphisms
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A Haskell programmer reads Eq a => a -> a -> Bool and sees a familiar story: a type a is in the class Eq if it supplies an == operation satisfying the class’s axioms, and a function that demands Eq a is one whose body uses ==. Typeclasses are the mechanism by which Haskell attaches a set of operations, and their laws, to a type; an instance is the evidence that a given type supports those operations.
Typeclasses are not a new mathematical primitive. They are a notational convenience for something we already have: a typeclass is a theory, an instance is a morphism from that theory into a target theory, and the instance’s bindings are the morphism’s action on the class’s operations. This chapter says what that correspondence looks like in panproto, and how panproto-theory-dsl and panproto-gat-macros give it a working surface.
The correspondence
Take Haskell’s Eq class. It declares one operation, ==, of type a -> a -> Bool, together with the reflexivity, symmetry, and transitivity axioms. Written as a theory:
theory ThEq {
sorts { A; Bool }
ops { eq : (A, A) -> Bool }
axioms { refl: eq(x, x) = true }
}
An instance instance Eq Int where x == y = primIntEq x y says: take the class theory ThEq, pick the target theory where the primitive primIntEq lives (a theory of arithmetic), and declare a morphism that sends the class’s sort A to Int and the class’s operation eq to primIntEq. The morphism obligation is the axiom: primIntEq(x, x) = true has to hold in the target theory, which is how panproto’s typechecker verifies the instance.
The correspondence extends uniformly:
| Haskell | panproto |
|---|---|
| class C a where ... | Theory declaring the class’s sorts, ops, and axioms |
| instance C T where op = expr | TheoryMorphism from the class theory to a target theory |
| C a => ... constraint | A precondition that the morphism from C’s theory exists |
| Class axiom | Equation in the class theory, required to hold under the morphism |
| Default method | Equation in the class theory giving a derived-op definition |
| Functional dependency a -> b | Sort parameter b implicitly determined by a in the class |
The Haskell-to-panproto translation is always the same shape: the class is a theory with a distinguished sort the instance will replace, the instance is a morphism that picks the replacement and supplies implementations for every op the class declares.
Two surfaces
Panproto exposes two authoring surfaces for this correspondence, at opposite ends of the config-versus-code spectrum, both compiling to the same Theory and TheoryMorphism values.
The config surface is the panproto-theory-dsl crate, which parses a TheoryDocument from Nickel, JSON, or YAML and compiles it through the compile_class, compile_instance, and compile_inductive entry points. A class document declares the class’s sorts, ops, and axioms; an instance document declares the source class, the target theory, and the bindings; the compiler resolves the target theory, builds the morphism, and returns a validated TheoryMorphism. This is the surface a user reaches for when theories live as data in a repository and are edited alongside the protocols they describe.
The code surface is the panproto-gat-macros crate, which offers three proc-macros. class! takes a class declaration in a Rust-ish syntax and expands to a theory_<name>() constructor returning the class’s Theory. instance! takes an instance declaration and expands to an instance_<name>(&class, &target) function returning a TheoryMorphism. inductive! takes an inductive-type declaration and expands to the closed-sort Theory it names. The macros are the surface for users who prefer to author theories inline with the Rust code that consumes them.
use panproto_gat_macros::{class, instance};
class! {
ThEq<A> {
eq(x: A, y: A) -> Bool;
axiom refl: eq(x, x) = true;
}
}
instance! {
EqInt: ThEq<Int> in ThArith {
eq = int_eq;
}
}
The choice between the two surfaces is a style question, not a feature question: both compile to the same validated TheoryMorphism and both go through the same typechecker. The fact that the two compile to identical values is what lets a theory author mix them in a single project; a core class in the DSL can have a downstream Rust crate declare an instance via the proc-macro, and the morphism composes with any other morphism in the system by the rules of Theory morphisms and instance migration.
Deriving standard instances
Haskell’s deriving (Eq, Ord) mechanism asks the compiler to synthesize instances of standard classes by structural recursion on the type’s constructors. The panproto counterpart is derive_theory! in panproto-gat-macros, which accepts a theory-declaration block annotated with #[derive(Eq)] or #[derive(Hash)] and emits instance-builder functions for each derivation alongside the base theory. The two derivations cover the two capabilities every panproto sort is expected to carry (decidable equality and hashability), and are the common case that makes the instance! macro worth having. Extending the set of derivable classes is a deliberate, per-class piece of work; the 0.37 release supplies the two that every protocol needs and stops there.
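A usage sketch follows; the block syntax is assumed by analogy with the class! macro above, and the real grammar is whatever panproto-gat-macros defines:

use panproto_gat_macros::derive_theory;

// Hypothetical syntax, by analogy with the class! block earlier in this
// chapter; the macro emits the base theory constructor plus an
// instance-builder function per derived class.
derive_theory! {
    #[derive(Eq, Hash)]
    ThColor {
        red() -> Color;
        green() -> Color;
        blue() -> Color;
    }
}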
Why a morphism rather than a record of functions
A working engineer could ask why the machinery is a theory morphism rather than a record of function pointers. The answer is that the morphism carries the theory’s equations along with its operations: the instance’s validity is not only that each declared op has a matching target op, but that the target ops satisfy the class’s axioms under the morphism. A record of function pointers is a data-only mapping; a theory morphism is a mapping that the engine can verify preserves the class’s laws. The verification is what makes the instance trustworthy in the same way Haskell’s axioms (informal as they are in the language proper) make a well-written Eq instance trustworthy. In panproto the axioms are part of the class theory, the instance is required to satisfy them under the morphism, and the typechecker rejects an instance whose bindings violate an axiom, with a diagnostic pointing at the specific axiom and specific binding.
Further reading
Typeclasses as we know them in Haskell are due to Wadler and Blott, whose 1989 POPL paper introduced the ad-hoc-polymorphism-via-implicit-dictionaries design that the language adopted; see Wadler (1989) for the antecedent work on parametricity that grounds the design. The categorical story of classes as theories and instances as morphisms is standard in the algebraic-specification tradition; Goguen & Burstall (1992) is the institutions framework where the correspondence is stated most cleanly, and the modular-specification apparatus of the same line of research is what underlies the namespaced-imports machinery in panproto-theory-dsl.
Closing
The typeclass surface is where the panproto-gat machinery meets the idiomatic shapes a working developer expects. A class is a theory, an instance is a morphism, and the two surfaces (DSL and proc-macros) are two ways to write the same objects. The next chapter, The restrict/lift pipeline, turns back to the core migration construction and shows how a theory morphism drives instance migration end to end.
Theory morphisms and instance migration
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Every change of schema in a working system is a migration waiting to happen. Add a field and somebody has to decide what to do for the records that did not have it; rename a field and somebody has to decide how to reconcile the old name with the new; merge two schemas and somebody has to decide what the shared structure means. Doing this by hand, as most teams still do, is how production incidents begin.
The central claim of this chapter, due in its categorical form to Spivak (2012), is that every such change of schema is the pullback of a theory morphism — plus, when the change extends rather than restricts, a choice between two universal strategies for filling in what the source did not supply. The chapter unpacks the claim. Panproto’s panproto-mig crate is the implementation of what the claim prescribes, and the remainder of Part II shows what the implementation looks like stage by stage.
Theory morphisms
A theory morphism from a GAT S to a GAT T is a translation of the first theory’s vocabulary into the second’s that respects structure on both sides. Concretely, a theory morphism assigns each sort of S to a sort of T; each operation of S to an operation of T with matching arity after translation (the argument sorts and result sort of the S-operation, translated through the sort assignment, must match the signature of the T-operation chosen for it); and each equation of S to a consequence of T’s equations under the translation.
Equivalently, and perhaps more pleasantly, a theory morphism is a structure-preserving functor between the contextual categories the GATs generate. The functor laws reappear as the three conditions above, lifted to the dependent-sort setting; the contextual structure — how sorts and operations depend on free variables — must match across the translation as well.
Panproto represents a theory morphism by a Morphism value in panproto-gat. The type-checker verifies each of the three conditions. Every source sort’s image must exist in the target theory with the right dependencies; every source operation’s translated body must type-check in the target’s context; every source equation must be derivable from the target’s equations under the translation. The last of these is the deepest check and is the one that most often rejects a proposed morphism that the two theories do not actually support.
The full statement of the third condition is derivability: the translated equation must follow from the target’s equational theory, not merely appear verbatim in its list of axioms. The implementation presently enforces a conservative approximation: for every source equation lhs = rhs, the translated pair F(lhs) = F(rhs) is matched alpha-equivalently against the target’s declared equations. A morphism whose image is derivable via directed rewrites or via a chain of other equations, but is not already listed literally, is rejected under the current check. This is sound but incomplete; complete preservation via normalization or congruence closure against the target’s full equational theory is a queued follow-up, tracked against check_morphism. Readers designing morphisms between theories whose axioms are equivalent-but-not-literal should be aware of the stricter spelling the checker currently accepts.
Two shapes that recur
Two kinds of theory morphism come up often enough to be worth naming. An inclusion expresses that T extends S with new sorts, operations, or equations, with every symbol of S mapping to itself in T. Adding a new field to a record schema, adding a new table to a relational schema, tightening a constraint on an existing field — all of these are inclusions. A quotient expresses that T identifies some symbols of S or imposes new equations on them; each symbol of S maps to its equivalence class under the new identifications. Renaming two fields to the same name, or adding an equation that forces two operations to agree, are quotients.
Most real migrations are neither pure inclusions nor pure quotients but combinations: a morphism that adds a new field (an inclusion) while renaming an old one (a quotient), for instance. Panproto’s migration engine decomposes each migration into its inclusion and quotient components internally for the existence checker’s benefit, but the developer writes one Morphism value covering both.
The three migration functors
A theory morphism does not just translate symbols; it induces three functors between the categories of models, each with a distinct operational meaning.
Write F : S \to T for the morphism. The three sit in an adjoint relationship:

\Sigma_F \dashv \Delta_F \dashv \Pi_F

In words: \Delta_F goes from T-models to S-models; its two adjoints \Sigma_F and \Pi_F go the other way; and each adjoint is pinned down up to unique isomorphism by the universal property of being left or right adjoint to \Delta_F. The three functors take distinct operational shapes at the data level, and we take them one at a time.
The pullback functor
The pullback \Delta_F is the simplest of the three, and the one panproto uses most. Given a T-model M, the pullback \Delta_F M is the S-model obtained by reading M through F: each sort of S is interpreted as M’s interpretation of its image under F, and each operation of S is interpreted as M’s interpretation of its image. Concretely, if F sends the sort s of S to the sort F(s) of T, then \Delta_F M’s interpretation of s is M’s interpretation of F(s). No data is created, no data is thrown away; the pullback is a relabelling.
Functoriality is immediate: a morphism M \to N of T-models induces a morphism \Delta_F M \to \Delta_F N with the same underlying assignment, read through F. The pullback functor is implemented in panproto_mig::lift and is the cheapest of the three at runtime — no data-level computation beyond relabelling. When a migration reduces a richer schema to a smaller one (a forgetful migration), \Delta_F is the functor doing the work, and the next chapter calls this operation the restrict half of its pipeline.
The pushforward functors \Sigma_F and \Pi_F
The two pushforwards go the other way: both are functors from the category of S-models to the category of T-models. They differ in how they handle the new structure T demands but S-models cannot supply.
\Sigma_F, the left adjoint, is obtained by freely adding whatever new structure the target theory asks for. If T extends S with a new operation, \Sigma_F M has the new operation interpreted as a free choice at every element, with every possible value admitted; if T extends S with a new sort, \Sigma_F M interprets that sort as the set of all possible values. Formally, \Sigma_F M is the smallest T-model from which M can be recovered by pullback.
\Pi_F, the right adjoint, is obtained by universal selection rather than free expansion. Given an S-model M, \Pi_F M is a T-model whose elements are precisely those elements of M that admit a unique extension compatible with the target theory. Where \Sigma_F is maximally permissive, \Pi_F is maximally restrictive: it includes only what is forced.
A reader may ask why two pushforwards are needed. The answer lies in the adjointness. \Sigma_F answers the question what is the smallest T-model that recovers M by pullback? and \Pi_F answers what is the largest T-model whose pullback equals M? The two coincide only in trivial cases, and diverge whenever T’s new structure admits ambiguity. In practical terms, \Sigma_F corresponds to “fill new fields with defaults” and \Pi_F to “only keep rows whose new fields are fully determined”. Having both available is not a luxury: different migrations want different strategies at different sites, and real schema evolution routinely needs both.
This triple of functors is the framework of functorial data migration, developed in the relational setting by Spivak (2012), refined in Spivak & Wisnesky (2015), and worked out in executable form as the CQL system of Wisnesky (2013). Panproto adopts it essentially unchanged, with one generalisation: Spivak’s original work is stated for Lawvere theories, and panproto extends the same three functors to GATs. The extension is mathematically straightforward, because contextual categories admit the same adjoint structure as categories with finite products, but it is what lets panproto handle schema languages with dependent structure that Lawvere theories cannot express directly.
A worked example
The three functors are easier to read in an example than in the abstract. Take the running case from Part I.
Let S be the theory of a one-field record: one sort Person, one operation name : Person -> String, no equations. Let T extend S with a second operation email : Person -> String. The theory morphism F : S \to T sends Person to Person and name to name and declares email to be new in T.
Start with a specific T-model M: a three-person address book with names and emails. The pullback \Delta_F M is the same three people with only their names. The email column has been forgotten, because S has no operation for it. That is what a pullback does.
Now go the other way. Start from an S-model N — the three-name address book without emails — and ask what a T-model compatible with it should look like.
\Sigma_F N, the left adjoint’s answer, is the smallest such T-model. Because T has a new operation that N knows nothing about, \Sigma_F N must supply some email for each person, and it does so freely: every possible email assignment is admitted. The population of \Sigma_F N contains one entry for every pair (person of N, possible email string), which is a very large set.
This is almost never what a developer actually wants, and it is worth understanding why the mathematics gives us an answer a developer would reject in practice. The mathematics wants the universal answer, and the universal answer is to admit every possibility, because any commitment to a specific email would impose structure the source model does not justify. Panproto’s migration DSL therefore accepts a restricted form of \Sigma_F: the developer supplies a rule that picks a single email for each person — a default, a computed value, an empty string — and the engine compiles the rule into a \Sigma-style pushforward that uses the rule at every extension site. The underlying category-theoretic construction is \Sigma_F; the practical construction panproto calls “fill with default” is a restriction of it to a specific choice.
\Pi_F N, the right adjoint’s answer, is the largest T-model whose pullback equals N. For the schema as stated, it is empty: no person can be extended to a T-record unambiguously, because every email value is allowed. \Pi_F becomes operationally useful when the target theory carries constraints that eliminate ambiguity. If T imposes an equation saying every unset email is the value "unknown", then \Pi_F N extends N by inserting the default; if T requires email values to match a pattern that determines them from the name, \Pi_F N drops every person whose name does not already force a unique email.
In panproto’s vocabulary, the pullback \Delta_F is the restrict of The restrict/lift pipeline, and the combination of \Sigma_F and \Pi_F at various sites of a migration is the lift. The engine does not supply naïve \Sigma_F and \Pi_F; it supplies the restricted forms the developer asks for, under the declarations the migration DSL expresses.
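The worked example can be rendered executable with no panproto API at all: instances as vectors, the restricted pushforwards as ordinary functions. The determinacy test in the \Pi case is a stand-in for a real constraint:

// S-instances: names only. T-instances: (name, email) pairs.
type SInst = Vec<String>;
type TInst = Vec<(String, String)>;

// Δ: the pullback forgets the email column.
fn pullback(m: &TInst) -> SInst {
    m.iter().map(|(name, _)| name.clone()).collect()
}

// Σ restricted by a default rule: fill the new field with a chosen value.
fn pushforward_default(n: &SInst, default: &str) -> TInst {
    n.iter().map(|name| (name.clone(), default.to_string())).collect()
}

// Π under a determining constraint: keep only the rows whose email is
// forced — here, a rule that derives the email from the name.
fn pushforward_forced(n: &SInst) -> TInst {
    n.iter()
        .filter(|name| name.is_ascii()) // stand-in determinacy test
        .map(|name| (name.clone(), format!("{name}@example.org")))
        .collect()
}

fn main() {
    let n: SInst = vec!["alice".into(), "bob".into(), "carol".into()];
    let filled = pushforward_default(&n, "unknown");
    assert_eq!(pullback(&filled), n); // Δ after the restricted Σ recovers N
    let forced = pushforward_forced(&n);
    assert!(pullback(&forced).iter().all(|x| n.contains(x))); // Π keeps only rows of N
}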
Panproto’s packaging
A panproto migration is more than a theory morphism. It is a theory morphism together with a declaration of what to do at each extension site — which strategy among \Sigma_F and \Pi_F, with what specific rule — expressed in the migration DSL.
The Rust representation is a Migration value from panproto-mig. Construction goes through the migration DSL, whose surface syntax belongs to Part III. Compilation — the translation from the symbolic declaration to a runtime lift function — goes through panproto_mig::compile. Execution is panproto_mig::lift, which applies a compiled migration to a specific source instance.
When the user does not already hold a migration and wants the engine to propose one, the panproto-mig::align module seeds the candidate pool with anchor correspondences from a family of pluggable strategies. The 0.37 release broadens the family. edge_label_anchors compares the label multisets of incident edges, which catches two vertices that share the same outgoing edge names even when their own names diverge. suffix_anchors keys on the terminal dotted segment of a namespaced identifier, which lines up app.bsky.feed.post#author with com.example.post#author without a hint. description_anchors runs token similarity over the human-readable description metadata, useful when the two schemas have synonymous names. neighborhood_anchors propagates anchors outward from an already-matched seed, so a single high-confidence correspondence spreads through adjacency. wl_anchors runs Weisfeiler-Leman structural refinement and matches vertices whose WL colors agree at a fixed iteration. The existing preservation-note still applies: the engine validates every anchor against the kinds and constraints of both sides before it enters the candidate pool, and an anchor that violates the target theory’s equations is rejected before compilation begins.
Composition of migrations lives in panproto_mig::compose. The functor axioms of Functors and natural transformations reappear here as the crate’s compilation invariants: compiling the composite of two migrations produces the same runtime function as composing the compiled migrations separately; the identity migration on a schema compiles to the identity function on its instances. Both invariants are load-bearing. A migration engine that fails them is producing different answers depending on how it chose to evaluate a migration chain, which is the worst kind of bug — present intermittently, hard to reproduce, impossible to diagnose without understanding the mathematics the engine is supposed to be implementing.
Further reading
The foundational source for the triple in the database setting is Spivak (2012). Spivak & Wisnesky (2015) refines the construction for the relational case; Wisnesky (2013) is the thesis that works out the implementation in CQL, which is the closest existing system to panproto’s migration engine.
For the broader setting of adjoint functors, Mac Lane (1998) chapter IV is canonical, Awodey (2010) chapter 9 (“Adjoints”) gives the material at undergraduate register, and Riehl (2016) chapter 4 (“Adjunctions”) is the modern treatment.
For the relational-database antecedent, Codd (1970) is the founding paper, and Kleppmann (2017) chapter 2 (“Data Models and Query Languages”) gives the working-developer’s view of the same ideas without category theory. Reading the two together is a useful exercise in recognising how a single idea can be stated at very different levels of abstraction.
Closing
The next chapter, The restrict/lift pipeline, takes a Migration value apart into its compilation stages: existence checking, restrict, lift, compose, invert. Each stage performs one operation on the data assembled above, and the decomposition is what lets the engine diagnose failures at the earliest point a migration can be seen to go wrong.
The restrict/lift pipeline
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The previous chapter handed us a compact mathematical statement: a migration is a theory morphism together with a choice of pushforward at each extension site. That sentence is exactly one step removed from what a developer sees when they use panproto-mig. The crate takes the mathematical object apart into five stages, each performing one sharply-defined operation, so that a migration rejected or misbehaving is rejected or misbehaving at a specific stage with a specific diagnostic.
The decomposition is the whole reason the crate is not one function called migrate. A monolithic migrate operation can report a failure only as the collapsed state of everything at once: the engine knows that the migration is broken and not why. The five-stage pipeline, in exchange for its apparent fuss, gives the engine something to say about where the failure lies — often specific enough that the fix is one line of the migration DSL.
This chapter walks through the five stages in order. The progression through them, and the kind of diagnostic each one produces when it refuses, is the main argument for why the pipeline is structured the way it is.
Existence checking
The first question the engine asks of a user-written migration is whether it can possibly be a morphism of models at all. The answer depends only on the two theories involved and on the declared translation between them: no instance data, no records, no lifting. If the translation is inconsistent with the target theory, the migration cannot be a valid morphism regardless of what data is thrown at it, and the engine should say so before reading a single record.
Existence checking, implemented in panproto_mig::existence, is the stage that does this. A migration fails existence checking when its theory-morphism component cannot be the map it is declared as — a sort translated to a symbol the target theory does not have, an operation translated with mismatched arity, a renaming that violates one of the target’s equations. The failure can also be more subtle: a pushforward declaration that asks for a universal property the source data cannot supply, a chain of operations whose combined dependencies do not resolve under the translation.
In every case the existence checker’s job is to find the smallest theory-level site responsible for the failure and report it. A well-formed migration yields a witness of existence that subsequent stages consume; an ill-formed one yields a structured diagnostic pointing at the specific sort, operation, or equation at fault.
Existence checking touches no instances, reads only the two theories and the migration declaration, and runs in time roughly proportional to the size of the declaration. It is the stage that panproto schema check runs when a developer validates a migration before attempting to apply it, and every subsequent stage presumes that its input has already passed.
Compilation
Compilation turns a migration the user wrote into a representation the engine can actually apply to instances. The stage lives in panproto_mig::compile and produces a Migration value whose internals carry the compiled per-sort lift functions, the user’s pushforward decisions, and the pre-computed data dependencies between fields of the source and target schemas.
The artefact compilation produces, per sort of the source theory, is a small function in panproto’s expression language (which Part III develops). Each such function specifies how a single record of the source is carried into a record — or into a set of records, or into a failure — of the target. Decomposing the migration into per-sort functions keeps compilation’s output compact and makes the runtime lift parallelisable naturally across records.
The three migration functors of the previous chapter all appear during compilation, one per site, assembled from the user’s declaration. A \Sigma-style free expansion becomes a function that returns the constrained set of extensions the developer’s default rule picks out; a \Pi-style universal selection becomes a function that returns at most one extension, namely the unique one compatible with the target theory’s equations; a \Delta-pullback becomes a projection that forgets material the target does not use. Each per-sort function is statically typed in the Rust representation, which means ill-typed compositions of stages fail to compile before any record moves.
Compilation is the engine’s last chance to catch structural problems ahead of execution. After this stage, the remaining work is data.
Lifting
Lifting is the only stage that touches instances. The function panproto_mig::lift takes a compiled migration and an input instance and returns an output instance in the target schema.
The implementation walks the input instance’s record graph, applies the appropriate per-sort lift function at each record, and assembles the target instance’s record graph from the results. The walk is ordered to respect data dependencies between sorts — sorts that feed into others’ lift functions are processed first, and the topological ordering is computed at compile time so the walk itself is straightforward.
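The ordering is an ordinary topological sort of the sort-dependency graph. A minimal version, assuming the dependency edges have already been extracted from the compiled migration:

use std::collections::{HashMap, VecDeque};

/// Kahn's algorithm over the sort-dependency graph: an edge a -> b means
/// sort b's lift function reads records of sort a, so a is processed first.
fn lift_order(deps: &HashMap<String, Vec<String>>) -> Option<Vec<String>> {
    // Indegree counts, seeded with every node that appears anywhere.
    let mut indegree: HashMap<&str, usize> = deps.keys().map(|k| (k.as_str(), 0)).collect();
    for targets in deps.values() {
        for t in targets {
            *indegree.entry(t.as_str()).or_insert(0) += 1;
        }
    }
    let mut queue: VecDeque<&str> = indegree
        .iter()
        .filter(|(_, d)| **d == 0)
        .map(|(k, _)| *k)
        .collect();
    let mut order = Vec::new();
    while let Some(s) = queue.pop_front() {
        order.push(s.to_string());
        for t in deps.get(s).into_iter().flatten() {
            let d = indegree.get_mut(t.as_str()).unwrap();
            *d -= 1;
            if *d == 0 {
                queue.push_back(t.as_str());
            }
        }
    }
    (order.len() == indegree.len()).then_some(order) // None reports a cycle
}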
A number of correctness properties hold by construction rather than by runtime check. The output instance is in the target schema named in the migration, not a possibly-valid approximation of it, because the compile stage arranged the lift function to produce exactly the declared target. Every equation of the target theory holds in the output, because the compile stage arranged for the target’s equations to reduce to source-level equations the input already had to satisfy. The lift respects record-level identity: if the source instance has two indistinguishable records in some sort, the output has either zero or one indistinguishable image of them, with the pushforward choice determining which.
At the mathematical level, lifting is the action of the instance functor \mathrm{Inst} on morphisms. \mathrm{Inst} was introduced in Functors and natural transformations; here is where we see its morphism part operating on specific input.
Composition
Two migrations whose ends meet combine by panproto_mig::compose. Given compiled migrations f : R \to S and g : S \to T, the output g \circ f is a compiled migration whose runtime lift is functionally identical to running the two in sequence.
The functoriality invariant of the instance functor reappears at this stage as the engine’s correctness condition:

\mathrm{Inst}(g \circ f) = \mathrm{Inst}(g) \circ \mathrm{Inst}(f)
Panproto’s test suite verifies the invariant on a sampled input space — compute both sides of the equation against the same input, compare record by record — and a discrepancy is a bug in the composition implementation, not in the migrations themselves. Passing is necessary (though not sufficient) for the engine to be a faithful implementation of the framework of Theory morphisms and instance migration.
Identity migrations are what panproto_mig::Migration::identity constructs. Composing any migration with an identity on either side yields the original migration, and the test suite checks this too. Both invariants — composition and identity — are the functor axioms of Functors and natural transformations stated at the implementation level.
A developer may reasonably ask why the engine composes migrations eagerly — producing a single compiled composite — rather than lazily, by running the two separately at lift time. The answer is that eager composition lets the compile stage do work once per composite rather than once per lift, and the per-sort lift functions of the composite are usually simpler than the sequential application of the two components would be: optimisations at compile time can fuse adjacent lifts that cancel each other. For migration chains of any length, eager composition is substantially faster.
Inversion
Not every migration is invertible. A migration that forgets a field cannot be inverted — the forgotten data is not recoverable from its image. A migration whose pushforward freely extends its input fails to be injective in general — multiple distinct sources can map to the same target, and the inverse would have to choose one. panproto_mig::invert computes an inverse when one exists and reports the obstruction when it does not.
A subtle case worth naming is the migration that is invertible at the theory level but not at the instance level. A renaming of fields is invertible as a theory morphism — the reverse renaming is also a valid theory morphism — but its inverse on instances requires the renaming to be a bijection on field names. If two fields get renamed to the same target field, the theory morphism still exists, but the inverse on instances does not. The inversion stage separates these two questions and reports the failure at whichever level it occurs first.
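The bijection question is mechanical, and a toy check shows its shape — independent of panproto’s types, either inverting a renaming or reporting the first collision:

use std::collections::HashMap;

/// Attempt to invert a field renaming. Returns the reverse map, or the
/// first collision: two source fields sent to the same target name.
fn invert_renaming(
    renaming: &HashMap<String, String>,
) -> Result<HashMap<String, String>, (String, String, String)> {
    let mut reverse = HashMap::new();
    for (from, to) in renaming {
        if let Some(prev) = reverse.insert(to.clone(), from.clone()) {
            // prev and from both map to `to`: no instance-level inverse.
            return Err((prev, from.clone(), to.clone()));
        }
    }
    Ok(reverse)
}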
The adjoint pair of the previous chapter classifies the cases. A theory morphism whose pullback \Delta_F is an equivalence of categories yields a migration invertible in both directions at the instance level. A theory morphism whose \Delta_F is faithful but not full yields a migration invertible only on the subset of instances realising the image of the forward lift. Every case panproto handles falls under this classification, and invert dispatches on it at compile time.
The round-trip laws that define lenses in the next chapter turn out to be a restricted form of the invertibility that inversion produces: a lens is a migration equipped with an explicit reverse map plus the laws saying the two are genuinely inverse on the subset of data where inversion is possible. What inversion supplies is the reverse map when it exists; what the lens machinery of the next chapter does is police the laws.
What the decomposition buys
Each of the five stages isolates one mathematically sharp class of failure, and each produces a diagnostic that a developer can act on.
Existence checking catches theory-morphism problems before any instance is touched — a symbol, an equation, a cycle — and reports them at the level of the theory. Compilation catches any remaining pushforward-site problem before records move, and localises the failure to a specific declaration in the migration DSL. Lifting is where data-level failures surface, and they surface as a single record or a single equation rather than a systemic breakdown. Composition and inversion detect their failures through functoriality invariants and the adjunction-based classification of the previous chapter, and each reports the specific witness a fix can be built around.
The alternative would be to have one operation called migrate that produces a result or an error without internal structure. Such an engine would know that a migration is broken without knowing why. The five-stage pipeline, in exchange for the apparent complexity of its internal API, produces diagnostics a working developer can actually use.
There is a secondary benefit that matters in production. Each stage produces an artefact the next stage consumes, so no single representation of the migration has to hold all the data at once. For a migration chain of any length, this matters operationally: the compiled form of the composite is usually smaller than the sum of the compiled forms of its components, and stage-wise optimisations are what produce the compression.
Further reading
The most direct predecessor is the CQL system of Wisnesky (2013), which organises its migration engine into stages corresponding closely to the five above. CQL is implemented in Haskell, panproto in Rust; the mathematical structure is the same. Spivak (2012) and Spivak & Wisnesky (2015) are the categorical foundations.
For the broader setting of bidirectional data transformations, of which the engine’s lift-and-inverse pair is a specific case, Czarnecki et al. (2009) is the survey and Stevens (2010) the classical overview. Both discuss systems addressing variants of the same problem; panproto’s contribution is to apply the functorial-data-migration framework to GAT-based schemas that earlier systems could not handle directly.
Closing
The next chapter turns to bidirectional lenses, which promote the migrations of the present chapter into paired forward–reverse transformations subject to round-trip laws.
Bidirectional lenses
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A lens is a disciplined way of running a transformation in both directions at once. One half, called get, projects a view out of a larger source. The other half, called put, takes a modified view together with the original source and returns a new source incorporating the modification. The discipline comes from a pair of equations, the round-trip laws, which force put to be a genuine partial inverse of get on the subset of data where inversion is possible.
The concept originates in the programming-languages community, where lenses were introduced as the semantic foundation of the Boomerang line of research by Foster et al. (2007), whose paper “Combinators for Bidirectional Tree Transformations” is the reference most subsequent work returns to. The relational-lens variant of Bohannon et al. (2006) predates the general formulation and supplies a database-flavoured version; the broader bidirectional-transformations landscape is surveyed by Czarnecki et al. (2009).
In panproto’s setting, a migration is already half of a lens: the lift function from the previous chapter plays the role of get. What a lens adds is the other half, plus the round-trip laws to hold it honest. Most schema migrations do not invert exactly — they drop fields, collapse records, add data the reverse cannot unambiguously recover — and the lens formulation is the discipline that handles this reality rather than pretending it away.
The definition
An asymmetric lens from a type S (the source) to a type V (the view) is a pair of functions

get : S -> V
put : V × S -> S
subject to two round-trip laws. The source is the richer object; the view is what a consumer sees. put takes the old source because the view alone does not carry enough information to reconstruct the full source, and the old source supplies the missing pieces.
The first law is called GetPut:

put(get(s), s) = s
In words: putting a view that was freshly extracted leaves the source alone. An identity modification of the view is an identity modification of the source.
The second law is PutGet:

get(put(v, s)) = v
In words: getting a view back from a just-put source recovers the modification. Modifications survive the round trip.
Taken together, the two laws force put to be a genuine partial inverse of get on the image of get, while leaving the parts of the source outside that image untouched. Drop either law and the pair loses the inversion property that made it worth treating as a single object.
A third equation, PutPut, is sometimes imposed:

put(v', put(v, s)) = put(v', s)
PutPut says a second modification overwrites the first cleanly rather than compounding it. Lenses satisfying all three equations are called very well-behaved in the terminology of Foster et al. Panproto’s lenses satisfy PutPut on every schema pair where the target theory’s equations do not force otherwise, and panproto_lens::laws checks all three equations against a sampled state space.
A reader might reasonably ask why the laws take this shape rather than some other. The minimal answer is that without them, “bidirectional” is a type and not a behaviour: a pair of functions with matching types but no relation between them is not a lens in any useful sense. GetPut alone is not enough — one could satisfy it trivially by having put return the old source unchanged, which would then fail PutGet everywhere. PutGet alone is not enough either — a put that accurately places views but also perturbs parts of the source unrelated to the view would satisfy PutGet but violate GetPut. The two laws together are minimal and, together, sufficient.
Two examples
The simplest lens is the projection of a pair onto its first component. Take S = (X, Y) and V = X:
get :: (X, Y) -> X
get (x, _) = x
put :: X -> (X, Y) -> (X, Y)
put x' (_, y) = (x', y)
Both laws hold by a one-line case analysis. GetPut: put (get (x, y)) (x, y) = put x (x, y) = (x, y). PutGet: get (put x' (x, y)) = get (x', y) = x'. Every pair projection in a language with standard data types is a lens of this shape, and the standard lens library of Kmett (2012) exposes each of them compositionally; the internal encoding is the continuation-passing representation introduced by van Laarhoven (2009).
A lens closer to this book’s subject is the address-record lens. Let T be the schema with name and email; let S be the reduced schema with only name. A lens from T to S has get return the name-only projection of a T-instance, and has put take a modified S-instance (a new list of names) together with an original T-instance (a list of name–email pairs) and return a new T-instance. For each name in the modified projection that appears in the original, the matching email is carried over; for each name that is new, a placeholder empty email is chosen.
Both round-trip laws hold. GetPut: an unchanged name list reconstructs the same pairs, because each name is present in the original and its email is carried over unchanged. PutGet: the name field of every pair in the new source comes from the modified view, so extracting the name-only view from the new source recovers the modification exactly.
What this example makes visible, and what the pair example already suggested, is the role of the complement. What put does to the email field is determined not by the view alone but by the original source the view came from. The email field is the lens’s complement: the part of the source outside the image of get that put is required to preserve when possible.
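Transcribing the address-record lens into Rust makes the complement behaviour visible in put. This is an illustration, not panproto_lens’s internal representation:

type Source = Vec<(String, String)>; // (name, email) pairs
type View = Vec<String>;             // names only

fn get(s: &Source) -> View {
    s.iter().map(|(name, _)| name.clone()).collect()
}

/// put consults the original source for the complement (the emails):
/// surviving names keep their email, new names get a placeholder.
fn put(v: &View, s: &Source) -> Source {
    v.iter()
        .map(|name| {
            let email = s
                .iter()
                .find(|(n, _)| n == name)
                .map(|(_, e)| e.clone())
                .unwrap_or_default();
            (name.clone(), email)
        })
        .collect()
}

fn main() {
    let s: Source = vec![("alice".into(), "a@x.org".into()), ("bob".into(), "b@x.org".into())];
    assert_eq!(put(&get(&s), &s), s); // GetPut
    let v: View = vec!["alice".into(), "carol".into()];
    assert_eq!(get(&put(&v, &s)), v); // PutGet
}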
Complements
Every lens has a complement, whether or not the complement is made explicit. For the pair projection, the complement of the first component is the second component, preserved across every put. For the address-book lens, the complement of the name list is the collection of email values, preserved wherever the name survives the modification and defaulted wherever the name is new.
The Cambria system of Litt et al. (2020) tracks complements explicitly as a first-class part of every lens. A Cambria lens has three components — a get, a put, and an explicit complement type C such that the source is isomorphic to V × C — and the lens laws reduce to transparent operations on the pair. The move is to make explicit in the types what the Foster et al. formulation left implicit, and to derive the laws as consequences rather than stating them as separate equations.
Panproto follows the Cambria pattern. Every panproto_lens::Lens value carries a complement type as a type parameter, and when the round-trip verifier in panproto_lens::laws finds a law violation, the reported triple names the source state, view state, and complement state that together fail the equation. Explicit complements make violations easier to diagnose, which is one of the main arguments Litt, van Hardenberg, and Henry make for the Cambria design.
Asymmetric and symmetric lenses
The definition given above is asymmetric: the source is the larger object, the view is the smaller one, and put takes the old source as an argument so that the complement can be preserved. A substantial part of the lens literature concerns symmetric lenses, in which neither side is smaller than the other and the round-trip laws preserve a shared complement between them.
The standard definition is that of Hofmann et al. (2011), whose “Symmetric Lenses” paper gives symmetric lenses a complement-indexed operation on both sides and shows they compose. A related line of work beginning with Diskin et al. (2011) replaces the state-based formulation with a delta-based one; the delta-lens framework is the direct precursor of panproto’s migration-level treatment of differences between schemas, and Pacheco et al. (2012) is the follow-up that develops the construction for indexed types. The broader survey is Stevens (2010).
Panproto implements both kinds. Asymmetric lenses in panproto_lens::asymmetric are the common case, used whenever data moves through a schema-version bump. Symmetric lenses in panproto_lens::symmetric are used at protocol boundaries, where two protocols each contribute their own structure and neither is the reference for the other. Cross-protocol translation, covered in Part IV, uses symmetric lenses.
Lenses from migrations
A panproto migration, as the previous chapter developed it, is already the get half of a lens: the lift function takes a source instance and returns a target instance. What a lens adds is put, and the engine’s classification of migrations by their underlying functors is what lets put be constructed automatically in most cases.
A migration whose pushforward is \Sigma-style (a free expansion restricted by a user’s default rule) can be inverted by forgetting the free additions: put reconstructs the original source and replaces any fields from the view that changed. A migration whose pushforward is \Pi-style (a universal selection) may not be invertible in general, because multiple source instances can map to the same target instance. For such a migration the put function uses the complement type to disambiguate, and put becomes a partial function from the pair (view, source) into the source type.
Construction of put from a migration happens in panproto_lens::from_migration. When a migration does not admit a put at all, the crate reports the obstruction at the level of the theory morphism, naming the specific pushforward site responsible. That obstruction is the theoretical analogue of the inversion failures developed in the inversion stage of The restrict/lift pipeline; the two mechanisms classify the same phenomenon at different levels.
The practical consequence of automatic derivation is that a developer using panproto does not write lens code by hand most of the time. They write the migration’s theory morphism, the engine derives the lens, the laws are checked against a sampled state space, and the lens becomes an artefact the developer can inspect if they need to but does not have to produce. The cost of this automation is that it rules out lens constructions that only exist for ad-hoc reasons; the benefit is that the cases panproto handles well are the ones most production migrations actually want.
Classifying round-trip fidelity
Not every coercion is a full isomorphism, and panproto’s type of directed equations classifies them by what round-trip guarantee they give. A CoercionClass is one of four tags on an asymmetric coercion, corresponding to a point on a four-element lattice of information loss. An Iso satisfies both forward(inverse(v)) == v and inverse(forward(s)) == s: it is an isomorphism in the fiber category. A Retraction satisfies only the backward composite inverse(forward(s)) == s, meaning the forward map has a left inverse and is therefore injective; information flows forward without loss but the target type has inhabitants the source does not cover. A Projection is the categorical dual: the forward is a deterministic function of the source but has no inverse, so the result is recomputable without being recoverable. An Opaque coercion carries no round-trip claim at all and is the bottom of the lattice.
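Pictured as a plain enum, with each tag’s claim as a comment; this is a sketch of the classification as described above, not the literal panproto-lens definition.

enum CoercionClass {
    Iso,        // forward(inverse(v)) == v and inverse(forward(s)) == s
    Retraction, // only inverse(forward(s)) == s; forward is injective
    Projection, // forward is deterministic but carries no inverse
    Opaque,     // no round-trip claim; the bottom of the lattice
}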
The classification is declared by the author, and before 0.38 nothing in the engine checked that the declaration was honest. A user could tag a coercion Iso and supply an arbitrary inverse, and the engine would accept it; the round-trip failure surfaced only later, when data moved through systems built on top. panproto-lens now ships sample-based law verification in panproto_lens::coercion_laws. The entry point check_coercion_laws takes a forward expression, an optional inverse, the declared class, a slice of sample inputs, and the variable name under which samples are bound. It returns one CoercionLawViolation per failing sample, distinguishing backward-law failure, forward-law failure, non-determinism (for Projection), missing inverse (for classes that require one), evaluator errors, and any unrecognised future class variant.
A CoercionSampleRegistry indexes sample inputs by primitive value kind, so check_theory can sweep every directed equation in a theory and report its violations in one pass. AutoLensConfig::coercion_law_registry makes the check opt-in during automatic lens generation: when set, coerce anchors whose declared class is falsified by the registry’s samples are dropped from the proposal set before the CSP runs, so an honest morphism is preferred over one that would corrupt data on round-trip. The DSL exposes the same check at compile time through compile_theory_with_law_check, and the CLI exposes it as schema theory check-coercion-laws, intended to run in CI as a lightweight gate against dishonest Iso and Retraction declarations.
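The mechanics of the sample-based check are easy to state in miniature. The sketch below works over plain closures and integers rather than panproto expressions, but the logic (run each composite on every sample, report one violation per failure) is the same shape check_coercion_laws implements.

#[derive(Debug)]
enum Violation { ForwardLaw(i64), BackwardLaw(i64) }

// Check the two Iso laws on every sample; each failing sample yields a violation.
fn check_iso(forward: impl Fn(i64) -> i64,
             inverse: impl Fn(i64) -> i64,
             samples: &[i64]) -> Vec<Violation> {
    let mut out = Vec::new();
    for &s in samples {
        if inverse(forward(s)) != s { out.push(Violation::BackwardLaw(s)); }
        if forward(inverse(s)) != s { out.push(Violation::ForwardLaw(s)); }
    }
    out
}

fn main() {
    // A dishonest Iso: doubling round-trips backward, but halving loses odd
    // values, so the forward law fails on 1 and 3. The honest tag is Retraction.
    let violations = check_iso(|n| n * 2, |n| n / 2, &[0, 1, 2, 3]);
    println!("{violations:?}");
}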
The check is sample-based, and samples prove nothing about inputs that were not tried. A passing result is evidence, not proof; for exhaustive treatment a symbolic law checker is the right next tool, and panproto_lens::symbolic handles the fragment where that is tractable. The sample-based check catches the common case of an inverse that factually does not round-trip on ordinary data, which is the case that was missing before 0.38.
Checking the laws
A lens whose laws hold vacuously in a library that does not verify them is indistinguishable from a lens with a bug. panproto_lens::laws verifies GetPut, PutGet, and PutPut on every constructed lens using property-based testing against a sampled state space, and treats a law violation as a build-time failure when the sampling finds one. For lenses constructed from migrations, the sampled state space is the set of instances of the source schema; for lenses constructed by hand, the developer supplies a state-space generator.
A lens that passes sampled law-checking is not thereby guaranteed to pass on every input. Property-based testing increases confidence without providing proof. panproto_lens::symbolic supplements the sampled check with symbolic law-checking on a restricted fragment of lens specifications — term rewriting rather than evaluation — and proves the laws on that fragment where it applies. The distinction matters in production: a migration running against a few million records benefits from sampled verification, which covers the kinds of records actually present; a migration distributed as a library to users whose data the author cannot see benefits from symbolic verification, which covers every record at the cost of restricting the migrations it can verify.
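In miniature, the three laws are executable predicates over a sampled state space. The sketch below stands in for what panproto_lens::laws automates, with plain closures in place of constructed lenses.

// Check GetPut, PutGet, and PutPut over sampled sources and views.
fn laws_hold<S: Clone + PartialEq, V: Clone + PartialEq>(
    get: impl Fn(&S) -> V,
    put: impl Fn(&S, V) -> S,
    sources: &[S],
    views: &[V],
) -> bool {
    sources.iter().all(|s| {
        put(s, get(s)) == *s                                  // GetPut
            && views.iter().all(|v| {
                get(&put(s, v.clone())) == *v                 // PutGet
                    && views.iter().all(|w| {
                        put(&put(s, v.clone()), w.clone())
                            == put(s, w.clone())              // PutPut
                    })
            })
    })
}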
Further reading
The canonical source for the asymmetric-lens formulation is Foster et al. (2007), which introduces the combinator language, the well-behavedness conditions, and the compositional lens calculus. Foster (2009) is the thesis-length treatment. For the relational specialisation, Bohannon et al. (2006) is the original paper and is worth reading as a well-motivated case study.
For symmetric lenses, Hofmann et al. (2011) is the foundational paper. For the delta-based reformulation that panproto’s migration-level treatment resembles, Diskin et al. (2011) and Pacheco et al. (2012) are the two papers to read. The broader survey landscape is covered by Stevens (2010) and Czarnecki et al. (2009).
For the complement-based treatment panproto follows, Litt et al. (2020) is the source. For the Haskell community’s profunctor-encoded lenses, Pickering et al. (2017) introduces the idea and Clarke et al. (2024) gives the modern categorical account; the van Laarhoven encoding of van Laarhoven (2009) is the representation the lens library of Kmett (2012) builds on.
Closing
The next chapter introduces protolenses, dependent families of lenses that generalise the fixed-source-and-view lens of the present chapter to a family indexed by a class of schemas.
Protolenses
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The lens of the previous chapter mediates between two specific structures, a source and a view. Schema evolution in real systems rarely produces just one pair. A team that uses panproto writes many migrations, each between two versions of the same logical schema, and wants the lens machinery for the whole family rather than one pair at a time.
Panproto calls a schema-indexed family of lenses a protolens. The name is specific to this book and to panproto-lens; the idea has partial analogues in the wider lens literature but is not, so far as we are aware, treated as a first-class object with the shape panproto gives it. The closest published work comprises the profunctor-optics formulation of Pickering et al. (2017) and the categorical treatment of Clarke et al. (2024), together with the delta-lens constructions of Pacheco et al. (2012) and Diskin et al. (2011). Where panproto’s construction coincides with the published work we note the coincidence; where it appears novel we say so.
Most of the mechanical work of Bidirectional lenses carries over pointwise, so the new material in the present chapter is the indexing, the auto-generation of protolenses from the relational structure of a theory morphism, and the cost-driven selection among implementations of a single specification.
The definition
A protolens is a dependent function

ℓ : (S : Sch(P)) → Lens(F(S), G(S))

from schemas S in some panproto protocol P to lenses between two schema-dependent target types F(S) and G(S). The constructions F and G are functors from the category of schemas to a target category (usually Set equipped with some structural shape). Instantiated at a particular schema S, the protolens yields an ordinary lens subject to the GetPut and PutGet laws of Bidirectional lenses.
The language of dependent functions is the right one here because the types on either side of the lens depend on the schema. A lens between “the name field of schema S” and “schema S with name renamed” does not have a fixed source and target; both depend on which S the protolens is being applied to. Encoding this dependence in the protolens’s type is what justifies the name.
The Rust representation is Protolens in panproto-lens. Its two type parameters encode F and G, and its method surface supplies per-schema instantiation together with the round-trip verification inherited from the previous chapter. A ProtolensChain is a composition of protolenses handled pointwise: the chain at schema S is the lens-composition of the constituent protolenses each instantiated at S.
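The dependent-function reading can be made concrete in a few lines. The sketch below is illustrative rather than the Protolens API: the “schema” is just a field name, and applying the family at a schema yields the ordinary lens that schema determines (in the real system the source and view types themselves vary with the index).

use std::collections::HashMap;

type Record = HashMap<String, String>;

struct FieldLens {
    get: Box<dyn Fn(&Record) -> String>,
    put: Box<dyn Fn(&Record, String) -> Record>,
}

// The protolens: a function from the schema index to a lens.
fn field_protolens(field: &str) -> FieldLens {
    let get_field = field.to_string();
    let put_field = field.to_string();
    FieldLens {
        get: Box::new(move |r| r[&get_field].clone()),
        put: Box::new(move |r, v| {
            let mut out = r.clone();
            out.insert(put_field.clone(), v);
            out
        }),
    }
}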
Auto-generation
Writing a schema-indexed lens family by hand, for every pair of related schemas a repository carries, would be prohibitive. The panproto_lens::auto_lens module derives a protolens from the relational structure of the source and target schemas automatically. Given two schemas S and T equipped with a theory morphism M : S → T, the auto-generator constructs a protolens whose instantiation at any pair of concrete schemas respecting M produces a lens with verified round-trip laws.
The algorithm decomposes the theory morphism into its per-sort and per-operation components and chooses an optic for each component. A projection of a field becomes a Lens (one-to-one); a mapping across a repeated structure becomes a Traversal (one-to-many); a choice among alternatives becomes a Prism (one-of-several). The assembled chain of optics is the protolens. Which optic is available at each site is determined by the adjoint structure of the migration functors from Theory morphisms and instance migration: a Σ-style component admits a traversal, a Π-style component admits a prism, a Δ-style component admits a lens. The classification tells the algorithm which optic fits.
Auto-generation can fail. A theory morphism that forgets information a Σ-style pushforward cannot recover is flagged at the same point the inversion stage of The restrict/lift pipeline would flag it, and the diagnostic is carried over. A developer who wants to proceed despite the loss may supply a manual protolens specification, which panproto-lens-dsl accepts in Nickel, JSON, or YAML form.
The consequence for day-to-day use is that a developer writing migrations does not write lens code. The theory morphism is what they author; the engine derives the lens; the round-trip laws are checked; the lens is an artefact the developer can inspect if they need to but does not have to produce. This is the single largest concession panproto makes to the practical cost of using lenses in production: most lenses are too mechanical to be worth hand-writing, and an engine that insists on hand-writing will see its lens machinery go unused.
When the two theories don’t share enough structure for a total morphism, the engine still finds a morphism on the sub-schema they do share. At Stringency::Lenient and above, sources that cannot participate in any naturality-consistent mapping given the currently-seeded anchors are pre-excluded through sources_without_naturality_compatible_targets, and the CSP searches the remaining sub-schema. The predicate checks, for every outgoing edge of a source, that the target has an edge with matching kind, matching label, and either an anchored child or a kind-compatible child. Before this pre-exclusion ran, two schemas whose common vertex kinds (string, object, record) appear on both sides but whose structural shapes disagree would pass the weaker kind-only test and leave the CSP scope too large to solve; the CSP would bail with no candidates even though the shared-name anchors the heuristics had found were adequate for a partial morphism. Making the exclusion predicate naturality-aware is what lets the alignment stack produce a usable lens in that case, emitting the un-aligned portion as the left leg of a span.
Classification of optics
Panproto classifies each site of a protolens specification as one of three optic kinds, and the classification reflects what the edge of the schema at that site looks like.
A prop edge is a one-to-one field in a record. The optic at a prop edge is a Lens, with both get and put total on the source. The projection-of-a-pair example of Bidirectional lenses is the canonical shape; rename operations, field-type changes, and most ordinary field-level transformations produce lens-shaped optics.
An item edge is a one-to-many relation, a field whose value is a collection. The optic at an item edge is a Traversal, a generalisation of the lens in which get returns a collection of views and put takes a collection of modifications together with a source. Traversals satisfy analogues of the lens laws, quantified pointwise across the collection. Mapping over a list, transforming every element of an array, and applying a field-level rename to every row of a table all yield traversals.
A variant edge is a tagged union, a field whose value is one of several alternatives. The optic at a variant edge is a Prism, an asymmetric optic in which get is partial (it returns Some only if the source is the matching branch) and put is total (any branch can be produced). Prisms satisfy partial analogues of the round-trip laws: GetPut holds; PutGet holds on the matching branch and is vacuous elsewhere.
All three kinds compose pointwise. A protolens specification at a composite site — “traverse every person’s list of addresses, then project the street field” — is a chain combining a traversal (for the list) with a lens (for the field), and the combined optic type is computed automatically from the types of the components. Details of composition live in panproto_lens::optic.
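The composite site in quotes can be sketched directly; the types and function names below are illustrative, and panproto_lens::optic holds the real composition rules.

#[derive(Clone)]
struct Address { street: String, city: String }
struct Person { addresses: Vec<Address> }

// Lens at the prop edge: the street field of one address.
fn street_get(a: &Address) -> String { a.street.clone() }
fn street_put(a: &Address, s: String) -> Address { Address { street: s, ..a.clone() } }

// Traversal at the item edge, composed pointwise with the lens: the combined
// optic is a traversal whose views are the streets of every address.
fn streets_get(p: &Person) -> Vec<String> {
    p.addresses.iter().map(street_get).collect()
}
fn streets_put(p: &Person, streets: Vec<String>) -> Person {
    Person {
        addresses: p.addresses.iter().zip(streets)
            .map(|(a, s)| street_put(a, s))
            .collect(),
    }
}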
The three-kind classification is panproto’s specialisation of the broader optic taxonomy in the profunctor-optics literature, which supports many more kinds (Iso, Getter, Setter, Fold, and so on). Panproto uses three because GAT-based schemas produce three kinds of edge, and covering more would add complexity without covering cases the engine encounters.
Symbolic simplification
A protolens specification can often be simplified before it is instantiated. A chain consisting of a trivial projection followed by a put that replaces the projected field reduces to the identity lens; a chain of two successive traversals over a structure that is actually a single-element collection reduces to a lens; a prism into the only branch of a one-element union reduces to a lens. Panproto performs these simplifications symbolically at construction time through panproto_lens::symbolic, before any schema is instantiated.
The simplifier is a term-rewriting system over a finite set of algebraic identities on optics. The identities are soundness-preserving (they never change the pointwise meaning of the protolens) and terminating (repeated application reaches a normal form in a bounded number of steps). A protolens whose normal form is the identity is a runtime no-op; panproto recognises this and elides the protolens from the migration pipeline.
The gains are substantial in fleet-application settings. A migration across a versioned family of schemas produces a chain of protolenses, many of which are trivial at most specific schema pairs. The simplifier drops those before the lift machinery sees them, and the resulting per-record runtime cost is typically a small multiple of the minimum dictated by the non-trivial edits alone.
The cost model
When more than one protolens realises the same user-facing specification, the engine has to choose. panproto_lens::cost attaches a cost to each protolens estimating its runtime expense on a typical instance. A pure projection costs O(1) per source record; a traversal costs O(n) in the size n of the source collection; a prism costs O(1) per case-analysis; composed protolenses have costs that sum their components, adjusted for short-circuits the composition exposes.
The cost model is a heuristic. It does not promise the minimum-cost implementation of a specification, only that it compares alternatives on a consistent basis. Developers who need guaranteed-optimal implementations may override the engine’s choice by supplying a specific protolens, which the engine accepts without further searching.
A more honest framing is that the cost model is calibrated against the workloads panproto has been benchmarked on. For workloads outside that calibration — pathologically large collections, deeply nested traversals — a developer who cares about performance should profile and, if necessary, override. The engine does not claim to be oracular; it claims to be reasonable most of the time.
Applying a protolens across a fleet of schemas
The concrete use case where protolenses pay off most is a population of schemas under a shared protocol, each a version of the same logical schema with small per-version differences. A migration from any one version to any other is a morphism in Sch(P); a protolens respecting the differences between versions instantiates to a concrete lens for any specific pair.
Panproto’s migration engine uses this pattern at protocol boundaries. When an ATProto lexicon evolves across a series of revisions, the engine constructs one protolens covering the family and instantiates it at every pair the repository actually stores. The auto-generation cost is paid once, at the level of the protolens, rather than once per pair of schemas.
The impact is practical. A repository with twenty schema versions has at most one hundred and ninety unordered pairs of versions (20 × 19 / 2 = 190), but only nineteen underlying differences between consecutive lexicon revisions. Hand-written lenses per pair would require one hundred and ninety specifications with no shared structure; a single protolens instantiated per pair requires one specification and cheap instantiations. The savings are in code, in verification time, and in the developer’s attention.
Further reading
Panproto’s protolens construction does not have a direct published antecedent, so far as we are aware. The closest analogues address related problems with different emphases.
Pickering et al. (2017) is the introduction to profunctor optics, which generalise lenses to a family of constructions indexed by a choice of profunctor; the van Laarhoven encoding of van Laarhoven (2009) is the starting point. Clarke et al. (2024) is the modern categorical treatment. The profunctor setting is richer than panproto’s three-optic classification and would in principle subsume it; panproto uses the narrower classification because it covers GAT-based schemas and is simpler to implement.
For the delta-lens side, Pacheco et al. (2012) and Diskin et al. (2011) are the foundational papers. Delta lenses address the bidirectional synchronisation problem with explicit deltas (edits) rather than pre-and-post states; panproto’s lift is effectively a delta application, and reading those papers alongside this chapter makes the conceptual alignment visible.
For the enriched-category-theoretic foundations the profunctor-optics literature builds on, Kelly (1982) is the standard reference. Panproto does not use enriched category theory explicitly, but the reader who wants to understand why optics have the shape they do will want to read Kelly.
Closing
The next chapter closes Part II with protocol colimits, the construction by which two protocols sharing a common sub-protocol combine into a single composite.
Protocol colimits
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Two protocols that share a common sub-protocol combine by pushout in the category of GATs. The composite protocol has every sort and operation of both factors, with sorts and operations in the shared sub-protocol identified along the inclusions. The construction is the same pushout from Colimits and pushouts, applied to a different category.
The chapter has little new mathematics to add beyond that observation. What it owes the reader is a concrete sense of the construction at work on protocols panproto ships, and an honest account of what the universal property buys a developer combining protocols in practice.
Pushouts of protocols
A panproto protocol is a generalised algebraic theory together with auxiliary data, as Protocols as theories, schemas as instances developed. The category of GATs has theory morphisms as its morphisms (from Theory morphisms and instance migration), and this category has pushouts.
Given a span of theory morphisms

f : A → B,   g : A → C

the pushout is a theory P together with morphisms i_B : B → P and i_C : C → P satisfying

i_B ∘ f = i_C ∘ g

and universal among such. Explicitly, P is built as the disjoint union of the sorts, operations, and equations of B and C, with sorts and operations in the image of A identified along f and g; equations from either branch are inherited, and new equations may be imposed at the pushout level.
The Rust implementation lives in panproto_gat::colimit for the theory level and panproto_schema::colimit for the schema level. The type-checker verifies that the resulting theory is well-formed and that the pushout morphisms satisfy the defining equation. A reader familiar with the construction from Colimits and pushouts will recognise this as the same construction given there for Set, applied now in a different category; that the construction transports without modification is exactly the force of stating it at the level of the general category.
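For orientation, a hypothetical call shape; the chapter places the theory-level pushout in panproto_gat::colimit, but the function and field names below are assumptions rather than the crate’s documented surface.

// Hypothetical names throughout; only the module path comes from the text.
use panproto_gat::colimit;

// f : A -> B and g : A -> C are the two legs of the span.
let po = colimit::pushout(&f, &g)?;
// po.theory is the combined theory P; po.left and po.right are the morphisms
// i_B and i_C, and the type-checker verifies i_B ∘ f = i_C ∘ g.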
A combined protocol
A concrete case makes the construction tangible. Let A be a small shared theory of identifiers: a single sort Id with an equation declaring it to be a string of a fixed alphabet. Let B be a relational protocol using identifiers to name tables and columns, and let C be a document protocol — a subset of FHIR, say — using identifiers to name record types and fields. There are obvious inclusions f : A → B and g : A → C sending Id to the identifier sort of each target.
The pushout P of the span is a protocol whose schemas have both relational tables (with identifier-named columns) and document records (with identifier-named fields), where the identifier sort on both halves is the same sort. A schema under P can use a relational identifier where a document identifier is expected, and vice versa, because the two were identified along A.
The technical payoff is that a workflow involving both kinds of schema can be written as a single migration in P, rather than as a pair of migrations in B and C with manual coordination. A developer moving data from a relational store to a document store writes one migration against the combined protocol, and the shared identifier vocabulary ensures names line up across the boundary without hand-written translation.
Pushouts at version boundaries
The more common use of the pushout, operationally, is combining two versions of the same protocol that diverged from a common ancestor.
Consider two teams working independently on extensions of the same address-record schema from Part I’s running example. Team A adds a phone field to the common schema S, producing S_A. Team B, working in parallel, adds a birthdate field to the same S, producing S_B. Both extensions are honest protocol morphisms out of S; they disagree on nothing S declares.
The pushout of the span S_A ← S → S_B is a schema containing both additions: name and email from S, phone from S_A, birthdate from S_B. The two teams proceed independently, each on its own branch, and the pushout merges their work at schema-level granularity. A developer working on an extension declares it as a theory morphism out of the common ancestor; another developer does the same; the two extensions combine by pushout into a single protocol containing both sets of additions, without sequential hand-coordinated merging.
When the two extensions modify the same sort in incompatible ways, the pushout fails to exist. The failure has a diagnosable form: a sort of S whose image in S_A satisfies an equation that contradicts the equation imposed in S_B, reported as “the pushout of these two theory morphisms is not a well-formed GAT: the equation in S_A and the equation in S_B cannot both hold in the quotient.” The report names the specific equations at fault, so the fix can be applied at the smallest site of the conflict rather than at the level of the whole merge.
Universal property
Pushouts are characterised by a universal property, and the characterisation is what makes panproto’s composition reliable across protocols whose authors never coordinated.
Given any other protocol Q with theory morphisms j_B : B → Q and j_C : C → Q satisfying j_B ∘ f = j_C ∘ g, there is a unique theory morphism u : P → Q with u ∘ i_B = j_B and u ∘ i_C = j_C. Any third protocol respecting the common-ancestor agreement of B and C must factor through the pushout. A developer writing code against the pushout is writing code that will work against any further unifying protocol respecting the same common ancestor, without re-translation.
What this buys in practice is forward-compatibility with future combined protocols a developer has not yet thought of. The algebraic overhead of working with the pushout — verifying the identifications along A, propagating equations — is real; the forward-compatibility is what the overhead pays for. The general framework of colimits of theories, of which the pushout is the smallest non-trivial instance, is developed in the institutions tradition beginning with Goguen & Burstall (1992) and characterised programmatically in Goguen (1991).
Instances of a combined protocol
Once a pushout has been computed, schemas under it are models of the composite GAT. An instance of such a schema consists of the records of the B-part, the records of the C-part, and a set of identifications between the two parts at the sorts A declares to be shared.
Instance construction goes through panproto-schema’s builder, and the engine enforces the identifications automatically: an instance assigning two different values to the “same” A-sort record is rejected at build time with a diagnostic naming the offending record. The instance functor of Protocols as theories, schemas as instances extends to the pushout protocol without new machinery; every construction of Part II applies as it did before. Protocol combination does not add new machinery at the model level — it expands the class of schemas and migrations the existing machinery applies to.
Connection to version-control merge
Panproto’s version control is organised around the same pushout construction applied one level down: in the category of schemas under a fixed protocol, rather than in the category of protocols.
A two-branch development with a common-ancestor schema presents a span of schema morphisms, and the three-way merge of panproto_vcs::merge is the pushout of that span in Sch(P) for whatever protocol P the repository is under. The construction of this chapter and the one developed in Merge as pushout are the same idea applied in two different categories — once to protocols here, once to schemas there. A reader who has worked through the present chapter has worked through the merge construction up to the choice of category.
This repetition — the same universal construction appearing in several categories the engine cares about — is one of the main arguments for adopting the category-theoretic framework at all. The pushout machinery was developed once, in a general setting, and it applies to protocols, to schemas, to migrations, and to instance types. Each re-application inherits every theorem about pushouts without having to reprove them.
Further reading
The institutions framework of Goguen & Burstall (1992) is the right mathematical home for protocol composition by pushout, and Goguen (1991) is the accompanying exposition of the broader perspective. The canonical references on colimits from Colimits and pushouts transfer directly.
For the algebraic-specification tradition in which protocol composition sits natively, Sannella & Tarlecki (2012) is the textbook-length account and works through theory combination at length.
Closing
Part II ends here. Part III opens with the expression language the engine uses when a migration needs to compute a value that depends on the contents of a record.
Syntax and semantics
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The migrations of Part II are built around a theory morphism and a choice of pushforward at each extension site. The pushforward is where the engine needs a small programming language: an expression that, given the values visible at a site of the source instance, produces the value required at the target — a default, a computed field, a transformation of an existing value. Part III is about the language in which those expressions are written.
The language, called panproto-expr, is a typed lambda calculus extended with records, lists, pattern matching, and around fifty builtin operations on strings, numbers, collections, and panproto schemas. It is pure, total, deterministic, and serialisable — four properties the engine depends on and which the next two chapters argue for at length. The surface syntax is Haskell-flavoured, not because of a preference for Haskell as such but because a migration expression is usually short, and the Haskell surface is the most compact idiom that still reads as code.
Grammar
A term of the language is one of the following:
e ::= x -- variable
| n | s | b -- literal (number, string, boolean)
| \x -> e -- lambda abstraction
| e1 e2 -- application
| e1 . e2 -- composition
| let x = e1 in e2 -- let binding
| { l1 = e1, ..., ln = en } -- record literal
| e.l -- field access
| [e1, ..., en] -- list literal
| [e1 | x <- e2] -- list comprehension
| case e of p1 -> e1; ... -- pattern match
| builtin args -- builtin operation
The grammar is small enough that most of it needs no comment. Variables and literals do what they do in every language. A lambda abstraction binds its variable in the body and is applied by juxtaposition. Let bindings introduce named intermediates. Composition e1 . e2 is the function-composition operator, pronounced “after”. Record literals and field access work as in Haskell, with the usual dotted syntax.
Two constructs deserve a moment. A list comprehension [e1 | x <- e2] denotes the list obtained by evaluating e1 for each x drawn from e2; it extends to multiple generators and filters in the usual way, familiar from Haskell or from Python’s analogous syntax. Pattern matching deconstructs a value by cases, with patterns that are variables (match anything and bind), wildcards (match and discard), literals (match the value), record patterns (match records with the named fields), or list patterns ([x1, ..., xn] or x : xs).
Numeric literals parse as signed 64-bit integers when they have no fractional part and as 64-bit floats otherwise; string literals are UTF-8 throughout. The grammar above is the concrete syntax the parser in panproto-expr-parser accepts; the abstract-syntax representation used everywhere else in the engine is the Expr enum.
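A few expressions of the kind a pushforward declaration carries, written in the surface syntax above (the field names are invented for illustration):

-- normalise an email drawn from the source record
\r -> toLower (trim r.email)

-- compute a target field from two source fields, naming the intermediate
\r -> let full = concat r.givenName r.familyName in { name = full, email = r.email }

-- transform every element of a repeated field
\r -> [ toUpper tag | tag <- r.tags ]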
The tradition the surface belongs to is the ISWIM family, tracing back to Landin’s “The Next 700 Programming Languages” (1966): lambda-calculus-based expression languages where function abstraction and application carry the work, with a small layer of literals and structured values on top.
Types
Every term has a type. The type language is
T ::= Int | Float | String | Bool -- base types
| T1 -> T2 -- function type
| { l1 : T1, ..., ln : Tn } -- record type
| [T] -- list type
| Schema P | Instance P -- panproto-native types
Four base types cover the scalars a migration expression will commonly need. Function types are the usual arrow. Record types are structural: a record type is determined by its field labels and their types, and two record types with the same fields (in any order) are equal. List types are homogeneous — a list of integers and a list of strings are different types, and a list literal cannot mix them. The two panproto-native types are opaque inside the language; an expression cannot inspect the internal structure of a Schema or an Instance, but both can be passed to builtins that know how to manipulate them.
The type system sits in the polymorphic lineage tracing to Girard’s System F (1972), via the principal-type-scheme theorem of Hindley (1969) and the Hindley-Milner fragment for which Milner (1978) and Damas & Milner (1982) give a decidable inference algorithm. Panproto’s type-checker uses Hindley-Milner rather than full System F; the restriction is enough for migration expressions and keeps type inference decidable and fast.
Type-checking is implemented in panproto_expr::typecheck as a standard bidirectional-style elaboration: checking modes propagate an expected type inward from context, synthesis modes infer a type from the shape of a term and push it outward. Harper (2016) is the modern textbook reference for the relevant typing and evaluation theory, and Pierce (2002) the standard pedagogical source.
Every well-typed term has a principal type the checker produces. A term that fails to check carries an error message naming the specific subterm at fault, produced by the diagnostics in panproto_expr::error. Good error messages are not a decoration here — the expression language is a DSL that working developers write under time pressure, and a diagnostic pointing at the wrong subterm is barely better than no diagnostic at all.
Operational semantics
The evaluator in panproto_expr::eval implements a small-step reduction relation on closed terms, in the style of Plotkin (1975). A term is a value when it matches a value form — a literal, a record of values, a list of values, a closure — and the reduction relation is empty on values. Every non-value term has exactly one reduction step, chosen by a left-to-right, outermost-first evaluation order. The choice is deliberate: left-to-right outermost-first is the eager strategy that keeps intermediate states serialisable and the reduction sequence predictable.
The defining rules are the standard ones. Beta reduction substitutes a value for a bound variable:

(\x -> e) v  -->  e[v/x]

Let reduction does the same:

let x = v in e  -->  e[v/x]

Record projection takes a field:

{ l1 = v1, ..., ln = vn }.li  -->  vi
Pattern-match reduction picks the body of the first clause whose pattern matches the scrutinee, with the pattern’s bound variables substituted into the body. Reduction of a builtin applied to values is whatever the builtin’s specification says: add on two integers returns their sum, concat on two strings returns their concatenation, and so on through the catalogue. Each builtin has a reduction rule in panproto_expr::builtin.
Reduction is deterministic: the same closed term always produces the same reduction trace, because the evaluation order is fixed and no builtin has hidden non-determinism. Running the evaluator on a closed term therefore produces a unique outcome — either a value, when reduction terminates, or a bounded-resource failure, which is the subject of the next chapter.
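A short trace makes the uniqueness concrete. Under the left-to-right, outermost-first order, the closed term below has exactly one reduction sequence, two steps long:

let x = 2 in add x 3
  -->  add 2 3    -- let reduction substitutes the value for x
  -->  5          -- the builtin rule for add on two integers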
Builtins
The language ships around fifty builtins. They fall into four groups, and the catalogue below is reference material rather than reading material; a developer who wants the full list can consult panproto_expr::builtin.
Arithmetic and comparison
add, sub, mul, div, mod, neg, abs, min, max, eq, lt, leq, gt, geq. Each has the standard mathematical meaning on Int and Float operands. Division and modulus on integers follow the Rust standard-library convention: truncation toward zero.
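Concretely, truncation toward zero means a mixed-sign quotient rounds toward zero and the remainder takes the dividend’s sign:

div 7 2      -- 3
div (-7) 2   -- -3, not -4: truncation toward zero
mod (-7) 2   -- -1: the remainder carries the dividend's sign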
Strings
concat, length, toLower, toUpper, trim, split, replace, startsWith, endsWith, contains, format. All are UTF-8-aware and treat strings as sequences of Unicode scalar values. A reader expecting byte-level semantics should be aware that string indexing and length report Unicode scalar counts, not byte counts; byte-level work is done separately through list-of-integer instances.
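A one-line illustration of the convention:

length "héllo"   -- 5: five Unicode scalar values, though the UTF-8 encoding is six bytes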
Lists
map, filter, foldl, foldr, length, head, tail, reverse, sort, zip, unzip, concat, take, drop. The combinators have their standard Haskell meanings; the partiality of head and tail on empty lists is treated as a bounded-resource failure rather than an exception.
Panproto-native
instanceOf tests whether an instance conforms to a given schema. getField and setField access and modify a field of an instance by name. listSchemasUnder enumerates the schemas registered under a protocol. applyMigration applies a compiled migration to an instance. renameField and requireField operate on the instance’s field structure. These are the builtins migrations and field transforms reach for when manipulating panproto values, and each is documented in panproto_expr::builtin.
Every builtin is total on well-typed inputs, modulo the bounded-resource failures covered in the next chapter. A builtin that would fail on ill-typed input fails at the type-check stage, with a diagnostic identifying the specific argument; combined with the type-checker, evaluation becomes a pure total function from closed well-typed terms to values.
Further reading
For the operational-semantics side, Plotkin’s call-by-value/call-by-name technical report (1975, later republished in Theoretical Computer Science) is the foundational reference. Pierce (2002) is the standard textbook treatment of small-step semantics. For the type-system side, Hindley (1969) is the original principal-type theorem; Milner (1978) and Damas & Milner (1982) developed the algorithm the type-checker is a variant of. Harper (2016) covers the whole area — operational semantics, typing, polymorphism — in a single modern reference.
For the ISWIM lineage, Landin (1966) is the foundational source.
Closing
The next chapter introduces the step and depth limits the evaluator enforces, and explains what they buy and what they cost.
Totality and termination
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The language of the previous chapter is a small typed lambda calculus. Nothing in the definition so far prevents a well-typed term from running for a very long time, or from consuming an amount of memory that embarrasses the machine it is running on. For an engine that embeds user-written expressions and runs them against millions of records, that is not acceptable; before a migration touches production data the engine has to know, with some confidence, that every expression in the migration will terminate within a known budget.
The mechanism that provides this confidence is a pair of resource limits: a step limit on the number of reduction steps any one evaluation may take, and a depth limit on the nesting of active calls. Together, the two limits make every well-typed evaluation halt within a known budget, produce a unique outcome, and fit inside a serialisable record that can be replayed across processes. The present chapter explains the limits, the guarantees they purchase, and the trade-offs they impose; the trade-offs against a Turing-complete alternative are the subject of the next chapter.
The two limits
Every evaluation in panproto_expr::eval runs inside a context that carries a steps_remaining counter and a max_depth bound. The counter decrements on every reduction step, and if it reaches zero before a value is produced, the evaluator returns a StepsExhausted error. The depth bound constrains how deeply lambda applications and list comprehensions can nest at any moment, and an application that would exceed the bound returns a DepthExceeded error without reducing further.
Both limits are configurable at the call site. The defaults are set in panproto_expr::eval::Config and are what the migration engine uses when a caller does not specify otherwise. A production caller expecting larger input instances can raise either limit; a caller running user-supplied expressions in an untrusted context can lower them.
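A call-site sketch: only Config, steps_remaining, and max_depth are named by this chapter, so the entry-point name and the construction shape below are assumptions, and the budget numbers are arbitrary.

// Only Config, steps_remaining, and max_depth come from the text;
// eval::run and the struct-literal shape are illustrative assumptions.
use panproto_expr::eval::{self, Config};

let cfg = Config {
    steps_remaining: 100_000, // per-evaluation step budget
    max_depth: 64,            // bound on nesting of active calls
    ..Config::default()
};
match eval::run(&term, &cfg) {
    Ok(value) => println!("value: {value:?}"),
    Err(err) => eprintln!("{err}"), // StepsExhausted or DepthExceeded
}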
The limits are the evaluator’s only side-channel into the environment. Everything else is denied: I/O, mutation, network or filesystem access, thread spawning, callbacks into Rust code outside the fixed set of builtins. Every evaluation is a pure, bounded function from the input closure to an outcome.
Why the language is total and what the limits add
A typed lambda calculus without fixed-point combinators and without general recursion is already total in the sense of Turner (2004): every well-typed term reduces to a value in a finite number of steps. panproto-expr has no let rec, no fix, no self-reference in the term grammar; explicit infinite loops are ruled out by construction.
The limits are there for the implicit blow-ups that still fit inside the typing discipline. A Church-encoded natural number applied to itself can unfold to an exponentially deep term; cleverly constructed beta reductions can produce a term whose size grows polynomially per step. Without the limits, a developer could submit a well-typed term that consumed memory or time far beyond any reasonable budget. With the limits, the evaluator always halts inside the configured budget.
The limits therefore do not make the language total; they bound the total language. A well-typed term reducing to a value in ten steps evaluates to that value whenever the step limit is at least ten. A well-typed term reducing to a value only after ten billion steps never evaluates to that value in practice, regardless of the step limit, and the engine reports StepsExhausted at the outermost redex position when the budget runs out.
A reader might object that a language without recursion cannot express many computations a developer would reasonably want. The objection is true and is the trade-off totality imposes. In practice, migration expressions are short — a handful of builtins composed over a few field accesses — and do not need recursion. Where a migration genuinely needs recursion, the developer writes a Rust function and registers it as a foreign builtin, which sits outside panproto-expr’s total fragment. Very few migrations we have seen in practice require the escape hatch.
Deterministic and serialisable
Two further properties follow from the way the evaluator is written, and both matter in production.
The evaluator is deterministic: the same closed term evaluated under the same limits always produces the same outcome. No builtin has hidden randomness, no reduction step depends on global state, and the fixed evaluation order gives every non-value term exactly one next step. Two machines running the same evaluation produce identical traces, up to the bit representation of intermediate values. Determinism is what lets the engine cache intermediate results — a value computed in one run of a migration can be reused by a subsequent run against the same input.
The evaluator is serialisable: the state of an ongoing evaluation — the current term, the configuration, the counters, the environment — is a serde-encoded record that can be checkpointed, transmitted across a process boundary, and resumed. The serialisation format lives in panproto_expr::eval::State. Serialisability is what lets the engine distribute work across workers: a migration across a million records splits into chunks, each worker handles a chunk, and intermediate states ship between workers without breaking semantics.
How the limits interact with the term forms
A reduction step consumes exactly one unit of budget regardless of the term form. Beta reduction, let reduction, record projection, list element access, pattern match, and builtin application are each a single step. The uniform cost is a design choice: a caller wanting to reason about worst-case budget can multiply the step limit by the maximum work a single step of any term form might do, and every builtin’s single-step work is documented in panproto_expr::builtin.
The depth limit is checked on lambda application and on nested list-comprehension entry. An application of a lambda that is itself the body of another lambda counts as nested; a list comprehension inside another list comprehension counts as nested iteration. Both counts are local to the current evaluation frame, and both are released when the evaluator returns from the inner scope.
Pattern matching and record projection do not count against the depth limit, because neither introduces a new scope. A pattern match that reduces to a body expression replaces the entire match by that body in a single step, and the body is evaluated in the outer frame.
What happens when a limit is exceeded
A StepsExhausted error carries the outermost redex position as a source-location span together with the remaining state at the moment of exhaustion. The migration engine receiving this error reports the specific field transform whose evaluation ran out of budget, and suggests either raising the limit in the caller’s configuration or rewriting the transform.
A DepthExceeded error reports the lambda or list comprehension whose entry would have exceeded the bound. The report is a source-location span and the depth at the time of attempted entry. The common cause is an unintentionally recursive pattern expressed through Church-style encoding, and the error message suggests rewriting with a linear combinator — map, foldl, foldr — that the engine can evaluate without nested recursion. The general recursion-scheme framework those combinators are the simplest cases of is developed in Meijer et al. (1991) and Bird & de Moor (1997); a developer whose expression keeps tripping the depth limit through implicit recursion will benefit from reading either source.
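For instance, a sum that a developer might reach for recursion to write is a single linear combinator in the expression language (the field name is invented):

-- no recursion needed: foldl walks the list in one frame
\r -> foldl add 0 r.scores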
Both errors are fully inspectable in Rust code through panproto_expr::error::EvalError. A caller that wants to retry with a larger budget can do so; a caller that wants to surface the error to a user can format it with the pre-built diagnostic renderer.
Further reading
Turner (2004) is the foundational argument for total functional programming, and the most direct justification of the design choices in this chapter. For the recursion-scheme framework that map, foldl, and foldr are the simplest cases of, Meijer et al. (1991) is the foundational paper and Bird & de Moor (1997) the book-length treatment.
Closing
The next chapter situates the totality guarantees developed here against the main alternatives — Starlark, Dhall, Nickel, CUE — and explains why the engine needs this particular combination rather than one of the nearby options.
Why bounded pure evaluation
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Every choice of DSL in a system as opinionated as panproto invites the question of whether an existing language would have served. The question has a specific answer, and this chapter gives it.
The short version: panproto needs a language that is pure, bounded, deterministic, serialisable, and capable of pattern-matching over panproto’s own schema-indexed types, and no existing candidate satisfies all five. The chapter walks through the requirements, the four closest candidates — Starlark, Dhall, Nickel, CUE — and what each one gives up at the boundary that turned out to matter.
The five requirements
The migration engine embeds a language for one job. At each site of a field transform or pushforward declaration, the engine evaluates a user-written expression that consumes values visible at the site (fields of the current record, possibly a schema handle, possibly an instance value) and produces a value for the target. Evaluation happens inside the engine’s compile stage, outside any user-controlled runtime, and the result is an input to the lift function that will run against every record of the input instance.
Five requirements follow from this context. Each rules out a different class of otherwise reasonable language.
Purity. The language cannot perform I/O. Compile-stage evaluation runs in environments where I/O may not be available or may not be desired, and a language that allows side effects leaks them into places they should not be. Starlark satisfies this; most full-featured scripting languages do not.
Boundedness. Every evaluation must terminate within a configurable budget. A migration that embeds a user expression and runs it across millions of records cannot afford to block on one expression that happens to loop. This rules out Turing-complete languages that cannot make termination a static property.
Determinism. Two evaluations of the same expression on the same input must produce the same output. The engine treats expression evaluation as a pure component of a larger functorial computation, and non-determinism at this layer would propagate to every subsequent stage.
Serialisability. An in-flight evaluation must be suspendable and resumable across process boundaries. Panproto’s batch-migration tooling parallelises evaluation across workers, and lazy evaluation schemes whose suspended state cannot travel cleanly between processes do not work.
Native handling of panproto types. The language must be able to inspect, construct, and pattern-match on schemas and instances without serialising them through an external format on every operation. A language that treats panproto values as opaque data that has to be marshalled in and out on every call pays a translation cost large enough to rule it out for expressions that run millions of times.
No existing configuration or data-transformation DSL meets all five.
Starlark
Starlark is Google’s configuration DSL, developed for Bazel BUILD files and used across several Bazel-adjacent tools. It is deterministic by design, has no I/O in its standard form, and carries a large ecosystem of Python-familiar users. The grammar is a Python subset with mutation disallowed.
Two things rule Starlark out for panproto. Its termination guarantee is structural rather than budgeted: the core dialect forbids while loops and recursion outright, common implementations expose flags that re-enable both, and once they are enabled an evaluation can fail to terminate without reaching any resource bound, which is why Bazel supplements the language with hard limits at the runner level. And Starlark has no native types for schemas or instances. A Starlark-based panproto would have to serialise schemas into Starlark dictionaries on every call and deserialise the result, with no pattern-matching support that would make the cycle pleasant to write.
Dhall
Dhall is a total, strongly typed configuration DSL in the style of the simply typed lambda calculus with extensions. It squarely satisfies purity, totality, determinism, and serialisability, and its design is closest to what panproto needs at the level of the core calculus.
The gap is ergonomic rather than foundational. Dhall’s primary target is configuration, and its standard library is oriented toward producing JSON, YAML, or other text-formatted outputs. It has no notion of types parameterised by a runtime schema, no record-manipulation primitives designed around panproto’s attributed-C-set representation, and no way to compose its user-facing records with panproto’s schema identifiers without an external translation layer. A Dhall-based panproto would work; every migration would involve writing serialisers and deserialisers between Dhall records and panproto instances, which the engine then has to verify for correctness separately.
Nickel
Nickel, developed by Tweag, is a lazily evaluated configuration language with a contract system for runtime validation. It shares Dhall’s purity and determinism, relaxes totality slightly by allowing recursion, and ships a contract system that plays well with JSON-schema-like validations. Panproto’s own lens DSL (panproto-lens-dsl) accepts Nickel as one of its surface syntaxes.
Nickel is a better fit than Dhall for panproto’s external-configuration use cases, and panproto uses it exactly there. For the migration engine’s internal use, two issues arise. Laziness complicates serialisation: a paused evaluation in Nickel is a thunk graph whose full state is not straightforward to move between processes. And Nickel’s contract system is a runtime check rather than a static type check, which means the errors panproto’s engine wants to catch at compile time are visible to Nickel only at evaluation time, leaving the engine to run its own static checks after the fact and defeating part of the reason to use the language.
CUE
CUE is a constraint-based language: values and types are unified in a single lattice, and evaluation means finding the greatest lower bound of a set of constraints. It is pure, deterministic, and its constraint system is genuinely novel for schema validation. Panproto borrows ideas from CUE in its own constraint machinery.
The gap is that CUE is not a function language. Expressing a migration’s pushforward computation as a constraint-solving problem, rather than as a function from input to output, would force the engine to inhabit CUE’s execution model, which does not fit how the migration engine already treats instances and morphisms. Running CUE for validation and something else for computation is a coherent design; replacing panproto-expr wholesale with CUE is not.
What panproto-expr keeps and loses
The language the engine ended up with is tuned for what remains after the requirements have ruled out the alternatives. Step and depth limits replace Turing-completeness with guaranteed termination. The strict eager reduction strategy replaces laziness, at the cost of forcing eager evaluation in some places where a lazier language would have deferred work, but with the benefit that every intermediate state is serialisable. Native handling of panproto’s own types means the language does not double as a general-purpose data-exchange format; Schema and Instance are first-class opaque values.
The losses are real. A user whose pushforward computation genuinely needs a Turing-complete host language cannot express it in panproto-expr; such a user has to write a Rust function, compile it through panproto’s Rust SDK, and register it with the migration engine as a foreign function. A user who wants to configure their migration declaratively without writing evaluated expressions can use the lens DSL in Nickel or JSON form, which compiles through a separate pipeline. Neither loss is hypothetical; both are what the engine’s escape hatches are for.
Further reading
For the totality argument in a broader context, Turner (2004) is foundational. For the lineage of small functional expression languages the panproto-expr surface borrows from, Landin’s ISWIM paper (1966) is the originating reference and Harper (2016) the modern textbook. The documentation for the panproto-lens-dsl crate shows where panproto does use Nickel in production — for declarative lens specifications — and where it does not.
Closing
Part IV opens with Defining a protocol, which catalogues the specific protocols panproto supports and shows how each fits the framework of Part II.
Defining a protocol
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Part IV is where the framework of Part II meets the protocols panproto ships with. The remaining chapters in the part work through specific cases — ATProto, Avro, a relational case study, FHIR, and the tree-sitter-derived protocols — each one an instance of the constructions developed abstractly earlier. The present chapter is the template those cases instantiate: what a protocol registration looks like in Rust, and what obligations the registration imposes on its parser and emitter.
A reader who has followed Protocols as theories, schemas as instances has seen the theoretical content. The present chapter concentrates on what the Rust code actually does.
What a protocol supplies
A protocol supplies four artefacts. A Theory value declares the protocol’s schema language — its sorts, operations, and equations. A parser takes a byte slice in the protocol’s native surface syntax and returns a Schema under the theory. An emitter takes a schema and renders it back into the native syntax. A registry entry binds the three together under a protocol identifier that the engine can look up by name.
All four live in panproto-protocols, organised by category. Serialisation formats live under serialization/; database and storage protocols under database/; document and web formats under web_document/ and domain/; data-science formats under data_science/. Each subdirectory holds one file per protocol plus a mod.rs that collects the registrations.
A toy protocol
The simplest useful protocol is one for a tagged key-value store. The theory has a sort for record tags and a sort for the stored values, with one operation mapping every record to its tag. No equations are imposed beyond well-typedness.
A schema under this protocol fixes a concrete interpretation of both sorts. One schema might choose tags as UTF-8 strings up to 64 bytes and records as JSON values of any shape; another might choose tags as 128-bit UUIDs and records as protobuf-serialised byte strings. Both are schemas under the same protocol; they differ in how they interpret Tag and Record.
In Rust:
#![allow(unused)]
fn main() {
    use panproto_gat::theory::Theory;
    use panproto_schema::protocol::Protocol;
    use panproto_protocols::register;

    // Declare the theory: two sorts and one operation from Record to Tag.
    let theory = Theory::builder()
        .sort("Tag")
        .sort("Record")
        .operation("tag", ("Record",), "Tag")
        .build()
        .expect("the toy theory is well-formed");

    // Attach the parser and emitter (defined elsewhere in the crate)
    // and install the protocol under the identifier "toy.kv".
    let protocol = Protocol::new("toy.kv", theory)
        .with_parser(toy_kv_parser)
        .with_emitter(toy_kv_emitter);
    register(protocol);
}
The theory builder introduces sorts and operations one by one; the Protocol wrapper attaches the parser and emitter; register installs the protocol under the string identifier "toy.kv" for later lookup. Once registered, the protocol is available to every subsequent panproto operation by its identifier, and a developer who writes a schema against "toy.kv" gets type-checking against the theory, a concrete parser for reading existing documents, and an emitter for writing schemas back.
The toy protocol has two sorts and one operation. A real protocol has tens or hundreds of each. The structural shape is the same: declare the theory, attach a parser and an emitter, register under an identifier. What varies between protocols is the complexity of the theory and the subtlety of the parser and emitter, not the shape of the registration itself.
The parser and emitter contract
The parser and emitter are user-supplied functions bound by a small trait. The parser takes a byte slice in the protocol’s native format and returns a Result<Schema, ParseError>; the emitter takes a schema and returns Result<Vec<u8>, EmitError>.
Both are expected to respect the theory they are registered against. A parser returning a schema that fails the theory’s equations is an error in the parser, and the validator in panproto_schema::validate rejects the schema at build time regardless of what the parser thought it was producing.
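The contract is small enough to sketch. The sketch below is illustrative rather than authoritative: the two signatures are the ones this section states, but the trait name ProtocolCodec and the error-type definitions are stand-ins for the real items in panproto.

use panproto_schema::Schema;

// Stand-ins for the real error types, which carry structured diagnostics.
pub struct ParseError(pub String);
pub struct EmitError(pub String);

// An illustrative sketch of the parser/emitter contract. Only the two
// method signatures are fixed by the text above; the trait name is not.
pub trait ProtocolCodec {
    // Read the protocol's native surface syntax into a schema.
    fn parse(&self, input: &[u8]) -> Result<Schema, ParseError>;
    // Render a schema back into the native surface syntax.
    fn emit(&self, schema: &Schema) -> Result<Vec<u8>, EmitError>;
}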
Most parsers aim to satisfy a round-trip law: emit after parse is the identity on the original bytes. The exact form of the law depends on the protocol. For protocols with unambiguous surface syntax the law is literal. For human-edited formats — YAML, SQL DDL, source code — the law is weakened to “parse-then-emit is the identity up to whitespace and comment layout”, and the panproto-io crate supplies a CST complement that captures the remaining bytes outside the theory’s grip. The machinery is developed separately; the parser/emitter trait itself does not require it.
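Stated as code, the literal form of the law is short. The sketch assumes the toy protocol’s toy_kv_parser and toy_kv_emitter from earlier in the chapter (and that the emitter borrows its schema); a test suite would run the check over a corpus of documents.

// Emit after parse is the identity on the original bytes.
fn emit_after_parse_is_identity(input: &[u8]) -> bool {
    match toy_kv_parser(input) {
        Ok(schema) => match toy_kv_emitter(&schema) {
            Ok(bytes) => bytes == input,
            Err(_) => false,
        },
        Err(_) => false,
    }
}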
Parsers for the shipped protocols are implemented by hand against each protocol’s specification. Parsers for programming languages — Python, Rust, TypeScript, and the other 245 tree-sitter-supported languages — are auto-derived from tree-sitter grammars, a process the tree-sitter chapter develops in full.
A real protocol
ATProto is worth reading as a reference. Its registration lives in panproto_protocols::web_document::atproto. The theory declares sorts for lexicons, records, strings, blobs, and the various scalar types ATProto supports; the parser consumes lexicon JSON and produces schemas; the emitter serialises back to JSON. The construction follows the same four-step pattern as the toy protocol, scaled up to the complexity ATProto requires.
The next chapter walks through the scaled-up version in detail. A reader who wants to see how a protocol is defined in practice rather than in the abstract should read it as the immediate follow-up to this one.
Further reading
Sannella & Tarlecki (2012) is the textbook-length reference for the algebraic-specification tradition in which protocol definition sits. For practical references on writing parsers and emitters, the nom parser-combinator crate is what panproto’s hand-written parsers use, and serde is what the emitters use for serialisation.
Closing
The remaining chapters of Part IV document the shipped protocols: ATProto lexicons, Apache Avro, a relational case study, and FHIR as a document case study. A separate chapter on tree-sitter and full-AST parsing explains how programming-language protocols are derived from grammars rather than written by hand.
ATProto lexicons
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The protocol underlying Bluesky and the wider AT network is called ATProto, and the schemas under it are written in a language called Lexicon: JSON documents declaring record types, their fields, and the constraints the fields must satisfy. Reading the Lexicon specification and implementing a protocol for it in panproto was an instructive exercise, because Lexicon leaves rather a lot to convention. Where most serialisation formats state formal rules for schema evolution, Lexicon’s specification is less formal than such rules and more formal than a casual reader might expect. The translation panproto performs has to absorb the informality into explicit equations, and the chapter is about where the translation is clean and where it is not.
The code lives in panproto_protocols::web_document::atproto; the mathematical background lives in Protocols as theories, schemas as instances and Algebraic and generalised algebraic theories.
What a lexicon is
A Lexicon is a JSON document with a tree of definitions. Every record type declared in a lexicon carries a name — a reverse-DNS identifier like app.bsky.feed.post — along with a list of fields with types and constraints, and a set of permitted query operations. The primitive field types are strings (optionally constrained by length or regular expression), integers, booleans, content identifiers, blob references, and nested objects. Arrays hold lists of a given type, and unions are tagged alternatives among several record types.
An ATProto schema, as panproto sees it, is a population of records conforming to one or more lexicons. A record names the lexicon it conforms to, and its field values are required to satisfy the constraints that lexicon imposes. The data model is tree-shaped — records may contain sub-records through nesting — and cross-record references go through content identifiers rather than in-line pointers.
Translation to a GAT
Panproto registers a theory for ATProto with one sort for each primitive type (String, Int, Bool, Cid, Blob), one sort for each array type (Array(T) for each element sort T), and one sort for each lexicon record type present in the loaded lexicons. Operations translate the field accessors: for a record type with a field text : String, the theory has an operation text from the record’s sort to String. Constraints on field values — maximum length, regular-expression pattern, minimum array size — are encoded as equations in the theory, stated in panproto-expr over the record’s fields.
A loaded lexicon becomes a schema under this theory. The sorts and operations are fixed by the theory; the lexicon’s specific choices — which fields exist, which constraints apply, which alternatives a union admits — are the schema’s choices, represented as equations the schema enforces over instances.
What the translation preserves
Record structure transfers field by field. Every field in the lexicon becomes an operation in the schema, with the same name and a corresponding target sort. The JSON form
{"type": "object", "properties": {"text": {"type": "string"}}}
produces an operation text from the containing record sort to String, and the schema carries the operation as one of its interpretations.
Scalar constraints become equations. A maxLength constraint on a string field produces an equation restricting the length of the string operation’s value; a regex pattern produces an equation demanding the string match the pattern. Both equations are evaluated in panproto-expr against the field’s value at instance-build time, and an instance that violates an equation is rejected by the validator.
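As a sketch of where such an equation attaches, the fragment below extends the builder pattern from the toy-protocol chapter. The .equation method and the expression it carries are illustrative assumptions; the real panproto-expr surface syntax is documented in its own chapter.

use panproto_gat::theory::Theory;

fn post_theory() -> Theory {
    Theory::builder()
        .sort("Post")
        .sort("String")
        .operation("text", ("Post",), "String")
        // Hypothetical equation method and expression form: only the
        // attachment point (a length constraint carried as a theory
        // equation) comes from the chapter.
        .equation("len(text(p)) <= 300")
        .build()
        .expect("the post theory is well-formed")
}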
For union types, a lexicon union over record types produces a sum sort whose discriminator is the record-type identifier at each occurrence in the schema.
Where the translation is imperfect
The translation is not always clean. Three places where panproto accepts looseness rather than forcing the schema into a stricter form than the Lexicon spec supports are worth naming.
Cross-lexicon references via CIDs are not enforceable at schema-build time. A field whose value is the CID of a record in a different collection carries no schema-level guarantee that the referenced record exists; Lexicon itself treats CIDs as opaque strings with no referential-integrity constraint beyond optional format checks. Panproto faithfully mirrors this: the Cid sort has no cross-sort equations in the derived theory. A developer who wants referential integrity has to impose it as a separate constraint, checked at migration time or at version-control commit time.
The unknown type, which Lexicon permits as a placeholder for fields whose shape is not yet fixed, maps to a sort whose interpretation is opaque. Instances of that sort carry their serialised JSON through the engine without decomposition. Panproto does not complain about this, but operations depending on the shape of an unknown value will not have a theory-level specification to check against.
Certain Lexicon constraints — referential ones over CIDs, some more elaborate validation rules — exceed what panproto-expr’s equations can faithfully represent. Panproto handles the constraints that map cleanly onto its expression language and flags the rest as unchecked. A schema under the ATProto protocol may therefore pass panproto’s validator while failing a Lexicon constraint a dedicated Lexicon validator would catch; in production use, both validators should run.
A concrete example
The app.bsky.feed.post lexicon declares a record with text, a creation timestamp, and optional language and reply references. Its translation is a schema with five sorts — the post record itself, plus String, DateTime, and the CID-based reference types — and operations for each field. The equations encode the constraints the lexicon JSON states: text is at most 300 characters, createdAt is an ISO 8601 timestamp, langs (when present) is a non-empty array of BCP 47 tags.
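An abridged and lightly simplified form of the lexicon JSON, reduced to the fields discussed here (the published document at atproto.com carries more fields and more constraints):

{
  "lexicon": 1,
  "id": "app.bsky.feed.post",
  "defs": {
    "main": {
      "type": "record",
      "record": {
        "type": "object",
        "required": ["text", "createdAt"],
        "properties": {
          "text": { "type": "string", "maxLength": 300 },
          "createdAt": { "type": "string", "format": "datetime" },
          "langs": { "type": "array", "items": { "type": "string", "format": "language" } }
        }
      }
    }
  }
}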
The parser registered in panproto_protocols::web_document::atproto reads the lexicon JSON, constructs a Schema populated with the sorts, operations, and equations above, and hands it to the engine. The emitter does the inverse: given a schema, it regenerates the JSON form that would round-trip back to the same schema.
Migration across lexicon versions
Lexicons evolve. A field can be added, a constraint can be tightened or loosened, a union can gain or lose an alternative. Each such change is a theory morphism between the two lexicon versions’ schemas, in the sense of Theory morphisms and instance migration. Panproto’s migration engine handles lexicon-to-lexicon migrations through the same pipeline it uses for any protocol, with no ATProto-specific logic: the engine reads the difference between the two lexicon JSON documents, constructs the theory morphism, applies it through the restrict/lift pipeline, and produces a migration whose lift function carries records from the old lexicon to the new one.
Further reading
Bluesky PBC (n.d.) is the authoritative specification, with the Lexicon spec specifically at https://atproto.com/specs/lexicon. The Bluesky application is the largest current deployment and a useful source of example lexicons.
Closing
The next chapter turns to Apache Avro, whose schema-evolution rules are specified formally enough to map onto panproto’s migration primitives without the informalities Lexicon forces.
Apache Avro: schema evolution as migration
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Avro gets schema evolution right in a way few serialisation formats manage, and the rules it states are precise enough to translate directly into panproto’s migration framework. A reader version of a schema is allowed to consume data written under a writer version when the two versions satisfy a small number of concrete rules: which field additions are backward-compatible, which removals are forward-compatible, which type changes cross the compatibility line in each direction. Every rule is stated concretely in the specification. The present chapter works through the rules and shows what each one becomes on the migration side.
Avro is the second of Part IV’s serialisation-format case studies, paired with the previous chapter on ATProto. Where ATProto’s Lexicon specification leaves schema evolution largely to convention, Avro fixes it. The translation into panproto’s framework is correspondingly sharper, and it is worth seeing how sharp it can be when a specification cooperates.
The Rust code is in panproto_protocols::serialization::avro.
What Avro specifies
An Avro schema is a JSON document declaring records, enums, arrays, maps, unions, and primitive types (null, boolean, int, long, float, double, bytes, string). A record is a list of named fields, each typed; field types can be primitives, references to other record types, or compositional types built from the above.
Avro’s schema evolution specification distinguishes three cases. Backward compatibility: a new-version reader can read old-version data (the new schema is a safe extension). Forward compatibility: an old-version reader can read new-version data (the new schema is a safe restriction relative to the old). Full compatibility: both directions hold.
The specification includes concrete rules for when each case applies. Adding a field with a default value is backward-compatible. Removing a field with a default value is forward-compatible. Renaming a field requires the new name to be recorded as an alias; readers look up fields by alias when the primary name does not match. Changing a field’s type is permitted only among compatible type pairs (int to long is fine in one direction; double to float is not).
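A hypothetical writer/reader pair makes the first rule concrete. The second schema adds an email field with a default, so a reader holding it can still consume records written under the first:

{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "name", "type": "string" }
  ]
}

{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string", "default": "" }
  ]
}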
Translation to migration
The rules above are not an informal convention; they are a specification panproto can translate directly. Each case becomes a specific kind of migration in the sense of Theory morphisms and instance migration.
Adding a field with a default value corresponds to a theory morphism whose image extends the source with a new operation, together with a pushforward that supplies the default value for every instance. The migration compiles through panproto_mig::compile like any other; the default becomes the lift function’s constant output for the added field.
Removing a field with a default value goes the other way: a theory morphism whose image drops an operation. On the instance side this is a pullback, which the lift implements as a projection that forgets the removed field. The default value matters on the other side of the migration, when a reader still using the old schema encounters data written under the new one: the new data lacks the field, and the old schema’s declared default fills it in.
A rename maps the old operation name to the new operation name, with every other part of the theory unchanged. The lift function applies the rename on every record. Avro’s aliases are what record the old name so that readers using the new schema can still find the field in old data; panproto’s migration compiler treats alias lists as the symmetric form of this morphism, where both the old name and the new name resolve to the same underlying operation.
For field-type changes among compatible primitives, the theory morphism alters the target sort of an operation. Avro’s rules about which type pairs are compatible correspond to panproto’s existence-checking stage (The restrict/lift pipeline): a theory morphism that maps int to long succeeds at existence checking, since every int value embeds in the long codomain; a morphism that maps double to float fails, since the inverse embedding is not total. Panproto’s diagnostics for such failures reproduce Avro’s compatibility categories at the term level.
Unions
Avro unions are tagged alternatives among several types, with an explicit resolution order for reader-side disambiguation. A union in Avro becomes a panproto sum sort with one injection operation per alternative. The resolution order becomes part of the schema-level equations: the sort’s elimination form takes a union value and dispatches on the tag, and the tag’s ordering is recorded as a priority a reader applies when two types admit the same JSON representation.
Schema evolution on unions follows the general pattern. Adding a new alternative is a theory morphism extending the disjunction; removing an alternative requires the removed branch to be unreachable in the existing data, which panproto’s existence checker verifies by examining the source instance against the reduced sum.
What Avro leaves to the engine
Avro’s “full compatibility” is a conjunction: both backward and forward compatibility hold. Panproto expresses this as a pair of theory morphisms, one in each direction, which together form an isomorphism at the schema level up to the translation of optional defaults. A migration certified full-compatibility in Avro is the round-trippable lens of Bidirectional lenses applied to the relevant pair of schemas.
Avro’s default-value semantics are stated at the field level. Panproto elevates them to the equation level: a field with a default value becomes an operation whose theory carries an equation saying when the field is absent in an instance, the operation returns the default. This equation is evaluated by panproto-expr at record-read time and is part of the schema’s validator, not of the lift function alone.
Further reading
Apache Software Foundation (2024) is the Apache Avro specification and is the authoritative reference for the rules discussed in this chapter. For the broader context of serialisation-format design trade-offs, Kleppmann (2017) chapter 4 (“Encoding and Evolution”) compares Avro, Protobuf, and Thrift and gives the working-developer’s view of the same schema-evolution rules. The relational-lens precursor of the migration pattern used here is Bohannon et al. (2006), though the immediate theoretical backing of panproto’s translation is Theory morphisms and instance migration.
Closing
The next chapter, A relational case study, works through a different protocol family: relational database schemas, where the category-theoretic content is larger (the relational model is closer to Spivak’s original categorical-database treatment) and the migration primitives are denser.
A relational case study
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The relational model is the setting closest to the original categorical-database treatment of Spivak, and the reason is historical: Spivak’s framework was developed as a categorical reformulation of Codd (1970)’s work, which means the translation from relational schemas into panproto’s framework is the least lossy of any protocol family in Part IV. A relational schema is already a small category with finite products, and an instance is already a product-preserving functor into Set. Panproto’s relational protocols follow this tradition, and the present chapter walks through what the translation actually looks like, using Apache Cassandra and a subset of SQL as the concrete cases.
The code lives in panproto_protocols::database.
What a relational schema is
A relational schema in the standard sense is a collection of tables, each with a declared set of columns and column types, plus a collection of foreign-key constraints linking columns of one table to primary keys of another. The rows of an instance populate the tables; foreign-key constraints are satisfied whenever every foreign-key value refers to a row that actually exists in the target table.
The categorical reading of this structure is direct. The schema is a category whose objects are the tables and whose morphisms are the functional dependencies induced by foreign keys plus the projections that send a row to one of its column values. An instance is a product-preserving functor from the schema into Set: it assigns each table to its row-set and each morphism to the corresponding function on rows.
Panproto’s relational protocols represent this structure as a GAT. Each table becomes a sort in the theory. Each column becomes an operation from the table’s sort to the column’s type sort. Each foreign-key constraint becomes an equation saying that a particular operation factors through the target table’s primary key. The theory’s models, in panproto_schema’s sense, are the schemas; a schema’s instances are populations of rows satisfying the equations.
A Cassandra schema
Apache Cassandra (Lakshman & Malik 2010) is a wide-column store with a schema language that differs from SQL in a few well-known ways. Its design lineage runs through the Dynamo key-value store of DeCandia et al. (2007). A Cassandra table has a compound primary key made of a partition key (for distribution across nodes) and a clustering key (for ordering within a partition); secondary indexes are optional; cross-partition joins are not supported at the query layer. The underlying data model is still relational, and the categorical treatment above applies with minor adjustments.
Panproto’s Cassandra protocol registers a theory whose sorts include Table, PartitionKey, ClusteringKey, and one sort per scalar type Cassandra supports (Text, Int, Bigint, Uuid, Timestamp, Blob, and the rest). Operations expose the column accessors and the primary-key components. Equations encode two constraints: that every row’s primary-key components are total (no nulls in a primary key) and that partition-key values are hashed deterministically.
A schema under this protocol fixes the specific tables, columns, and primary-key structures of a given Cassandra keyspace. Loading a Cassandra schema from a CQL CREATE TABLE script goes through the parser in panproto_protocols::database, which produces a Schema value whose sorts match the tables and whose operations match the columns.
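A hypothetical CQL script of the kind the parser consumes:

-- author_id is the partition key; created_at is the clustering key.
CREATE TABLE blog.posts (
    author_id  uuid,
    created_at timestamp,
    title      text,
    body       text,
    PRIMARY KEY ((author_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

The resulting schema has a sort for the posts table, operations for the four columns, and equations marking author_id and created_at as the total primary-key components.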
Migrations across schema versions
Adding a column to a table extends the source theory with a new operation from the table’s sort to the column’s type sort. The migration compiles through the restrict/lift pipeline as any other: a pushforward supplies the column’s default value (or NULL, if the column is declared nullable) for every existing row.
Dropping a column is the dual. The theory morphism removes the operation; on the data side, the lift is a pullback that forgets the column. Data loss is explicit: the dropped column’s values are gone after the migration, and panproto’s inversion stage reports the irreversibility when asked.
A rename maps the old operation name to the new. Unlike Avro, Cassandra does not support column aliases, so the rename is irreversible at the schema level; the migration remains trivial to apply but records no reverse path.
Foreign-key constraints translate to equations in the target theory. The pushforward is empty on the data side (no rows change), but existence checking rejects any source instance that violates the new constraint. The rejection report identifies the specific rows whose foreign-key values do not resolve. The diagnostic is produced by panproto_mig::existence.
A SQL subset
Panproto’s SQL support covers a subset suitable for schema migrations: CREATE TABLE, ALTER TABLE ADD/DROP COLUMN, ALTER TABLE RENAME, and ALTER TABLE ADD/DROP CONSTRAINT. It does not cover trigger, stored-procedure, or view definitions. The subset is wide enough to handle the schema-evolution cases real applications generate and narrow enough that the translation into panproto’s framework is unambiguous.
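A hypothetical script exercising one statement of each supported form (the teams table is assumed to exist already):

CREATE TABLE users (
    id      BIGINT PRIMARY KEY,
    email   VARCHAR(255) NOT NULL,
    team_id BIGINT
);
ALTER TABLE users ADD COLUMN display_name VARCHAR(255) DEFAULT '';
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users ADD CONSTRAINT fk_team
    FOREIGN KEY (team_id) REFERENCES teams (id);
ALTER TABLE users RENAME TO accounts;

Each statement maps onto a migration primitive from earlier in the chapter: the ADD COLUMN is a pushforward with a default, the DROP COLUMN a pullback, and the ADD CONSTRAINT an equation checked by the existence stage.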
The SQL parser registers a separate theory from the Cassandra one, since SQL and Cassandra disagree on which types are primitive (SQL has DECIMAL, Cassandra does not; Cassandra has Counter, SQL does not). Two schemas under the SQL theory can be migrated into each other directly; two schemas under different theories require a theory morphism between the protocols themselves, which is the subject of Protocol colimits.
Further reading
Codd (1970) is the founding paper of the relational model, and is still the most readable single source on what makes a schema relational. Spivak (2012) is the categorical reformulation that panproto’s relational protocols follow directly, and Schultz et al. (2017) extends the framework to accommodate the richer data types a working database needs. Wisnesky (2013) is the companion PhD thesis and the most direct precedent for panproto’s engine; the CQL system that grew out of it is the closest existing tool to the relational side of panproto.
For the specific protocols discussed in this chapter, Lakshman & Malik (2010) is the Cassandra system paper; DeCandia et al. (2007) is the Dynamo paper whose techniques Cassandra and several other NoSQL systems draw on. For SQL as a practical schema language, Kleppmann (2017) chapter 2 (“Data Models and Query Languages”) is the most accessible treatment.
Closing
The next chapter turns to FHIR as a document case study. FHIR stresses the representation in a different direction from the relational protocols here: its schemas are deeply nested, heavily constrained, and versioned irregularly. Panproto handles it through the same mechanisms Parts II and III developed, but the specific translation choices are worth tracing in detail.
FHIR as a document case study
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Fast Healthcare Interoperability Resources — FHIR, in practice — is HL7’s standard for exchanging healthcare records, and it is the schema language that gives panproto’s framework the hardest workout of anything in Part IV. Where relational schemas are flat and constraints are few, FHIR’s schemas are JSON or XML documents with arbitrary-depth nesting, elaborate cardinality constraints, terminology bindings, and conformance profiles that layer on top of base resources to narrow their constraints further. The versioning cadence (R4, R4B, R5, R6) is irregular and not always backward-compatible. A schema language stressing a framework like this is a good test of whether the framework can take the weight.
The code is in panproto_protocols::domain::fhir.
What a FHIR resource looks like
FHIR organises healthcare data into resources, each a typed record whose shape is defined by the specification. A Patient resource has fields for name, birth date, gender, active status, and identifiers; an Observation resource has fields for the observed value, the subject, the time of observation, and the code that classifies what was measured. Fields can be scalars, nested objects, arrays, or references to other resources through a structured URL. Every field has a declared cardinality (min and max occurrences) and, where applicable, a binding to a terminology (a code set from SNOMED CT, LOINC, or another standard vocabulary).
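An abridged Patient resource, following the shape of the specification’s published examples:

{
  "resourceType": "Patient",
  "id": "example",
  "active": true,
  "name": [
    { "family": "Chalmers", "given": ["Peter", "James"] }
  ],
  "gender": "male",
  "birthDate": "1974-12-25"
}

The name field already shows the pattern the rest of the chapter leans on: a nested complex type (HumanName) carried in an array because the field’s cardinality allows multiple occurrences.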
A FHIR profile further constrains a base resource. A US Core Patient profile, for example, is a Patient with additional constraints: the identifier field must be non-empty, the name field must include at least one family name, the birth date must be present. Profiles are themselves FHIR documents in a language called StructureDefinition, and a profiled resource is one that satisfies both the base resource’s constraints and the profile’s additional ones.
Translation to a GAT
The base FHIR resource types become sorts in panproto’s FHIR theory: one sort per resource type (Patient, Observation, Procedure, and the rest), plus one sort per nested complex type (HumanName, Address, Period, and so on), plus sorts for the primitive types FHIR defines (string, boolean, dateTime, code, uri). Operations are the field accessors: for a Patient resource with a name field of type HumanName, the theory has an operation name from Patient to Array(HumanName), since a patient may have multiple names.
Cardinality constraints become equations. A field with minimum cardinality 1 produces an equation demanding the array be non-empty; a field with maximum cardinality 1 produces an equation demanding the array have at most one element (in practice, panproto uses a scalar sort rather than a singleton-array sort in the latter case). Terminology bindings become equations that require a code value to belong to a given value set, checked by panproto-expr against a value-set lookup table the schema carries alongside.
Profiles become theory morphisms. A US Core Patient profile is a theory morphism from the base FHIR theory into a narrower theory whose equations include the extra constraints. Applying the morphism is a pullback: an instance under the narrower profile is an instance under the base with additional guarantees. Going the other way requires a filter that keeps only the base-resource instances satisfying the profile’s constraints.
Cross-resource references
FHIR references are structured identifiers of the form ResourceType/id, sometimes with a version anchor. A reference field in a FHIR resource has a known set of target resource types. Panproto represents reference fields as operations whose codomain is a sum sort over the allowed target types, with the sum-sort tag carrying the resource type of the actual referent.
Referential integrity is not enforced at schema-build time by FHIR itself; a reference may point at a resource that does not exist in the local dataset. Panproto mirrors this: the sum-sort tag records the claimed target type, and the equations do not require the target to be present. A developer who wants referential integrity at migration time applies a separate consistency check, built on top of the instance functor.
Where the translation is imperfect
FHIR extensions are open-ended. Every resource in FHIR may carry extension fields whose shape is not declared in the base resource’s specification. Panproto represents extensions as an opaque JSON-valued sort that preserves them across migrations without decomposition. A migration that needs to transform extension contents must do so through panproto-expr’s field-transform mechanism applied to the extension sort directly.
FHIRPath invariants (expressions attached to resources) form a rich constraint language that panproto’s equations cannot faithfully encode in full generality. Panproto handles the invariants that map cleanly onto panproto-expr (presence checks, cardinality constraints, simple relational comparisons) and flags the rest as unchecked. A schema under the FHIR protocol may pass panproto’s validator while failing a complex FHIRPath invariant the specification requires; for production use, a FHIR-specific validator should run alongside panproto’s.
Resource versioning is per-resource. Some resources (Patient, Observation) have evolved in backward-compatible ways across R4, R4B, R5, and R6; others have had breaking changes. Panproto represents each version’s schema as a distinct schema under the FHIR protocol, and migrations between versions go through the standard pipeline. The engine does not itself determine which version pairs are backward-compatible; the developer marks this explicitly in the migration declaration.
Further reading
HL7 International (2024) is the authoritative specification of the FHIR standard, maintained by HL7; the published resource definitions, profile libraries, and validation machinery live under hl7.org/fhir. For the columnar-storage techniques that make deeply nested schemas like FHIR’s practical to query at scale, Melnik et al. (2010) is the Dremel paper that underlies Apache Parquet; Apache Software Foundation (2024) is the Parquet specification that applies those techniques in the open-source setting. For a working-developer’s view of the design trade-offs in document schemas more broadly, Kleppmann (2017) chapter 2 covers document versus relational models in the context of real systems.
Closing
The next chapter, Tree-sitter and full-AST parsing, closes Part IV with the auto-derivation mechanism: how panproto produces protocols from tree-sitter grammars for the 248 programming languages that do not have hand-written protocol definitions.
Tree-sitter and full-AST parsing
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Writing a protocol by hand for every programming language one might reasonably want to parse is not a serious proposal. Two hundred and forty-eight languages ship with panproto, and we did not hand-write them. They come from tree-sitter grammars, through a mechanical translation that turns a grammar’s node-type metadata into a GAT and a parser. The present chapter explains the translation, what the resulting theories can and cannot express about the language they describe, and when a hand-written upgrade is worth the effort.
Tree-sitter itself is due to Brunsfeld (2018). Its incremental-parsing algorithm descends from Wagner & Graham (1998), and its parsing-expression-grammar support from the packrat-parsing tradition of Ford (2002).
The code is in panproto-parse for the derivation logic and panproto-grammars for the bundled tree-sitter grammars.
What tree-sitter supplies
A tree-sitter grammar is a JavaScript-defined grammar that compiles to a C parser, a node-types.json metadata file, and an optional tags.scm query for identifier scoping. The parser, given source code in the language, produces a concrete-syntax tree whose node kinds are declared in node-types.json. Each node type has a name (like function_definition or binary_expression), a set of named fields (like name, parameters, body for a function), and a type constraint on each field that names which node types may appear there.
This metadata is a GAT in disguise. Each node type becomes a sort; each field becomes an operation from the parent sort to the field’s declared child type; the type constraints on fields become the theory’s well-typedness conditions. Tree-sitter’s grammar format specifies all of this at the metadata level, so the translation is mechanical.
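An abridged node-types.json entry for a Rust function shows the shape (the real entry declares more fields, such as parameters and return_type):

[
  {
    "type": "function_item",
    "named": true,
    "fields": {
      "name": {
        "multiple": false,
        "required": true,
        "types": [{ "type": "identifier", "named": true }]
      },
      "body": {
        "multiple": false,
        "required": true,
        "types": [{ "type": "block", "named": true }]
      }
    }
  }
]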
The translation
Panproto’s panproto_parse module reads node-types.json for a grammar and constructs a Theory whose sorts are the named node types, whose operations are the fields of each node type, and whose equations encode the grammar’s type constraints. A register function binds the theory to a parser built on tree-sitter’s runtime, together with an emitter that walks a schema’s instance and renders the source text.
For Rust, this produces a protocol whose sorts include source_file, function_item, struct_item, expression, identifier, and roughly two hundred others. Operations include function_item.name : function_item -> identifier, struct_item.fields : struct_item -> Array(field_declaration), and so on. A Rust source file parsed through this protocol yields a Schema under the derived theory, with every AST node a typed value under the right sort. The same machinery runs for Python, TypeScript, Go, Java, C, C++, and the other 241 languages in panproto-grammars.
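A minimal sketch of the derivation loop, assuming serde-deserialisable mirrors of the node-types.json shape above and the Theory builder from the toy-protocol chapter. The shipped logic in panproto-parse also handles the multiple/required field metadata, supertypes, and anonymous nodes, all omitted here.

use std::collections::BTreeMap;
use panproto_gat::theory::Theory;

#[derive(serde::Deserialize)]
struct NodeType {
    #[serde(rename = "type")]
    kind: String,
    #[serde(default)]
    fields: BTreeMap<String, FieldInfo>,
}

#[derive(serde::Deserialize)]
struct FieldInfo {
    types: Vec<ChildType>,
}

#[derive(serde::Deserialize)]
struct ChildType {
    #[serde(rename = "type")]
    kind: String,
}

fn derive_theory(node_types_json: &str) -> Theory {
    let nodes: Vec<NodeType> =
        serde_json::from_str(node_types_json).expect("valid node-types.json");
    let mut builder = Theory::builder();
    // One sort per named node type.
    for node in &nodes {
        builder = builder.sort(&node.kind);
    }
    // One operation per field. A field admitting several child types
    // would become an operation into a sum sort in the full derivation;
    // this sketch takes the first declared type.
    for node in &nodes {
        for (field, info) in &node.fields {
            if let Some(child) = info.types.first() {
                builder = builder.operation(field, (node.kind.as_str(),), &child.kind);
            }
        }
    }
    builder.build().expect("grammar metadata yields a well-formed theory")
}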
Interstitial text
A critical design choice is what to do with interstitial text: whitespace, comments, and the other material between the nodes a tree-sitter grammar recognises. A lossless round-trip (parse then emit produces the original bytes) requires preserving this material. Tree-sitter itself does not declare interstitial text as part of the grammar; it is recovered only through the parser’s byte offsets.
Panproto’s derivation captures interstitial text through a companion structure called the CST complement, developed in Format-preserving parsing and related panproto-io modules. The complement is a sidecar record attached to the schema that records every comment and every whitespace run in its original position. When the emitter writes the source text back out, it composes the schema’s instance data with the complement to produce exactly the input bytes, byte-for-byte. The round-trip law is verified in the panproto-io test suite.
What the derived theory can express
The derived theory captures the structural shape of source code. A function’s name, parameter list, return type, and body are all present as operations. A type’s fields, a module’s imports, an expression’s operator and operands are all there. Structural operations over source code (a rename across an entire codebase, or the insertion of a new field into every struct of a given name) are expressible as migrations in the derived protocol.
The derived theory also captures the type constraints tree-sitter itself expresses. A tree-sitter grammar that says a function_item’s parameters field must contain a parameters node produces an equation in the derived theory that rejects any schema whose parameter field is populated with, say, an expression node.
What the derived theory cannot express
A tree-sitter grammar does not declare a language’s type system, its name-resolution rules, or its semantic constraints. A derived theory therefore accepts source code that is syntactically well-formed but semantically invalid: a variable used before declaration, a type mismatch in an operation, a call to a non-existent function. Those classes of error are the job of a language-specific analyser (a Rust compiler, a Python type checker) rather than of a tree-sitter-derived protocol.
For the programmer using panproto, this means that parsing a Rust source file through the Rust protocol produces a schema whose instance passes panproto’s theory-level validation whenever the source is syntactically valid Rust. It does not follow that the code compiles. A developer who wants that guarantee combines panproto’s parse with a call to rustc or a similar language-specific check; panproto is not a replacement for a compiler.
Hand-written protocols as upgrades
For languages whose semantic structure panproto needs to reason about (Rust, when a migration is rewriting trait bounds in a way that requires understanding trait inheritance, for example), a hand-written protocol can be layered on top of the tree-sitter-derived one. The hand-written protocol’s theory imports the derived theory through a theory morphism, then adds new sorts (trait resolution tables, type environments) and operations that encode the semantic structure. Migrations under the hand-written protocol have access to the richer information and can perform transformations the derived protocol alone could not verify.
Panproto’s Rust protocol has this shape today only partially: the sort structure is derived, but a small number of operations (name resolution for use declarations, basic type-environment construction) are added by hand. The expectation is that language-specific semantic protocols will be contributed over time; the derived-from-tree-sitter substrate gives every such contribution a working starting point.
Further reading
Brunsfeld (2018) is the original presentation of tree-sitter and the best overview of its design goals. The incremental-parsing algorithm is in the lineage of Wagner & Graham (1998), and the parsing-expression-grammar tradition it extends is documented in Ford (2002). The tree-sitter project’s documentation at https://tree-sitter.github.io/tree-sitter/ covers the grammar authoring process and the node-types.json schema that panproto’s derivation reads.
For the broader category-theoretic framing of grammars as theories, the institutions framework of Goguen & Burstall (1992) is the formal setting. Panproto’s derivation does not depend on that framework explicitly — the translation is simple enough not to need it — but the underlying identification of a grammar with a specification of a theory is institutional.
Closing
Part IV ends with this chapter. Part V turns to version control and applies Part II’s constructions to a new category: schemas under a fixed protocol rather than protocols themselves.
What git already versions and what it does not
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Nearly every team that writes software uses git, and nearly every such team uses it for the same purpose: versioning the bytes of a source tree, across branches, across time. The internal object model is simple enough that a reader who has never looked inside git is usually surprised at how cleanly it is constructed, and the present chapter lays it out precisely enough to show what git is versioning and what it is not. The gap between those two — what git actually tracks and what a production system often wants tracked — is what the rest of Part V sets out to fill.
Readers familiar with git’s Merkle DAG can skim the first two sections and pick up where the gap is stated. For the deeper account of git’s internals, Chacon & Straub (2014) is canonical.
The three kinds of object
Git stores every piece of repository state as a content-addressed object in the object database. There are three kinds of object in the ordinary case: blobs, trees, and commits. A blob is a byte sequence (the contents of a file, with no filename or permissions attached). A tree is a list of (name, permission, object hash) triples corresponding to the entries of a directory; the object hash in each triple points at either another tree (a subdirectory) or a blob (a file). A commit is a record with a pointer to a tree (the repository state), zero or more parent commit hashes, author metadata, and a commit message.
Every object is identified by the SHA-1 (or SHA-256 in modern git) hash of its contents. The hash is the object’s identity: two blobs with the same bytes are the same object; two trees that list the same entries are the same object. The arrangement is a special case of the Merkle-tree construction due to Merkle (1988). This content-addressing is what makes git’s storage compact (identical files across branches share storage) and its integrity guarantees strong (any corruption of an object’s bytes changes its hash, so references break visibly).
The Merkle DAG
Every commit in a git repository points at a tree (its root directory), and every tree points at blobs and subtrees. Commits also point at their parent commits. The whole graph of objects is therefore a directed acyclic graph whose edges are the pointers between objects; the graph is a Merkle DAG in the sense that every object’s identity is determined by the hashes of the objects it points at.
The Merkle DAG is what makes git’s history immutable by construction. Rewriting any part of the history changes the hashes of every commit from the rewrite point forward. A repository refers to its current state through refs (branch and tag names), each of which is a pointer to a commit hash; a ref’s value can be updated, but the commits the ref points to cannot be altered without producing different hashes and therefore different commits.
What git versions
Git versions byte sequences. More precisely, it versions directory trees of byte sequences, with history tracked at the granularity of commits. A commit encodes the complete state of the repository at a moment in time, together with the lineage that produced it. Merges, branches, tags, diffs, blames, and bisects are all operations on this graph of content-addressed objects.
The operations work well. Merging two branches through a three-way algorithm whose inputs are the two branch tips and their common ancestor produces a merged tree that reflects both branches’ edits wherever those edits commute; where they conflict (both branches modified the same byte range of the same file), the algorithm reports the conflict at the line level and asks a human to resolve it. Diffs are computed on bytes, usually line-by-line with heuristic matching for reordered chunks. Blame traces each line to the commit that introduced it.
All of this operates at the byte level. Git’s idea of the content a repository holds is: arbitrary bytes, organised into named files, organised into named directories.
What git does not version
Three things a working repository depends on are outside the model.
The first is schema. A repository may contain JSON records that conform to some agreed-upon JSON Schema, protobuf messages that conform to a .proto file, or tree-sitter-parsed source code that conforms to a grammar; git sees only the bytes. When the schema changes, git does not know to migrate the data; the developer must do so by hand, usually in a commit that simultaneously updates the schema file and rewrites every affected data file.
The second is interpretation. A CSV file and a TSV file may differ by one byte per row and yet be read differently by every consumer downstream; git has no representation of that difference, and treats the two as unrelated byte sequences. A schema migration that changes a field’s type is visible to git only as a diff to a schema file (if that file is stored at all); the fact that all readers of the repository are now obliged to interpret the affected records differently does not enter git’s accounting.
The third is the lineage that connects one interpretation to the next. A schema’s history is not the history of its file; if the schema file moves, splits, or merges with another, git tracks the file moves but not the semantic lineage of what the file meant. A reader who wants to understand why a field was added in one commit, renamed in a second, and removed in a third cannot trace that history through git’s default tooling without reconstructing it from commit messages, which are unstructured text.
Git versions byte sequences well. Every category of thing a working repository needs beyond byte sequences (schemas, interpretations, the lineage between schema versions) is either not represented at all or represented incidentally as bytes-that-happen-to-be-a-schema-file. The gap is what panproto-vcs fills.
What panproto-vcs adds
Panproto-vcs keeps git’s Merkle DAG structure intact and adds three new object types to the database: schemas (models of a registered GAT, in the sense of Protocols as theories, schemas as instances), migrations (morphisms of models, in the sense of Theory morphisms and instance migration), and instances (records under a given schema, in the sense of the instance functor). The object types are content-addressed by blake3 hashes rather than SHA-1; they form a DAG alongside the commits; and they carry the schema-lineage information git does not.
Commits in panproto-vcs point at trees whose leaves can be any of the new object types in addition to the ordinary blobs. The result is a repository whose history captures both the bytes and the interpretations the bytes are values of. Merges in panproto-vcs run at the schema level, as pushouts in the category of schemas (see Merge as pushout). Diffs and blames apply to schemas as well as to bytes. The next chapter works through the object model in detail.
Further reading
Chacon & Straub (2014) is the canonical reference for git’s internals; its chapter 10 (“Git Internals”) walks through the object model this chapter summarises. Merkle (1988) is the foundational paper on Merkle trees, the data structure git’s content-addressed storage specialises. For the patch-theoretic alternative to git’s byte-level diff model, Roundy (2005) is the Darcs paper; patch theory is a separate tradition panproto does not follow wholesale but whose ambitions overlap with the schematic-version-control framework of this part. Kleppmann (2017) chapter 5 (“Replication”) gives the working-developer’s view of distributed-state consistency, which is the broader engineering setting git and panproto-vcs both sit inside.
Closing
The next chapter, Objects, refs, and the DAG, specifies panproto-vcs’s object types, its hashing scheme, its storage backends, and the reference structure around them. It is the chapter to read alongside panproto_vcs::object, panproto_vcs::store, and panproto_vcs::dag for a working understanding of the implementation.
Objects, refs, and the DAG
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The object database panproto-vcs sits on top of is a Merkle DAG of content-addressed objects, referenced by refs, in the same overall shape as git’s. What differs is the set of object types stored. Where git has three — blob, tree, commit — panproto-vcs has seven, four of which are the new object kinds that carry the schemas, migrations, instances, and protocol definitions the engine uses to track interpretation alongside bytes. The present chapter specifies the seven types, the hashing, the two storage backends, and the ref structure.
The code lives in panproto-vcs.
Object types
The object types are seven.
A blob is a byte sequence. It is the same notion as git’s blob, used for file contents that have no interpretation panproto’s engine knows about. A blob is serialised as its raw bytes with no header.
A tree is a list of (name, permission, object-id, object-kind) tuples. Each entry in a tree points at another object in the database, with the object-kind tag saying which of the seven types the target is. Trees are git’s trees generalised to carry the extra object kinds below.
A commit is a record with parent references, a root tree, author and committer metadata, a message, and a set of schema commits the working tree depends on. The parent references structure the Merkle DAG of commits; the schema-commits field is panproto-specific and names the schemas this commit is written against.
A schema object is a Schema value, serialised through serde. Two schemas that are equal as models of the same protocol’s theory hash to the same blake3 output, which makes schema deduplication work the same way blob deduplication does in git.
A migration object is a Migration value, also serialised through serde. Every migration references the source and target schemas it operates between, and its hash includes those references; a migration between two identical schemas is the same object.
An instance object is a WInstance or FInstance, again through serde. An instance’s hash includes a reference to its schema, so the pair (schema, instance) is uniquely determined.
A protocol object is a registered Theory together with its parser and emitter identifiers. Protocol objects are usually stored once and referenced many times; their hashes identify the exact protocol version.
The seven types are defined in panproto_vcs::object, each as a variant of the Object enum.
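A sketch of the shape, with placeholder payload types standing in for the real Schema, Migration, and instance values; the shipped definitions live in panproto_vcs::object and reference the panproto crates directly.

#[derive(Clone, PartialEq, Eq, Hash)]
pub struct ObjectId(pub [u8; 32]); // a blake3 hash

pub enum ObjectKind { Blob, Tree, Commit, Schema, Migration, Instance, Protocol }

pub struct TreeEntry {
    pub name: String,
    pub permission: u32,
    pub id: ObjectId,
    pub kind: ObjectKind, // which of the seven types the target is
}

pub struct CommitData {
    pub parents: Vec<ObjectId>,
    pub root_tree: ObjectId,
    pub author: String,
    pub committer: String,
    pub message: String,
    pub schema_commits: Vec<ObjectId>, // panproto-specific dependency set
}

// Placeholder payloads; the real variants hold the serde-serialised values.
pub struct SchemaPayload(pub Vec<u8>);
pub struct MigrationPayload(pub Vec<u8>); // references source and target schemas
pub struct InstancePayload(pub Vec<u8>);  // a WInstance or FInstance plus its schema reference
pub struct ProtocolPayload(pub Vec<u8>);  // a Theory plus parser/emitter identifiers

pub enum Object {
    Blob(Vec<u8>),
    Tree(Vec<TreeEntry>),
    Commit(CommitData),
    Schema(SchemaPayload),
    Migration(MigrationPayload),
    Instance(InstancePayload),
    Protocol(ProtocolPayload),
}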
Hashing
Every object is identified by the blake3 hash of its canonical serialisation. Blake3 replaces git’s historical SHA-1 (and the SHA-256 migration git has partially completed) with a hash that is both faster and cryptographically stronger. The choice is not ideological: blake3’s streaming API makes incremental hashing during object construction considerably simpler than the SHA-family APIs, and its speed pays back in repositories that contain many large instance objects.
The canonical serialisation of each object kind is defined in panproto_vcs::hash. Two objects with the same semantic content have the same canonical serialisation, and therefore the same hash. This extends to panproto-specific deduplication: two Schema values that are model-equivalent under their theory produce the same hash, which a raw Rust == check would not guarantee.
Storage backends
Two storage implementations are shipped. The filesystem backend, panproto_vcs::fs_store, writes each object to a file at a path derived from its hash, under a .panproto directory at the repository root. The layout mirrors git’s (xx/xxxxxx...) for cache-locality; objects are stored without compression by default, with a zstd-compressed variant available through a feature flag.
The in-memory backend, panproto_vcs::mem_store, keeps objects in a HashMap keyed by hash. It is used by the WASM build of panproto, by tests that do not need persistence, and by callers who want to assemble a repository state transiently before committing it to the filesystem. Every operation the filesystem backend supports is also supported in memory, so code written against the Store trait works against both.
Both backends are append-only at the object level. An object once written cannot be modified, only deleted by a garbage collection pass (panproto_vcs::gc) that traces from the live refs and removes any object no ref reaches.
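A sketch of the Store abstraction both backends implement, reusing the Object and ObjectId sketches from earlier in the chapter; the method names are illustrative. Note the absence of any update method: objects are written once, and removal happens only through the gc pass.

use std::collections::HashMap;

pub trait Store {
    // The real store derives the id from the canonical serialisation;
    // this sketch takes it as an argument to stay self-contained.
    fn put(&mut self, id: ObjectId, object: Object);
    fn get(&self, id: &ObjectId) -> Option<&Object>;
    fn contains(&self, id: &ObjectId) -> bool;
}

pub struct MemStore {
    objects: HashMap<ObjectId, Object>,
}

impl Store for MemStore {
    fn put(&mut self, id: ObjectId, object: Object) {
        self.objects.entry(id).or_insert(object); // writes never overwrite
    }
    fn get(&self, id: &ObjectId) -> Option<&Object> {
        self.objects.get(id)
    }
    fn contains(&self, id: &ObjectId) -> bool {
        self.objects.contains_key(id)
    }
}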
Refs
A ref is a named pointer to a commit. The ref namespace is organised into branches (refs/heads/<name>), tags (refs/tags/<name>), and an implementation-internal set for panproto-specific state (refs/panproto/<name>). Every ref is a mutable mapping: its value can be updated to point at a different commit, though the commits themselves remain immutable. The ref store is implemented in panproto_vcs::refs.
Panproto-vcs keeps a separate ref family for schema history: refs/panproto/schemas/<protocol>/<schema-name>. Each such ref points at a commit whose root-tree leaves include the latest blessed schema for the given name under the given protocol. This is the mechanism schema diff uses to show the evolution of a specific schema across a repository’s history, independently of the file-tree changes that carried the schema through its lifetime.
The DAG
The DAG of commits (connected by parent edges) mirrors git’s structure exactly. Operations on the DAG (topological walks, common-ancestor computation, reachability) live in panproto_vcs::dag and are implemented with the same algorithms git uses. A reader familiar with git’s git-log --graph or git-merge-base has no new graph-level ideas to learn here.
What differs is the DAG of schemas. Every commit references the schemas its working tree depends on, and every schema object references its protocol (and, for schemas produced by migration, the source schema the migration was applied from). The schema-DAG is therefore a parallel structure to the commit-DAG, with its own topological operations. A three-way merge, the subject of the next chapter, operates on both DAGs simultaneously, with the commit-level merge choosing the common-ancestor commit and the schema-level merge computing the pushout in the category of schemas.
Further reading
Chacon & Straub (2014) chapter 10 covers git’s object database in the detail this chapter’s framing parallels. Merkle (1988) is the original paper on the Merkle-tree construction the content-addressing here specialises. For the BLAKE3 hash function specifically, the BLAKE3 reference at the BLAKE3 GitHub organisation is authoritative; BLAKE3 replaces SHA-1 and SHA-256 for panproto’s object hashes on performance and cryptographic-property grounds that the reference documents.
Closing
The next chapter, Merge as pushout, takes the three-way-merge algorithm apart and shows that it is a pushout in the category of schemas (Colimits and pushouts is the mathematical reference). The chapter after that, Data versioning, works through how panproto-vcs automatically infers migrations from schema diffs and lifts instance data across version boundaries.
Merge as pushout
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A three-way merge in a version-control system is, at heart, a reconciliation of two parallel edits against a common ancestor. Git carries out this reconciliation at the level of bytes, which works when the bytes are what the team cares about and falls apart when the bytes are interpreted as something richer. Panproto-vcs performs the same reconciliation at the level of schemas, and the construction it uses is the pushout of Colimits and pushouts, applied now in the category of schemas under a fixed protocol.
The present chapter explains what the pushout identification means concretely, what it buys over git’s line-based merge, and what happens when the pushout fails to exist. The categorical treatment of merge as pushout has precedents: Mimram & Giusto (2013) develop it for textual patches, and Angiuli et al. (2014) give a homotopy-type-theoretic variant. What panproto’s version adds is the shift of the pushout from the category of patches or types to the category of schemas, where the engineering payoff is visible. The code lives in panproto_vcs::merge.
The three-way span
Two branches of a panproto-vcs repository that diverged from a common ancestor present a span of three schemas: the ancestor schema $A$, the left branch’s schema $L$, and the right branch’s schema $R$, together with schema morphisms

$$L \xleftarrow{\;f\;} A \xrightarrow{\;g\;} R$$

where $f$ and $g$ are the migrations from the ancestor schema into each branch. The span is constructed by panproto_vcs::merge::find_span, which uses panproto_vcs::dag to compute the most-recent-common-ancestor commit and extracts the schemas from that commit and from each branch tip.
The pushout
The pushout of this span in the category of schemas is the schema $P$ together with morphisms $i_L\colon L \to P$ and $i_R\colon R \to P$ satisfying $i_L \circ f = i_R \circ g$, universal among such. By the general construction of Colimits and pushouts, $P$ is obtained from the disjoint union of the sorts and operations of $L$ and $R$, with the sorts and operations in the image of $A$ identified along $f$ and $g$. Equations from either branch are inherited; any new equation the merger wants to add applies to the pushout and is checked there.
Concretely, when the left branch adds a field to a record type and the right branch adds a field to the same record type, the pushout is a schema in which the record type has both fields. The ancestor’s definition of the record type is the shared piece; the two branches’ additions are independent.
The construction is implemented in panproto_schema::colimit, called from panproto_vcs::merge. The result is a schema value written to the object database as any other schema; the merge commit points at a tree whose root schema is this pushout.
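The shape of the construction can be sketched under a drastic simplification: treat a schema as a bare set of field names on a single record type, and a morphism as a renaming of ancestor fields into a branch. Everything here (the Schema and Morphism aliases, the conflict rule) is illustrative; the real panproto_schema::colimit works over sorts, operations, and equations, and its failure cases involve equations rather than bare renames.

use std::collections::{BTreeMap, BTreeSet};

// Hypothetical simplification: schema = set of field names, morphism =
// renaming of ancestor fields into a branch.
type Schema = BTreeSet<String>;
type Morphism = BTreeMap<String, String>; // ancestor field -> branch field

/// Pushout of L <- A -> R: disjoint union of L and R with the images of A
/// identified along f and g. Returns None when the branches rename the same
/// ancestor field differently, standing in for the well-formedness failures
/// discussed later in this chapter.
fn pushout(
    ancestor: &Schema,
    f: &Morphism,
    left: &Schema,
    g: &Morphism,
    right: &Schema,
) -> Option<Schema> {
    let mut merged = Schema::new();
    // The shared part: each ancestor field appears once, under one name.
    for a in ancestor {
        let in_left = f.get(a)?;
        let in_right = g.get(a)?;
        if in_left != in_right {
            return None;
        }
        merged.insert(in_left.clone());
    }
    // Independent additions from each branch survive untouched.
    let left_imgs: BTreeSet<&String> = f.values().collect();
    let right_imgs: BTreeSet<&String> = g.values().collect();
    merged.extend(left.iter().filter(|x| !left_imgs.contains(x)).cloned());
    merged.extend(right.iter().filter(|x| !right_imgs.contains(x)).cloned());
    Some(merged)
}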
Data migration at merge time
The pushout at the schema level is only half the work. The branches carry data under their respective schemas, and that data must be carried into the merged schema. Two lift operations run as part of the merge: one lifting left-branch data along $i_L$, one lifting right-branch data along $i_R$. Both are ordinary migrations through the restrict/lift pipeline; the merge commit’s tree contains the lifted instances, not the originals.
Where the two branches modified the same data (a record’s value changed in both branches), the merge reports a data-level conflict: a triple (record, left-branch value, right-branch value), together with the path in the schema where the divergence occurred, and the user is asked to resolve it. Conflict resolution at the data level is independent of the schema-level merge: the schema-level pushout may succeed even when data-level values conflict, since the schema structure is compatible but specific records disagree. The conflict-free replicated data types of Shapiro et al. (2011) are a separate line of work on automatic conflict resolution that panproto does not adopt wholesale; the pushout-plus-manual-resolution approach taken here trades automatic convergence for finer-grained user control over every disagreement.
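A sketch of the conflict report’s shape, assuming lifted instances simplified to maps from record id to a single value. The real report also carries the schema path, and a fuller check compares each side against the ancestor’s value so that an edit made on only one side wins without a report:

use std::collections::BTreeMap;

type RecordId = u64;
type Value = String;

/// The data-level conflict triple: the record, the value the left branch
/// lifted, and the value the right branch lifted.
struct DataConflict {
    record: RecordId,
    left: Value,
    right: Value,
}

fn data_conflicts(
    lifted_left: &BTreeMap<RecordId, Value>,
    lifted_right: &BTreeMap<RecordId, Value>,
) -> Vec<DataConflict> {
    lifted_left
        .iter()
        .filter_map(|(id, l)| {
            let r = lifted_right.get(id)?;
            // Both branches carry the record and disagree on its value:
            // report it for manual resolution rather than picking a side.
            (l != r).then(|| DataConflict {
                record: *id,
                left: l.clone(),
                right: r.clone(),
            })
        })
        .collect()
}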
When the pushout fails
A pushout may fail to exist. The two branches may modify the same sort or operation in incompatible ways: one branch renames a field, the other adds an equation that references the field under its old name; one branch removes a sort, the other adds a constraint that the sort must satisfy. In such cases the pushout’s universal property cannot be satisfied with a single well-formed schema, and panproto_vcs::merge reports a conflict at the schema level.
The diagnostic is specific. It names the sort or operation at which the pushout fails, the equation in each branch that contradicts the other, and the minimal subset of both branches’ changes that would have to be reconciled for the pushout to exist. A developer sees this as the merge tool asking “the field email in the left branch is being renamed to contactEmail, and the right branch has added a constraint that depends on the name email; which do you want?” with enough structure that a resolution commit (choosing one of the two branches’ versions or a third alternative) can be applied and the merge retried.
What this buys over git’s merge
Git’s merge algorithm operates on lines of text. When two branches both edited the schema definition file, git reports a line-level conflict with no understanding of what schema change each line represents. The developer resolving the conflict sees <<<<<<< markers around a pair of line ranges and is asked to choose between them; what the developer is actually choosing between, most of the time, is two specific schema edits, but git cannot tell the developer that.
Panproto-vcs’s merge operates on schemas. When the same scenario arises, the developer sees a report naming the two schema edits explicitly (the left branch renames a field; the right branch adds an equation referencing the field under its old name) and can reason at the level of the edits rather than the level of the lines. The resolution is a commit that adjusts one of the two schema edits and asks the merge to retry; the result is a working merged schema, not a text file with conflict markers that will need a separate schema-validation pass.
The generalisation is that schema merges are closer to the actual operation the developer is performing than line merges are. Line merges are the limit of schema merges when the schema is “arbitrary bytes”; in every case where the content has more structure, operating at the structure’s level produces a more actionable merge diagnostic.
Further reading
Mimram & Giusto (2013) is the nearest categorical treatment of merge for textual patches; its framework and the one in this chapter differ only in which category the pushout is computed in. Angiuli et al. (2014) gives a homotopy-type-theoretic variant and is worth reading for a different perspective on what “merge” can mean. For the CRDT alternative to pushout-based merging, Shapiro et al. (2011) is the foundational paper; panproto does not adopt CRDTs wholesale but the trade-off with automatic-convergence approaches is worth understanding.
For the general category-theoretic background on pushouts, the references from Colimits and pushouts apply. For the specific form of schema merge used here, the implementation in panproto_vcs::merge is its own reference.
Closing
The next chapter, Data versioning, works through the counterpart on the data side: how panproto-vcs automatically infers a migration between two schema versions without an explicit migration declaration, and how the inferred migration interacts with the lift operations described above.
Data versioning
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A schema change in a panproto-vcs commit leaves the data stored under the old schema stranded until something carries it across. The construction that does the carrying is a migration, in the sense Part II developed, and in most cases the migration is uniquely determined or almost uniquely determined by the diff between the two schemas. Panproto-vcs’s auto-migration infrastructure exploits this to produce the migration without asking the user to declare it. The present chapter walks through the inference, the cases it handles, and the cases it refuses.
The code is split between panproto_vcs::auto_mig for the inference and panproto_vcs::data_mig for its application to stored data.
When auto-migration is invoked
A schema change enters panproto-vcs through a commit that stages a new schema object at a path where an old one already exists. The old schema is retrieved from the previous commit’s tree; the new schema is the one being staged. If the commit does not include an explicit migration declaration, panproto_vcs::auto_mig::infer runs against the pair (old, new) and attempts to produce a migration that carries every record under the old schema to a well-typed record under the new one.
Auto-migration is optional. A user whose migration is non-trivial (a new field whose value requires computation, a renamed field whose values must be rearranged, a dropped field whose data must be preserved elsewhere) supplies the migration declaration explicitly through the migration DSL; auto-migration is skipped when such a declaration is present.
The diff
The inference starts from the schema diff. Given the old schema $S$ and the new schema $S'$, the diff is a structural description of their difference: which sorts are added, which sorts are removed, which operations are added or removed, which equations are added or removed, and which renames have been applied at any of the above levels. The diff is computed by panproto_schema’s differ, which reuses much of the machinery behind panproto-check for breaking-change classification.
From the diff, the inference constructs a candidate theory morphism $m\colon S \to S'$. Every sort and operation of $S$ maps to its correspondent in $S'$, with renames applied where the diff records them and identity mappings elsewhere. The morphism passes existence checking (panproto_mig::existence) whenever the diff contains no contradictions at the theory level.
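As a shape for the discussion that follows, a hypothetical cut-down version of the diff might look like the enum below; the real differ also records operation- and equation-level changes in full detail.

// Hypothetical, cut-down shape of the structural diff; illustrative only.
enum SchemaChange {
    SortAdded(String),
    SortRemoved(String),
    FieldAdded { sort: String, field: String, default: Option<String> },
    FieldRemoved { sort: String, field: String },
    FieldRenamed { sort: String, from: String, to: String },
    EquationAdded(String),
}

type SchemaDiff = Vec<SchemaChange>;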
The pushforward choice
A theory morphism alone is not a migration; a migration fixes a pushforward choice at every extension site. Auto-migration chooses defaults that cover the common cases.
For a field added with a declared default value, the pushforward is a constant lift that supplies the default for every existing record. For a field added with no default, auto-migration refuses to proceed; the user must declare the migration explicitly with a non-trivial pushforward. For a field dropped, the pushforward is the pullback that forgets the field. For a field renamed, the pushforward is the identity lift applied to the field’s values under the new name.
For a sort added, the pushforward is empty on existing data: no records of the new sort appear in the lifted instance, which is consistent with there being no records of the new sort in the source instance. For a sort removed, auto-migration refuses unless the sort is unreachable from the kept sorts in the source schema; a sort referenced by existing records cannot be dropped without explicit user action.
For an equation added, no data-level action is required, but existence checking runs against the source instance to detect any records that would violate the new equation. Records that violate the new equation cause auto-migration to refuse; the user must explicitly declare how to handle the violating records (by deleting them, by rewriting them to conform, by rejecting the migration entirely).
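The case analysis above can be summarised as a single match over the hypothetical SchemaChange from the diff sketch earlier in this chapter; the Pushforward and Refusal names here are illustrative, not the crate’s API:

// Default pushforward per diff case, or a refusal, per the cases above.
enum Pushforward {
    ConstantLift { field: String, default: String }, // field added with default
    Forget { field: String },                        // field dropped
    RenameValues { from: String, to: String },       // field renamed
    EmptyOnExisting { sort: String },                // sort added
    CheckEquation(String),                           // equation added: validate only
}

enum Refusal {
    FieldWithoutDefault(String),
    SortRemovedWithData(String),
    // Raised later, by existence checking against the source instance.
    EquationViolated(String),
}

fn choose_pushforward(change: &SchemaChange) -> Result<Pushforward, Refusal> {
    match change {
        SchemaChange::FieldAdded { field, default: Some(d), .. } => {
            Ok(Pushforward::ConstantLift { field: field.clone(), default: d.clone() })
        }
        SchemaChange::FieldAdded { field, default: None, .. } => {
            Err(Refusal::FieldWithoutDefault(field.clone()))
        }
        SchemaChange::FieldRemoved { field, .. } => {
            Ok(Pushforward::Forget { field: field.clone() })
        }
        SchemaChange::FieldRenamed { from, to, .. } => {
            Ok(Pushforward::RenameValues { from: from.clone(), to: to.clone() })
        }
        SchemaChange::SortAdded(s) => {
            Ok(Pushforward::EmptyOnExisting { sort: s.clone() })
        }
        // Refused unless the sort is unreachable from the kept sorts; that
        // reachability check is elided here.
        SchemaChange::SortRemoved(s) => Err(Refusal::SortRemovedWithData(s.clone())),
        // Existence checking may still refuse if records violate the equation.
        SchemaChange::EquationAdded(e) => Ok(Pushforward::CheckEquation(e.clone())),
    }
}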
Applying the inferred migration
When inference succeeds, the result is a compiled migration in the sense of The restrict/lift pipeline. panproto_vcs::data_mig::apply applies this migration to every instance the commit’s tree points at, lifting each to its new-schema form. The lifted instances are written to the object database; the commit’s tree is rewritten to point at the lifted instances instead of the originals.
The original instances are not deleted. They remain in the object database, referenced by previous commits, and a checkout of those earlier commits retrieves the data in its original form. This is the same preservation git provides for the byte contents of a file across history: old content is not overwritten, only shadowed by newer content in newer commits.
When auto-migration refuses
Auto-migration refuses in specific, reported cases. A field added without a default value produces the diagnostic “schema adds field X without a default; auto-migration cannot populate this field.” A sort removed while data of that sort exists produces “schema removes sort S; auto-migration cannot discard the records of sort S in the source instance without explicit user declaration.” An equation added while the source contains violating records produces “schema adds equation E; auto-migration found records that violate it (the first three are listed).”
In each case, the commit is held until the user responds. The response is either a migration declaration that covers the refused case, an edit to the staged schema that avoids the refusal (for example, adding a default value to the new field), or an edit to the source data that removes the obstruction (deleting the violating records manually). Panproto-vcs does not apply a partial migration; either the whole commit lifts cleanly or the commit does not land.
Further reading
For the auto-migration machinery in the abstract, the references from Theory morphisms and instance migration apply: Spivak (2012) for the functorial-data-migration framework, Wisnesky (2013) for the implementation tradition. The diff-based approach panproto-vcs uses is complementary to the CRDT-based approach of Shapiro et al. (2011); the trade-off is that panproto surfaces conflicts to the user rather than resolving them automatically, at the cost of requiring user attention at merge time.
For the Cambria perspective specifically, Litt et al. (2020) is the foundational source; panproto adopts Cambria’s complement-tracking approach in its lens machinery (see Bidirectional lenses) but applies it differently in the data-versioning setting.
Closing
The next chapter, The git bridge, covers the interop layer between panproto-vcs and git: how a panproto-vcs repository can be pushed to and pulled from an ordinary git remote, what is preserved on the round trip, and when the two representations diverge.
The git bridge
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A version-control system adopted by a team has to fit into the tooling that team already uses, and for most teams today that means git. A panproto-vcs repository under active development needs to be pushable to a GitHub or GitLab remote running ordinary git, reviewable through git-aware tooling, and pullable back into a panproto-vcs working copy afterwards without losing the schema-level history the push and the review cannot carry. The bridge that makes this work is the subject of the present chapter: panproto-git supplies the bidirectional translation of repository state, and the panproto-git-remote helper — installed as git-remote-panproto — lets git itself speak to a panproto-vcs remote through git push and git fetch.
Two translations
The bridge performs two translations. The import translation takes a git repository (either local or fetched from a remote) and produces a panproto-vcs repository whose object database includes every git commit as a panproto-vcs commit, every git tree as a panproto-vcs tree, and every git blob as a panproto-vcs blob. Schemas, migrations, and instances the git repository did not encode remain absent; an imported git history is an ordinary byte-tree history, not a panproto-enhanced one.
The export translation goes the other way: a panproto-vcs repository is projected onto a git repository by dropping the schema, migration, and instance objects and keeping only the byte-level objects. The export is lossy by construction; what survives is the tree of ordinary files, with schemas serialised to their on-disk form (JSON for ATProto lexicons, Avro IDL for Avro schemas, and so on), migrations serialised to their declaration syntax, and instances serialised through each protocol’s emitter.
The two translations compose on the byte level: importing an exported panproto-vcs repository yields back the same byte history, modulo the on-disk representation of schemas and migrations. They do not compose on the panproto-specific level: importing an exported repository loses the schema-level identifications, since the export has thrown away the schema object kind and produced bytes-on-disk representations that the import layer does not re-identify. A round trip therefore preserves bytes but not the enhanced schema history unless the importer is told explicitly how to re-parse the disk representations.
Functoriality
Both translations are functors in the categorical sense. The import functor sends every git commit DAG to a panproto-vcs commit DAG in a way that respects parent relationships and preserves content-addressing: two git commits with the same hash map to panproto-vcs commits with the same hash (modulo the SHA-1-to-BLAKE3 conversion, which is applied through a deterministic per-object re-hashing pass). The export functor respects the same relationships in the opposite direction.
The functor laws of the Functors chapter apply here. Importing a chain of git commits one at a time and importing the whole chain in a single pass produce the same panproto-vcs state, which is the composition axiom. Importing an empty git history produces an empty panproto-vcs state, which is the identity axiom. The same holds for the export functor in its direction. The tests in panproto-git verify both laws on a standard suite of representative git histories.
The git remote helper
panproto-git-remote ships a git remote-helper binary called git-remote-panproto. Git’s remote-helper protocol lets third parties define new transports by shipping an executable named git-remote-<scheme> that speaks the helper protocol on its stdin and stdout. When the user runs git push panproto::/path/to/panproto-vcs-repo main, git invokes git-remote-panproto with the URL and the standard helper commands, and git-remote-panproto translates each command into the corresponding panproto-vcs operation through the crate’s API. The legacy cospan:: scheme is still accepted as an alias.
The result is that a panproto-vcs repository appears to git as an ordinary remote. git push panproto::/path/to/repo main sends the local main branch to the panproto-vcs repository, which runs the import translation on the branch’s commits as they arrive. git fetch panproto::/path/to/repo retrieves panproto-vcs commits and makes them available as git refs, with the export translation running as commits cross the boundary.
The helper binary is installed through cargo install panproto-git-remote. Once installed, git picks it up automatically for any URL of the form panproto::... (or cospan::...).
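The helper’s core is a line-oriented command loop on stdin and stdout. A heavily stubbed sketch of that loop follows, with the panproto-vcs calls replaced by comments; the command names follow git’s documented remote-helper protocol, and everything else is illustrative:

use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut stdout = io::stdout();
    for line in stdin.lock().lines() {
        let line = line?;
        let words: Vec<&str> = line.split_whitespace().collect();
        match words.as_slice() {
            // git asks what the helper supports before issuing anything else.
            ["capabilities"] => writeln!(stdout, "fetch\npush\n")?,
            // "list" (and "list for-push") enumerates the remote's refs.
            ["list", ..] => {
                // stub: a real helper reads these from the panproto-vcs ref store
                writeln!(stdout, "@refs/heads/main HEAD")?;
                writeln!(stdout)?;
            }
            ["fetch", _sha, _name] => {
                // stub: run the export translation and emit git objects
                writeln!(stdout)?;
            }
            ["push", refspec] => {
                // stub: run the import translation, then report per-ref status
                let dst = refspec.split(':').nth(1).unwrap_or(*refspec);
                writeln!(stdout, "ok {dst}")?;
                writeln!(stdout)?;
            }
            [] => continue, // blank line separates command batches
            _ => break,     // unknown command: a real helper reports an error
        }
        stdout.flush()?;
    }
    Ok(())
}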
What the bridge preserves
On the import side, everything git records survives: commits, trees, blobs, refs, authorship metadata, messages. The content-addressing is translated from SHA-1 to BLAKE3, which produces different hash values but preserves the DAG structure.
On the export side, only the byte-level content survives. Schemas, migrations, and instances are serialised to their disk representations and written as files; the panproto-specific object kinds are no longer addressable. A git reader who picks up the exported repository has access to the bytes but not to the schema-level operations.
A panproto-vcs repository that wants to use git as a remote for collaboration typically ensures that its schemas are exported in a format the collaborators can read directly (ATProto lexicons as JSON, for example). A panproto-vcs repository that uses git only for backup can ignore the export concerns: the exported repository is decoration, and the canonical form is the panproto-vcs database on the original machine.
What the bridge does not preserve
Schema-level merges. A three-way merge that succeeds as a pushout in the panproto-vcs setting may not have a byte-level representation that git can merge cleanly. When the export translation runs on a panproto-vcs merge commit, it produces a git commit whose tree reflects the merged schema’s byte form but whose diff against the parents is at the byte level. A git user pulling the exported repository sees a merge commit that touched a schema file and may find that git’s three-way line merge on the schema file’s bytes produces conflict markers; the schema-level resolution panproto-vcs reached is not visible as a git-level merge strategy.
The common mitigation is to encode the panproto-vcs merge as a pair of sequential commits in the exported git history: the left branch is merged first, producing an intermediate byte-level state; the right branch is merged against that intermediate state. This gives git a clean byte-level history even when the panproto-vcs merge operated at the schema level. The exporter does this automatically through a flag on panproto_git::export.
Further reading
For git’s own extensibility mechanisms that the bridge uses, the git-remote-helpers documentation is the authoritative reference. Chacon & Straub (2014) covers git’s object model at the depth the bridge depends on; understanding how git stores commits, trees, and blobs is a prerequisite for understanding what the bridge has to translate and what it cannot.
For the deeper question of what can and cannot be preserved in a bidirectional translation between two version-control systems with different object models, the bidirectional-transformations literature (Bidirectional lenses) applies. The git bridge is itself a lens in the categorical sense: a get (export to git) and a put (import from git) with the round-trip laws documented earlier in this chapter.
Closing
Part V ends with this chapter. Part VI turns to the operational layer: the WebAssembly boundary through which every non-Rust client interacts with panproto, the Rust SDK, the TypeScript SDK, the Python SDK, and the CLI.
The WebAssembly boundary
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The engine is written in Rust, and the non-Rust clients — the TypeScript SDK for browsers and Node, and formerly the WASM-backed Python SDK, since deprecated in favour of native wheels — cannot share memory with Rust or respect Rust’s ownership discipline from the outside. What they can do is call into a WebAssembly module (Haas et al. 2017) that holds the Rust values internally and exposes a small API that exchanges opaque handles and serialised data. The module is panproto-wasm, built with wasm-bindgen. The present chapter explains the shape of the API and the reasons it takes the shape it does.
Handles, not pointers
Rust types like Schema and Migration carry internal references, lifetimes, and non-trivial invariants. Exposing them directly to a non-Rust caller would require the caller to respect Rust’s ownership discipline from outside Rust, which is not a contract WebAssembly can enforce.
The boundary sidesteps the problem by making every Rust value opaque. When a caller constructs a schema through the WASM module, what comes back is a u32 handle, not a pointer. The handle is an index into a slab allocator that lives inside the WASM module; the Rust value it points at never leaves the module. Subsequent calls that operate on the schema pass the handle back in, and the module dereferences it internally to retrieve the Rust value.
The slab allocator is implemented in panproto_wasm::handle. Each slot is a tagged-union Object that can hold any of panproto’s major types. Allocation hands out the lowest free index; deallocation returns the index to the free list. Freed indices are reused, which is safe because a handle is released only by the caller, through an explicit drop_handle call rather than GC across the boundary: a caller frees a handle only once it is finished with it, so a reused index never aliases a live handle.
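A sketch of the slab’s shape, assuming a two-variant stand-in for the real tagged union:

// Hypothetical stand-in: the real Object covers every major panproto type.
enum Object {
    Schema(String),
    Migration(String),
}

struct Slab {
    slots: Vec<Option<Object>>,
    free: Vec<u32>, // indices returned by drop_handle, reused first
}

impl Slab {
    fn alloc(&mut self, obj: Object) -> u32 {
        match self.free.pop() {
            Some(i) => {
                self.slots[i as usize] = Some(obj);
                i
            }
            None => {
                self.slots.push(Some(obj));
                (self.slots.len() - 1) as u32
            }
        }
    }

    fn get(&self, handle: u32) -> Option<&Object> {
        self.slots.get(handle as usize)?.as_ref()
    }

    /// The explicit release the chapter describes: the caller frees handles
    /// it is finished with; nothing is collected across the boundary.
    fn drop_handle(&mut self, handle: u32) {
        if let Some(slot) = self.slots.get_mut(handle as usize) {
            if slot.take().is_some() {
                self.free.push(handle);
            }
        }
    }
}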
MessagePack for data
Values the caller does need to inspect (field values, record contents, error diagnostics) cross the boundary as MessagePack (Furuhashi 2024) byte sequences. MessagePack was chosen over JSON (including schema-constrained JSON in the sense of JSON Schema Organization 2022) for compactness on large instance payloads, and over Protocol Buffers (Google 2024) for the absence of a pre-negotiated schema: the contents of a MessagePack message need not be declared in a .proto file ahead of time. Panproto’s own schemas are the things being exchanged, which rules out any representation that would require them to be known to both sides in advance.
Serialisation into MessagePack uses serde on the Rust side and @msgpack/msgpack on the TypeScript side. The two libraries agree on the wire encoding, so a value serialised on one side deserialises faithfully on the other. The TypeScript types that model panproto’s values are generated from the Rust type definitions through a short build step that sdk/typescript runs as part of its prepublish.
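A round-trip sketch on the Rust side, using the rmp-serde crate and a hypothetical FieldValue payload type:

use serde::{Deserialize, Serialize};

// Hypothetical stand-in for the payloads that cross the boundary.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct FieldValue {
    name: String,
    value: i64,
}

fn main() -> Result<(), rmp_serde::encode::Error> {
    let v = FieldValue { name: "age".into(), value: 41 };
    // Rust -> bytes: what the WASM module hands back to the caller.
    let bytes = rmp_serde::to_vec(&v)?;
    // bytes -> Rust: what the module does with caller-supplied payloads.
    let back: FieldValue = rmp_serde::from_slice(&bytes).expect("round-trip");
    assert_eq!(v, back);
    Ok(())
}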
The entry-point surface
panproto-wasm exposes approximately fifty entry points. They cluster into six groups by functional area: schema construction and validation, migration construction and compilation, instance manipulation, lens construction and law-checking, VCS operations (commit, branch, merge), and protocol registration. Each group maps to a submodule of the crate, and each submodule’s functions follow the same pattern: the arguments are a combination of u32 handles and MessagePack byte slices; the return value is either a new handle, a MessagePack byte slice, or a structured error.
A full catalogue of the entry points is in sdk/typescript’s TypeScript definitions, which are the source-of-truth mapping between the JavaScript-visible function names and the Rust functions they bind to.
Errors
Every entry point returns a Result<T, PanprotoError>. A successful call returns the output; a failed call returns a structured error with a kind discriminator and a human-readable message, serialised as MessagePack. The TypeScript wrapper deserialises errors into JavaScript Error subclasses that preserve the kind and message, so a caller can pattern-match on the specific failure (a parse error, a type-check error, an existence-check failure, a lens-law violation) without parsing strings.
Sizing
The WASM module compiles to about three megabytes after wasm-opt passes and gzip compression. Tree-sitter grammars are the dominant size contributor; a build that disables panproto-parse through the feature flags comes down to about one megabyte. Both sizes are within the budget a typical web application allocates for a first-party module, and the lazy-load pattern (load the module only when a user action requires it) is the common deployment strategy.
What the TypeScript SDK adds on top
The TypeScript SDK (@panproto/core, covered in the TypeScript SDK chapter) layers three things on the WASM boundary: ergonomic wrappers that hide handle manipulation behind ordinary JavaScript object lifetimes; MessagePack serialisation and deserialisation with type-safe bindings generated from the Rust types; and a fluent builder API for common operations (schema construction, migration composition) that would otherwise require many discrete WASM calls.
Further reading
Haas et al. (2017) is the original paper introducing WebAssembly and the clearest source on what the platform guarantees and what it does not. W3C WebAssembly Working Group (2019) is the normative W3C specification. For the Rust-specific machinery the panproto-wasm crate uses, the wasm-bindgen documentation is authoritative; Rust and WebAssembly Working Group (2024) is the BibTeX entry for the project itself. For MessagePack’s wire format, Furuhashi (2024) is the specification.
Closing
The next chapter, the Rust SDK, turns to the cross-language core: panproto-core, the facade crate that re-exports every subsystem’s public API and gates the optional ones behind Cargo features.
The Rust SDK
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A working Rust project using panproto depends on one crate, panproto-core, and gets a facade that re-exports every subsystem worth using from one place. The present chapter is a working guide to the facade — what it exposes, what the feature flags gate, and what idioms recur in code written against it. Readers familiar with panproto’s architecture from Parts II and V will find it short.
The facade
panproto-core depends on and re-exports the major panproto subsystems: panproto-gat for theories, panproto-schema for schemas, panproto-inst for instances, panproto-mig for migrations, panproto-lens for bidirectional transformations, panproto-protocols for the registered protocols, and panproto-expr for the expression language. A use panproto_core::prelude::*; at the top of a file imports the most commonly used types in a form ready for working code.
The crate’s purpose is to save a caller from the ceremony of depending on seven separate sub-crates and keeping their versions aligned. A caller depends on panproto-core with a single version pin; the facade crate in turn pins the sub-crates consistently. Breaking-change propagation follows the semver policy documented in the CI chapter.
Feature flags
Several subsystems are gated behind Cargo features, so that a caller who does not need them pays neither the compile-time cost nor the runtime dependency.
parse (default off). Enables panproto-parse and the tree-sitter-based programming-language protocols. The feature is gated, since panproto-grammars is a large dependency (hundreds of megabytes uncompressed) and most callers do not need it.

project (default off). Enables panproto-project for multi-file project assembly. Needed by tooling that operates on a whole repository rather than a single schema.

git (default off). Enables panproto-git for the git bridge covered in The git bridge. Adds the git dependency chain.

llvm (default off, experimental). Enables the panproto-llvm protocol and lowering morphisms. Requires an LLVM installation at build time.

jit (default off, experimental). Enables panproto-jit for compiling expression-language evaluations to native machine code through LLVM. Also requires an LLVM installation.
The default feature set (no features enabled) gives a caller the complete schema/migration/lens/protocol pipeline with no additional dependencies beyond the Rust crates panproto itself publishes. Adding a feature is a single entry in the caller’s Cargo.toml:
[dependencies]
panproto-core = { version = "0.32", features = ["parse", "git"] }
Listing 7.1: Opting into the tree-sitter-based parsing and the git bridge through panproto-core’s feature flags.
Idioms
A handful of idioms recur across working code.
Schema construction goes through the builder pattern of panproto_schema::builder::SchemaBuilder. The builder consumes sort and operation declarations, validates each against the protocol’s theory, and returns a Result<Schema, SchemaError>. A caller that wants to construct a schema programmatically rather than from a parsed source file uses the builder directly; a caller loading a schema from disk goes through panproto_protocols::parse instead.
Migration construction and compilation are separate steps. A user-supplied migration declaration (whether written in Rust directly or loaded through panproto-lens-dsl) goes to panproto_mig::compile and produces a compiled Migration ready for application. Running the migration on an instance is panproto_mig::lift.
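A hedged sketch of the compile-then-lift idiom. The entry points follow the chapter’s naming, but the exact signatures, and the MigrationDecl and Instance type names, are illustrative rather than the crate’s actual API:

use panproto_core::prelude::*;

// Illustrative signature: compile the user's declaration against the two
// schemas, then lift an instance through the compiled migration.
fn migrate(
    decl: MigrationDecl,
    old: &Schema,
    new: &Schema,
    data: Instance,
) -> Result<Instance, Box<dyn std::error::Error>> {
    let migration = panproto_mig::compile(decl, old, new)?;
    let lifted = panproto_mig::lift(&migration, data)?;
    Ok(lifted)
}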
Error handling uses thiserror throughout. Each sub-crate defines its own error enum in an error.rs module, and panproto-core re-exports them all. Errors are #[non_exhaustive], which allows new variants to be added without breaking callers who match on them; the trade-off is that a caller cannot exhaustively cover every variant with a match. The CLI uses miette to render these errors with source-span diagnostics, and a library caller can do the same through miette::Report.
Async operations do not exist in the core API. Every operation is synchronous, for reasons developed in The WASM boundary (the WASM runtime has no native async support at the panproto layer) and for the further reason that the core operations are CPU-bound rather than I/O-bound. A caller who needs to parallelise large-instance migrations does so at the call site through rayon or a similar crate; panproto’s own internals remain single-threaded at the function level.
When to use something else
There are two cases where a caller should reach for a sub-crate directly instead of panproto-core.
Callers whose size budget is extreme (a WASM build with tight constraints) may want to depend only on the sub-crates they use. panproto-core re-exports everything; a direct dependency on panproto-schema (say) keeps the dependency graph smaller.
Callers building a third-party protocol for inclusion in panproto-protocols should depend on panproto-gat and panproto-schema directly, without the facade. The extension chapter (Extending panproto) covers this case.
Closing
The next chapter, the TypeScript SDK, walks through the wrapper layer that runs on top of the WASM boundary for browser and Node callers. The two chapters after that cover the Python SDK and the CLI.
The TypeScript SDK
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
The TypeScript SDK, published on npm as @panproto/core, wraps the WebAssembly boundary developed in the previous chapter in a typed, ergonomic API for JavaScript and TypeScript callers. What the wrapping adds is threefold: ordinary JavaScript object lifetimes instead of manual handle management; MessagePack serialisation hidden inside method calls rather than exposed at the API level; and a fluent builder API that bundles the many discrete WASM calls of schema and migration construction into a single round-trip.
The source lives in sdk/typescript/ in the panproto repository.
Initialisation
The WASM module needs to be loaded and initialised before any panproto call. The SDK handles this through a top-level init function that takes either a URL, a Response, or a BufferSource (depending on the environment) and returns a promise that resolves when the module is ready.
import { init, Schema } from "@panproto/core";
await init(); // loads the bundled WASM from the package's own assets
const schema = await Schema.fromLexicon(lexiconJson);
Listing 8.1: Initialising the SDK from the default bundled WASM. The bundled path is a small helper that resolves to the WASM file shipped inside the npm package; callers with specific load strategies (CDN, inline data URI, custom caching) may pass an explicit URL.
The module is loaded once per process. Subsequent init calls are idempotent; the SDK caches the initialised module globally and re-uses it. A caller that needs multiple independent panproto instances (uncommon; mostly for testing) may explicitly construct fresh module contexts, but the default one-module-per-process model fits every production case.
Typed API surface
Every Rust type with a stable API has a TypeScript class that wraps it. The wrapping is thin: each class holds a u32 handle from the WASM module’s slab allocator and exposes methods that dispatch to the underlying WASM entry points, with MessagePack serialisation and deserialisation applied transparently.
The top-level classes are Schema, Migration, Instance, Lens, Protocol, and Repository (the VCS entry point). Each class has a small set of methods covering construction, validation, inspection, and the specific operations the Rust type supports. The method names match the Rust ones where possible, with camelCase substituted for snake_case per TypeScript convention.
Handle lifecycle
JavaScript has no manual memory management, but the WASM module’s handles need to be freed explicitly: a handle that is never released occupies a slot in the module’s allocator indefinitely. The SDK bridges this gap through FinalizationRegistry, which calls a cleanup function when a JavaScript object is garbage-collected.
Every SDK class registers its handle with the registry on construction. When the JavaScript object becomes unreachable, the registry invokes a callback that releases the handle back to the WASM module’s allocator. The pattern relies on the JavaScript GC running in a timely fashion; for long-running services that want deterministic release, every class also exposes an explicit dispose() method that frees the handle immediately. The classes implement Symbol.dispose by delegating to dispose(), so the TypeScript 5.2 using declaration releases handles automatically.
{
using schema = await Schema.fromLexicon(lexiconJson);
// ... work with schema ...
} // schema.dispose() is called automatically when the block exits
Listing 8.2: Deterministic handle release through the TypeScript 5.2 using keyword.
The fluent builder
Schema and migration construction from raw primitives would require many WASM round-trips (one per sort, one per operation, one per equation). The SDK has a fluent builder for these construction cases that accumulates the full declaration in TypeScript and sends it to the module as a single MessagePack-encoded payload.
const schema = await Schema.builder(protocol)
.sort("Person")
.sort("Address")
.operation("name", ["Person"], "String")
.operation("livesAt", ["Person"], "Address")
.build();
Listing 8.3: Schema construction through the fluent builder. The builder accumulates declarations locally and invokes the WASM module once at .build() time.
The builder pattern is also available on Migration and on Lens. For bulk operations, the amortised cost of a single WASM round-trip is substantially lower than the per-operation cost of dispatching each primitive through a separate call.
Type generation
The TypeScript types that represent panproto values (record shapes, pattern-matchable enums, error kinds) are generated from the Rust type definitions at publish time. The generator runs tsify against the WASM crate to produce a .d.ts file that reflects the public types with full generic parameters and discriminated unions. A caller’s TypeScript compiler sees exactly the same type surface the Rust code exposes, minus the lifetimes Rust has and TypeScript does not.
The generation step runs as part of the npm package’s build script. Consumers of @panproto/core do not need to run it themselves; they receive the pre-generated definitions with the package.
Error handling
Rust errors cross the boundary as MessagePack payloads with a discriminant and a message. The SDK deserialises them into JavaScript Error subclasses (PanprotoParseError, PanprotoTypeCheckError, PanprotoLensLawError, and so on) whose prototype chain preserves the Rust error variant information. A caller can check error instanceof PanprotoLensLawError to match on a specific failure; the classes are exported from the top-level SDK module for this purpose.
Every method that can fail returns a Promise<T> that rejects with the appropriate error subclass. The async surface is the natural one for a WASM-backed API and composes with the rest of modern JavaScript’s async machinery.
Closing
The next chapter, the Python SDK, covers the analogous wrapper for Python. Since panproto ships native PyO3 wheels (the pure-Python WASM-backed SDK having been deprecated), the Python story differs from the TypeScript one in several ways worth spelling out.
The Python SDK
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A Python caller reaching panproto uses a native PyO3 wheel, which binds the Rust implementation directly into the CPython process rather than going through the WebAssembly module the TypeScript SDK uses. The choice is a recent one. An earlier pure-Python SDK wrapped panproto-wasm and went through MessagePack on every call; it is deprecated, and the panproto package on PyPI is the native build. The present chapter covers the native SDK and says briefly why the earlier approach was abandoned.
The source lives in crates/panproto-py/ on the Rust side and sdk/python/ on the Python side.
Installation
The package is installed from PyPI in the ordinary way.
pip install panproto
Listing 9.1: Installing the panproto Python SDK. The wheel is pre-built for the most common platform-architecture combinations (Linux x86_64 and aarch64, macOS x86_64 and arm64, Windows x86_64); other platforms receive a source distribution that requires a Rust toolchain at install time.
The package bundles every subsystem the Rust SDK exposes. There is no feature-flag equivalent on the Python side; the tree-sitter-based parsers and the git bridge are both included by default. A Python caller who does not use a subsystem does not pay for it at call time; the only cost is the size of the wheel, which is dominated by the bundled tree-sitter grammars.
Why native and not WASM
The pure-Python SDK wrapping panproto-wasm had two persistent problems. Performance: every call crossed a MessagePack serialisation boundary, and the WASM runtime on CPython was significantly slower than native Rust code running directly in the Python interpreter. Memory: the WASM module maintained its own heap, and large instance payloads were duplicated across the WASM heap and the Python heap for every call.
The PyO3-based native build runs the Rust code directly inside the CPython process, with no MessagePack crossing and no WASM runtime. The cost is that panproto ships a separate wheel per platform-architecture combination, which is the industry-standard solution for Python native extensions and is what numpy, cryptography, and similar projects do.
API shape
Every Rust class with a public API has a corresponding Python class exposed through a #[pyclass] attribute on the Rust side. The Python classes own their Rust data directly rather than holding opaque handles; Python’s garbage collector manages their lifetime through the standard refcounting mechanism.
from panproto import Schema, Protocol
protocol = Protocol.load("atproto")
schema = Schema.from_lexicon(protocol, lexicon_json)
for field in schema.fields():
    print(field.name, field.type)
Listing 9.2: A minimal Python session loading an ATProto Lexicon into a schema and iterating its fields. No explicit handle management; the schema is a first-class Python object with standard lifecycle semantics.
The method names follow Python conventions (snake_case, no camelCase), and the methods themselves dispatch to the Rust functions directly. Method signatures are declared with PEP 484 type hints so that type-checkers see the real types.
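A sketch of the binding pattern on the Rust side, with a hypothetical minimal Schema standing in for the real type (pyo3 0.21-style module signature):

use pyo3::prelude::*;

// Hypothetical minimal class: owned directly, as the chapter notes, with
// Python's refcounting driving the Rust value's lifetime.
#[pyclass]
struct Schema {
    sorts: Vec<String>,
}

#[pymethods]
impl Schema {
    #[staticmethod]
    fn from_sorts(sorts: Vec<String>) -> Self {
        Schema { sorts }
    }

    /// snake_case method names, per Python convention.
    fn sort_names(&self) -> Vec<String> {
        self.sorts.clone()
    }
}

#[pymodule]
fn panproto(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_class::<Schema>()
}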
Serialisation
Values that cross the boundary between Rust and Python use pythonize, which converts between serde-compatible Rust values and native Python objects (dicts, lists, ints, strings). The conversion is zero-copy where possible and deep-copy where the Rust value’s structure requires it.
For large instance payloads the SDK exposes a streaming interface that avoids materialising the whole record set in Python at once. An instance can be iterated lazily, with each record materialised on demand, so that a migration over a million-record instance does not require a million-record Python structure to exist simultaneously.
Error handling
Rust errors become Python exceptions. Each Rust error variant maps to a Python exception subclass (PanprotoParseError, PanprotoTypeCheckError, PanprotoLensLawError, and so on), all descending from a common PanprotoError base. The Python-side exceptions are defined in the SDK’s Python code and raised by the PyO3 bindings when the underlying Rust function returns an error.
A caller can catch either the specific subclass or the base class.
from panproto import Migration, PanprotoError, PanprotoTypeCheckError

try:
    migration = Migration.compile(declaration)
except PanprotoTypeCheckError as e:
    # handle the specific case
    ...
except PanprotoError as e:
    # catch-all
    ...
Listing 9.3: Catching a specific Rust error variant at its Python equivalent.
The error messages preserve the diagnostic information the Rust side produces, with source spans when they are available. A Python caller using rich or similar can render these diagnostics at the same fidelity as the Rust SDK.
Thread safety
The native SDK releases the Python GIL for every long-running Rust operation, so calls into panproto from multiple Python threads run concurrently on multiple CPU cores. The Rust-side objects are Send and Sync where their semantics permit; the SDK’s PyO3 bindings reflect this, and a Python caller using threading or concurrent.futures sees real parallelism rather than the GIL-bound concurrency of ordinary Python code.
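The GIL-release pattern on the Rust side looks like the sketch below, with a trivial stand-in for the CPU-bound work; the lift_instance name is hypothetical:

use pyo3::prelude::*;

#[pyfunction]
fn lift_instance(py: Python<'_>, records: Vec<i64>) -> Vec<i64> {
    // Release the GIL while the CPU-bound Rust work runs, so other Python
    // threads make progress and parallel calls use multiple cores.
    py.allow_threads(move || records.into_iter().map(|r| r + 1).collect())
}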
Callers using multiprocessing can pickle panproto values across process boundaries through the standard protocol; the SDK’s types implement __reduce__ for this purpose.
Closing
The next chapter, the CLI, covers the schema command-line tool built on top of panproto-cli. The CLI is the highest-level interface panproto ships and is where a first-time user typically begins.
The CLI
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Most first-time users of panproto encounter it through the schema command — a single binary built from panproto-cli that exposes the engine’s operations as subcommands, covering the same ground the SDKs of the previous chapters cover programmatically. The present chapter is a reference for the subcommand surface, organised by workflow rather than alphabetically.
Installation
The binary is installed through cargo install:
cargo install panproto-cli
Listing 10.1: Installing the schema command from crates.io. Pre-built binaries are also published on GitHub Releases for common platforms; consult the installation page of the main documentation for platform-specific instructions.
The binary’s name is schema, not panproto. The choice follows the convention of domain-specific tooling naming the command after the domain object rather than the tool: git operates on repositories, docker on containers, schema on panproto schemas. A developer who finds schema ambiguous may alias it in their shell.
Define a schema
A schema is defined by writing a source file in whichever protocol the schema belongs to (an ATProto lexicon JSON, an Avro IDL file, a Rust source file, and so on) and running:
schema define my-schema.json
The subcommand invokes the relevant protocol’s parser, validates the resulting schema against the protocol’s theory, and stores the schema in the working directory’s .panproto object database. The schema is addressable by the identifier declared in the source file.
Migrate across schema versions
A migration is declared in a separate file (migration DSL, typically in Nickel or JSON) and applied with:
schema migrate --from old-schema --to new-schema --using migration.ncl data.json
The subcommand reads the migration declaration, compiles it through panproto_mig::compile, runs the compiled migration against the input instance (in data.json), and writes the output instance to stdout or to a file named by --output. If the migration cannot be compiled or the input cannot be lifted, the failure is reported at the specific stage responsible, with the diagnostics of The restrict/lift pipeline.
Check compatibility
schema check compares two schemas and reports their compatibility classification, along the lines of Apache Avro’s backward-forward-full compatibility. The subcommand is the front-end for panproto-check and is what the CI integration described in the panproto-breaking-change-ci skill invokes.
schema check old-schema new-schema
The output is a compatibility verdict (backward, forward, full, or incompatible) together with the specific schema changes the classification rests on. The exit code is zero on full compatibility and non-zero otherwise, so the command can gate a CI pipeline.
Version control
A panproto-vcs repository is initialised with schema init, which creates a .panproto directory in the current working directory. Commits, branches, merges, and log inspection work through the sub-tree of schema vcs commands:
schema vcs commit -m "Add email field to Patient"
schema vcs branch --from main feature-branch
schema vcs merge main feature-branch
schema vcs log
The interface follows git’s conventions closely. A reader comfortable with git’s CLI will find the schema vcs subcommands to have the same shape, with the addition of a small number of schema-aware commands (schema vcs diff-schemas, schema vcs blame-schema) that operate on panproto’s extended object model rather than on byte trees.
Convert between formats
schema convert is the umbrella subcommand for cross-protocol translations. It reads data in one format, applies a cross-protocol lens (covered in Part II and the panproto-cross-protocol skill), and writes the data out in another format.
schema convert --from avro --to protobuf schema.avsc > schema.proto
schema convert --from avro-data --to protobuf-data --with schema-map.ncl data.avro > data.pb
The first form converts a schema definition; the second converts instance data against a pair of mapped schemas. Both invoke the same protocol-registry machinery.
Stable and experimental subcommands
Some subcommands are marked [experimental] in the help output. These include schema llvm (lowering to LLVM IR through the experimental panproto-llvm crate), schema jit (native-code compilation of expressions through panproto-jit), and schema xrpc (remote-repository operations through panproto-xrpc). These subcommands are subject to change between releases and are not covered by the semver guarantees of the main API surface. A note in Experimental and feature-gated subsystems gives more detail on each.
Shell completion
The binary emits shell completions for bash, zsh, and fish through schema completions <shell>. A typical installation runs the command once and sources the output in the shell’s rc file.
Closing
The next chapter opens Part VII, which is for contributors. It starts with the workspace layout, then covers the CI pipeline, how to extend panproto with a new protocol or lens combinator, and the experimental and feature-gated subsystems the main chapters have gestured at.
Workspace layout
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Twenty-five crates live in panproto’s Cargo workspace, organised across six layers: foundations, transformation, infrastructure, protocols and auxiliary tooling, SDK bindings, and experimental extensions. The layering is strict — no crate depends on anything strictly above it — and a contributor working on panproto will want to know which layer their crate belongs to before writing their first patch.
The root workspace manifest is Cargo.toml at the repository root; the individual crates are under crates/. The non-Rust packages — the TypeScript SDK and the Python SDK’s Python-side code — live under sdk/typescript/ and sdk/python/.
The layers
The foundations layer is where the core mathematical types live. Theories, morphisms, and colimits live in panproto-gat; the expression language is in panproto-expr; schemas as models and the instances under those schemas are defined in panproto-schema and panproto-inst respectively. These four depend only on one another (schema and inst sit on gat and expr); none of them depends on any other crate in the workspace.
The transformation layer sits on top of the foundations. panproto-mig is the migration engine; it depends on panproto-gat, panproto-schema, panproto-inst, and panproto-expr. panproto-lens is the bidirectional-lens machinery; it depends on the transformation layer as a whole.
The infrastructure layer is the cross-cutting one. The version-control engine lives in panproto-vcs, instance serialisation and deserialisation in panproto-io, and the breaking-change classifier in panproto-check. The facade crate is panproto-core, and the command-line binary is panproto-cli.
The protocols-and-auxiliary layer is the widest. panproto-protocols holds the 51+ protocol definitions, organised into the category subdirectories documented in Defining a protocol. panproto-parse and panproto-grammars together implement the tree-sitter auto-derivation of programming-language protocols. panproto-theory-dsl and panproto-lens-dsl are the declarative DSLs for theory and lens specifications in Nickel, JSON, or YAML form. panproto-expr-parser is the Haskell-style surface parser for the expression language. panproto-project is the multi-file project-assembly layer.
The SDK bindings layer is what non-Rust callers touch. The WASM boundary is in panproto-wasm, and the Python PyO3 native binding is in panproto-py.
The experimental layer is feature-gated and subject to change. It contains the git bridge in panproto-git (covered in The git bridge), the LLVM IR protocol and lowering in panproto-llvm, the expression JIT compiler in panproto-jit, and the XRPC-based remote-repository client in panproto-xrpc. The stand-alone panproto-git-remote binary (installed as git-remote-panproto) is also in this layer.
The dependency graph
Dependencies run downward across layers: each layer depends only on the layers beneath it. The foundations layer depends on no other workspace layer. The transformation layer depends on foundations. Infrastructure depends on foundations and transformation, with panproto-vcs being the largest infrastructure crate and depending on most of the transformation machinery to implement its merge algorithm (Merge as pushout).
The protocols-and-auxiliary layer depends on everything below it: a protocol needs theories from gat, schema construction from schema, instance construction from inst, and occasionally migration machinery from mig when the protocol ships with built-in migrations.
The SDK bindings layer sits at the top of the graph and depends on panproto-core’s re-exports plus any other sub-crates they bind specifically. The experimental layer is a sibling of the bindings layer, with its own top-level dependencies.
The graph is acyclic by design and by Cargo’s enforcement: any attempt to introduce a cycle is caught at compile time. A cargo-hakari pass in CI verifies that the workspace-hack crate (the dependency-unification crate hakari generates to improve build times) is up to date.
Crate conventions
Every crate follows the same internal conventions, documented in the rust-conventions skill. The public API lives in src/lib.rs with a module tree under it. Error types go in an error.rs module and use thiserror with #[non_exhaustive] variants. Each crate has a tests/ directory for integration tests and uses #[cfg(test)] modules for unit tests inline with the implementation. Docs tests are run alongside integration tests in CI. Public types are Serialize + Deserialize through serde wherever their semantics permit.
A new crate that wants to live in the workspace adopts these conventions and is added to the root Cargo.toml and to the relevant hakari configuration.
Closing
The next chapter, CI, semver-checks, and release, covers the pipeline that runs against every commit and release, and the invariants it enforces.
CI, semver-checks, and release
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Every commit to panproto’s main branch runs through a CI pipeline whose job is to catch regressions before they land — across formatting, lints, tests, documentation, semver compatibility, and the workspace-hack invariant. The present chapter walks through each check, says why it exists, and gives the local command a contributor can run to reproduce it before pushing.
The pipeline definition lives in .github/workflows/ci.yml.
The checks
Formatting runs first through cargo fmt --all --check. The workspace uses the default rustfmt.toml configuration, with no overrides. A commit that fails fmt --check does not advance to the later stages; the local equivalent is cargo fmt --all.
Linting comes next, through cargo clippy --workspace --all-targets -- -D warnings. Every warning is a CI failure, and contributors do not silence warnings with #[allow] attributes unless the silencing is justified with a comment. The clippy configuration includes clippy::pedantic and clippy::nursery, which catch patterns that the default clippy level lets through. The local equivalent is the same command, cargo clippy --workspace --all-targets -- -D warnings, so that warnings fail locally exactly as they do in CI.
The test suite runs through cargo nextest run --workspace. Nextest replaces the default cargo test for speed: it runs tests in parallel more effectively and produces structured output. The test suite includes property-based tests (through proptest) for the mathematical invariants panproto promises, including lens-law checking and functor-axiom verification.
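As a sketch of the property-based style, a functor-axiom test might look like the following, using Option’s map as a stand-in functor rather than panproto’s own functor encoding:

```rust
use proptest::prelude::*;

proptest! {
    // Functor law 1: mapping the identity function changes nothing.
    #[test]
    fn preserves_identity(x in any::<Option<i32>>()) {
        prop_assert_eq!(x.map(|v| v), x);
    }

    // Functor law 2: mapping a composition equals composing the maps.
    #[test]
    fn preserves_composition(x in any::<Option<i32>>()) {
        let f = |v: i32| v.wrapping_add(1);
        let g = |v: i32| v.wrapping_mul(2);
        prop_assert_eq!(x.map(|v| g(f(v))), x.map(f).map(g));
    }
}
```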
Documentation is checked by running cargo doc --workspace --no-deps --all-features with RUSTDOCFLAGS="-D warnings" set, which verifies that rustdoc can render the whole workspace without warnings. Broken doc-links and missing-documentation lints on public items are CI failures.
Semver compatibility is checked with cargo semver-checks, which runs against the last published release on crates.io and rejects any breaking API change that is not accompanied by a semver-major version bump. The check catches removed public items, changed function signatures, changed type exports, and added required trait methods. The local equivalent is cargo semver-checks, after installing the binary through cargo install cargo-semver-checks.
The workspace-hack invariant is verified by cargo hakari verify, which confirms that the workspace’s generated workspace-hack crate is up to date. Hakari speeds up builds by unifying dependency features across the workspace; the generated crate has to be regenerated whenever workspace dependencies change, and CI rejects commits that leave it stale. The local equivalent is cargo hakari generate && cargo hakari manage-deps.
The pipeline runs against two toolchain versions: the MSRV (minimum supported Rust version, currently two releases behind stable) and the current stable. A commit is merged only once both runs pass.
Release
A release follows the workflow documented in the release skill. The steps are version bump (through release-please or the equivalent), changelog update (driven by git-cliff against conventional commits), test-suite dry-run across all SDK builds, publication to crates.io for Rust crates, to npm for the TypeScript SDK, and to PyPI for the Python wheel. Each step is scripted; a human is in the loop for the final version-bump commit and the git tag.
The SDK wheels for Python are built across platforms through a matrix job in .github/workflows/python-wheels.yml. The matrix covers Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64). A source distribution is published alongside the binary wheels for platforms the matrix does not cover.
Pre-commit hooks
Contributors who want local feedback before pushing install the pre-commit hooks through the panproto-pre-commit-hooks skill. The hooks run a subset of CI locally on every commit: formatting, clippy on the changed files, and a fast test subset. A full CI reproduction is still a good idea before pushing to a long-lived branch, but the hooks catch the common cases (a missed rustfmt, a new clippy warning, a regressed test) without waiting for CI.
Reproducing CI locally
The lint and test skills each run a bundled subset of the CI pipeline. lint covers formatting, clippy, and doc warnings; test covers the test suites. A contributor who runs both skills sequentially reproduces CI’s correctness checks; the semver-checks and hakari checks are run separately through the semver and hakari skills.
The sequence that reproduces every CI step is documented in the contributing skill and is worth running end-to-end before a first PR, so that a contributor sees the same pass and fail output the CI pipeline produces.
Closing
The next chapter, Extending panproto, walks through the three most common contributions: a new protocol, a new lens combinator, and a new expression-language builtin. Each has a specific shape that the crate structure documented here encourages.
Extending panproto
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A contributor writing new code for panproto is usually doing one of three things: adding support for a data format or schema language the engine does not yet know, adding a transformation shape the lens library does not yet offer, or adding a small computation the expression language needs but cannot express with its existing builtins. The present chapter walks through each case end-to-end.
A new protocol
Adding a protocol is the most common contribution and follows a predictable shape. The contributor starts by identifying the protocol’s category (serialisation, database, document, or domain) and the subdirectory of panproto-protocols that houses it. A new file under that subdirectory collects the four artefacts the protocol supplies (the GAT, the parser, the emitter, the registry entry) and a mod.rs update adds the file to the crate’s module tree.
The theory is written first, usually through panproto-theory-dsl in Nickel form for readability, though a direct Rust construction through the panproto_gat::theory::Theory builder works equally well. A type-checker pass validates the theory before the parser and emitter are written against it.
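A minimal sketch of the direct-Rust route; the builder method names (sort, operation, build) are illustrative stand-ins, not panproto-gat’s actual surface:

```rust
use panproto_gat::theory::Theory;

// A toy key-value theory: three sorts and two accessor operations.
// Method names below are illustrative, not the real builder API.
fn key_value_theory() -> Theory {
    Theory::builder("org.example.kv")
        .sort("Key")
        .sort("Value")
        .sort("Entry")
        .operation("key", &["Entry"], "Key")     // key : Entry -> Key
        .operation("value", &["Entry"], "Value") // value : Entry -> Value
        .build()
}
```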
The parser is the longest piece. It consumes bytes in the protocol’s native format and produces a Schema value under the theory. Panproto’s existing parsers vary in strategy: the Avro parser is a hand-written recursive-descent parser over the JSON form, the FHIR parser uses serde_json directly and maps the resulting Value tree to schema construction calls, and the tree-sitter-based language parsers run the tree-sitter library and walk the resulting concrete-syntax tree. A contributor picks the strategy that fits their protocol’s format.
The emitter is the inverse. It takes a schema and serialises it back to the protocol’s native format, respecting the round-trip property (the emitter’s output, parsed again, produces a schema equivalent to the original). Round-trip equivalence is verified by a test suite the contributor writes alongside the parser and emitter.
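The round-trip pattern in miniature. The toy Schema, parse, and emit below are stand-ins chosen to keep the test self-contained; they are not panproto’s API:

```rust
#[derive(Debug, PartialEq, Clone)]
struct Schema {
    sorts: Vec<String>,
}

// Toy parser: a comma-separated list of sort names.
fn parse(bytes: &[u8]) -> Schema {
    let text = String::from_utf8_lossy(bytes);
    Schema {
        sorts: text.split(',').map(|s| s.to_owned()).collect(),
    }
}

// Toy emitter: the inverse serialisation.
fn emit(schema: &Schema) -> Vec<u8> {
    schema.sorts.join(",").into_bytes()
}

#[test]
fn round_trip_up_to_equivalence() {
    let original = parse(b"Key,Value,Entry");
    let reparsed = parse(&emit(&original));
    // Here the declared equivalence is plain structural equality.
    assert_eq!(original, reparsed);
}
```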
The registry entry is a two-line call to the register function of panproto-protocols, binding the theory, parser, and emitter under a string identifier. The identifier is usually the reverse-DNS name of the protocol’s specification organisation, to avoid collisions with other protocols.
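The shape of that call, with a hypothetical signature; the real register function’s argument order and types are panproto-protocols’ to define:

```rust
// Illustrative only: binds the theory, parser, and emitter under the
// reverse-DNS identifier used as the protocol's registry key.
panproto_protocols::register("org.example.kv", theory, parser, emitter);
```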
A protocol is considered tested when three checks pass together: the theory validates as well-formed, parse-then-emit reproduces the original bytes up to a declared equivalence, and at least one integration test drives a schema through a trivial migration and inspects the result.
A new lens combinator
Lens combinators extend panproto-lens with new ways of composing lenses or of constructing lenses from ingredients the library did not previously support. A contributor adding a combinator follows the Cambria pattern documented in Bidirectional lenses: the combinator is a function that takes inputs and returns a Lens value whose get and put are derived from the inputs.
The work is in verifying the round-trip laws. A combinator’s put must satisfy GetPut and PutGet against the combinator’s get; a contributor cannot rely on the laws being obvious from the code and must write a law-verification test. The standard pattern is a proptest test over the combinator’s input space that constructs a lens, samples input values and modifications, and checks that the round-trip equations hold pointwise.
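A minimal sketch of that pattern, assuming a simplified Lens representation (the real panproto-lens types are richer). The combinator under test focuses the first component of a pair:

```rust
use proptest::prelude::*;

// Simplified asymmetric lens: get reads the view, put writes it back.
struct Lens<S, V> {
    get: fn(&S) -> V,
    put: fn(&S, &V) -> S,
}

// Hypothetical combinator output: a lens onto the first tuple component.
fn first_lens() -> Lens<(i64, i64), i64> {
    Lens {
        get: |s| s.0,
        put: |s, v| (*v, s.1),
    }
}

proptest! {
    // GetPut: putting back what you just got is a no-op.
    #[test]
    fn get_put(s in any::<(i64, i64)>()) {
        let l = first_lens();
        prop_assert_eq!((l.put)(&s, &(l.get)(&s)), s);
    }

    // PutGet: getting after a put returns exactly what was put.
    #[test]
    fn put_get(s in any::<(i64, i64)>(), v in any::<i64>()) {
        let l = first_lens();
        prop_assert_eq!((l.get)(&(l.put)(&s, &v)), v);
    }
}
```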
Combinators that cannot satisfy the round-trip laws on all inputs but do satisfy them on a restricted subset are still valuable; the contributor declares the restriction explicitly and documents it in the combinator’s rustdoc. A typical restriction is “GetPut holds for sources whose complement is consistent with the modification’s view”; panproto-lens has a trait for expressing this kind of pre-condition, and the law-verification test respects it.
Symbolic simplification rules (panproto_lens::symbolic) may need updating when a new combinator is added. The simplifier is a term-rewriting system over a finite set of lens identities; a new combinator that reduces under composition with existing ones should have its simplification rules added.
A new expression-language builtin
Builtins live in panproto_expr::builtin. Each builtin is a function with a fixed arity, a type signature declared against panproto-expr’s type language, and a reduction rule that evaluates it on value arguments. A contributor adding a builtin writes:
- A new variant of the `Builtin` enum.
- A type-checker clause in `panproto_expr::typecheck` that accepts the builtin and produces its result type.
- A reduction rule in `panproto_expr::eval` that evaluates the builtin on values.
- Documentation in the public rustdoc naming the builtin’s semantics precisely enough that the reduction rule is a consequence.
- At least one test per kind of input the builtin accepts.
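A sketch of the first three steps together, with simplified Type and Value representations and a hypothetical strLen builtin; the real panproto-expr types are richer:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Type { Int, Str }

#[derive(Debug, Clone, PartialEq)]
enum Value { Int(i64), Str(String) }

enum Builtin {
    // ... existing variants ...
    StrLen, // hypothetical new builtin: strLen : Str -> Int
}

// Type-checker clause: fixed arity, argument types, and result type.
fn builtin_signature(b: &Builtin) -> (Vec<Type>, Type) {
    match b {
        Builtin::StrLen => (vec![Type::Str], Type::Int),
    }
}

#[derive(Debug)]
struct BuiltinError(String);

// Reduction rule: total on well-typed inputs, an error (never a panic)
// on anything else.
fn eval_builtin(b: &Builtin, args: &[Value]) -> Result<Value, BuiltinError> {
    match (b, args) {
        (Builtin::StrLen, [Value::Str(s)]) => Ok(Value::Int(s.len() as i64)),
        _ => Err(BuiltinError("ill-typed builtin application".into())),
    }
}
```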
Builtins must be total on well-typed inputs. A builtin that would otherwise fail on a bad input (division by zero, taking the head of an empty list) produces a BuiltinError that the evaluator propagates as a standard EvalError at the call site. Returning an error rather than panicking is critical: panproto expects every builtin to behave predictably on every input the type-checker admits.
Builtins are not casually added. A builtin that covers a genuine gap in what the expression language can express is worth the addition; a builtin that can be composed from existing ones is better not added, since it enlarges the trusted surface without giving the language anything new. A contributor should discuss a proposed builtin on the issue tracker before writing the PR, partly to ensure the gap is real and partly to align on the name and signature.
Adding a cross-protocol translation
A separate kind of contribution is a cross-protocol translation: a lens between two existing protocols that panproto does not yet know how to mediate. The construction goes through panproto_lens::symmetric (see Bidirectional lenses) and is registered through panproto-protocols in the cross-protocol table. The details are in the panproto-cross-protocol skill.
Closing
The final chapter of Part VII, Experimental and feature-gated subsystems, covers the subsystems that are in the workspace but have not yet stabilised. A contributor whose extension falls into one of those subsystems should read that chapter before proceeding.
Experimental and feature-gated subsystems
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
Not every crate in the workspace is production-ready. Three are feature-gated and marked experimental — panproto-llvm, panproto-jit, and panproto-xrpc — for reasons specific to each: an LLVM installation is a substantial dependency, the JIT compiler is promising but still incomplete, and the XRPC remote-repository support is not yet stable across registry differences. The present chapter describes each crate, says where it is currently reliable, and names the acceptance criteria for promoting it to the stable default-on configuration. A contributor working in any of them should treat the API surface as subject to change between minor versions.
panproto-llvm
panproto-llvm registers LLVM IR as a panproto protocol: a theory whose sorts are LLVM IR’s kinds (modules, functions, basic blocks, instructions, types), whose operations are the structural accessors of those kinds, and whose equations encode LLVM’s well-formedness conditions. A schema under this protocol is an LLVM module’s structure at the theory level, and a migration is a structural transformation of the module (rename an intrinsic, reorganise a function’s control flow, add a pass’s effect to the module’s metadata).
The crate is feature-gated, since it depends on an LLVM installation at build time: a significant cost for the large majority of panproto users, who do not need it. When the feature is enabled, the LLVM bindings go through inkwell against whichever LLVM version the user has installed, and the schema-level operations dispatch to LLVM’s own module-manipulation APIs for the operations panproto cannot express purely in its own term language.
The current state is that theory construction is complete, and migration between LLVM versions (an LLVM IR migration from LLVM 16 to LLVM 17, for example) works for the well-typed subset of modules. What does not yet work reliably is complex inter-procedural optimisations expressed as migrations; the existence-checker rejects many plausible such migrations for reasons that appear to be conservative rather than necessary.
Acceptance criterion: the LLVM 17-to-LLVM 18 migration should apply to the Linux kernel’s full IR output without rejection, and the round-trip (parse then emit) through the panproto representation should produce IR that the LLVM assembler accepts without adjustment.
panproto-jit
panproto-jit compiles panproto-expr evaluations to native code through LLVM. When a migration engine needs to run the same expression against a large number of instance records, the cost of the small-step evaluator becomes noticeable; JIT-compiling the expression into a single native function amortises the cost across the records and improves throughput on large datasets. The compiler architecture follows the standard pipeline treatment of Appel (1998).
The crate is feature-gated alongside panproto-llvm and shares the LLVM-installation dependency. When enabled, it replaces the small-step evaluator of Totality and termination for expressions above a configurable size threshold, falling back to the interpreter for expressions below the threshold (where the compile cost would not pay back).
The current state covers integer arithmetic, string operations, and simple list operations. Pattern matching and list comprehensions are not yet supported; an expression containing either falls back to the interpreter. Totality guarantees are preserved (the JIT respects the step and depth limits through explicit counter decrements in the compiled code), but on the supported subset the compiled code beats the interpreter noticeably only for expressions evaluated over many records.
Acceptance criterion: pattern matching support, a benchmark where the JIT outperforms the interpreter by at least 5x on a representative migration workload, and deterministic equivalence between JIT and interpreter outputs on the standard test suite.
panproto-xrpc
panproto-xrpc implements a client for remote panproto-vcs operations through the XRPC protocol used by the AT Protocol ecosystem. The crate allows a panproto-vcs repository to be pushed to and pulled from a remote server that exposes a standard XRPC endpoint, providing an alternative to the git-bridge approach of The git bridge for collaborations that are already running their own XRPC-based infrastructure.
The crate is feature-gated for two reasons: the XRPC dependency is specific to AT Protocol and does not benefit users not working in that ecosystem, and the remote-repository semantics for panproto-vcs are not yet stable. A push to a remote XRPC repository works; a pull works; a merge across a remote branch and a local branch works in the common cases. The cases that do not yet work reliably are the ones where the remote and local sides have different sets of registered protocols, so that a commit on the remote references a protocol the local side does not know about.
Acceptance criterion: cross-protocol-registry merge resolution, a stability guarantee on the XRPC wire format across minor versions, and a round-trip test suite comparable to the git bridge’s.
The pattern
Each of these subsystems lives in the workspace rather than in a separate repository so that it can evolve alongside the main panproto stack. A change to panproto-gat’s theory representation, for example, is propagated to panproto-llvm at the same time as to the other theory consumers, which avoids the drift that a separate-repository arrangement would produce.
A contributor who wants to work on any of these crates has to opt in through feature flags and be aware that their work is against a moving target. In exchange, promising ideas can be tried out in tree, measured against the rest of the stack, and either promoted or abandoned without the friction of separate-repository maintenance.
A crate is promoted to stable when its acceptance criterion is met and a release notes the promotion in its changelog. Demotion back to experimental is possible in principle, but it has not yet happened.
Further reading
For the JIT compiler architecture, Appel (1998) is the standard pipeline treatment. The LLVM project’s own documentation at https://llvm.org/docs/ covers the IR and the pass infrastructure panproto-llvm integrates with. For the XRPC protocol that panproto-xrpc implements, Bluesky PBC (n.d.) is the authoritative specification.
Closing
Part VII ends with this chapter. The appendices that follow contain a notation reference, a glossary, and an open-problems list covering places where panproto’s implementation or the book’s mathematical account is ahead of the other.
Notation reference
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A lookup for the non-standard symbols and typographic conventions used in the book. The table lists each symbol, its meaning, and the chapter in which it is introduced.
Categorical notation
| Symbol | Meaning | Introduced in |
|---|---|---|
| $\mathcal{C}$ | A category | Categories |
| $\mathrm{Ob}(\mathcal{C})$ | The class of objects of $\mathcal{C}$ | Categories |
| $\mathcal{C}(A, B)$, $\mathrm{Hom}_{\mathcal{C}}(A, B)$ | The hom-set of morphisms from $A$ to $B$ in $\mathcal{C}$ | Categories |
| $f : A \to B$ | A morphism from $A$ to $B$ | Categories |
| $g \circ f$ | Composition of $f$ and $g$ | Categories |
| $\mathrm{id}_A$ | The identity morphism on $A$ | Categories |
| $A \cong B$ | $A$ and $B$ are isomorphic | Categories |
| $F : \mathcal{C} \to \mathcal{D}$ | A functor | Functors and natural transformations |
| $\alpha : F \Rightarrow G$ | A natural transformation | Functors and natural transformations |
| $\alpha_A$ | The component of a natural transformation at $A$ | Functors and natural transformations |
| $A \times B$, $\pi_1$, $\pi_2$ | Product of $A$ and $B$ with its projections | Universal properties |
| $A + B$, $\iota_1$, $\iota_2$ | Coproduct of $A$ and $B$ with its injections | Universal properties |
| $\langle f, g \rangle$ | Pairing into a product | Universal properties |
| $[f, g]$ | Co-pairing out of a coproduct | Universal properties |
| $1$, $0$ | Terminal and initial objects | Universal properties |
| $D : J \to \mathcal{C}$ | A diagram in $\mathcal{C}$ with shape $J$ | Colimits and pushouts |
| $\Delta_C$ | The constant functor at $C$ | Colimits and pushouts |
GAT and model notation
| Symbol | Meaning | Introduced in |
|---|---|---|
| $\mathbb{T}$, $\mathbb{S}$ | A generalised algebraic theory / a panproto protocol | Algebraic and generalised algebraic theories |
| $\mathcal{C}(\mathbb{T})$ | The syntactic contextual category of $\mathbb{T}$ | Algebraic and generalised algebraic theories |
| $\mathbf{Mod}(\mathbb{T})$ | The category of models of $\mathbb{T}$ | Algebraic and generalised algebraic theories |
| $M$ | A model / a panproto schema | Protocols as theories, schemas as instances |
| $\phi : \mathbb{S} \to \mathbb{T}$ | A theory morphism | Theory morphisms and instance migration |
| $\phi^*$, $\phi_!$, $\phi_*$ | The pullback and pushforward functors along $\phi$ | Theory morphisms and instance migration |
| $m : M \to N$ | A migration between panproto schemas | Theory morphisms and instance migration |
| $\mathrm{Inst}_{\mathbb{T}}$ | The instance functor for protocol $\mathbb{T}$ | Protocols as theories, schemas as instances |
Lens notation
| Symbol | Meaning | Introduced in |
|---|---|---|
| $\mathrm{get}$, $\mathrm{put}$ | The two functions of an asymmetric lens | Bidirectional lenses |
| GetPut, PutGet, PutPut | The lens round-trip laws | Bidirectional lenses |
| $\ell$ | A protolens | Protolenses |
| $S$, $T$ | Source and target of a schema-indexed lens | Protolenses |
Type-theoretic notation
| Symbol | Meaning | Introduced in |
|---|---|---|
| `a -> b`, $a \to b$ | A function type | Categories |
| `.`, $\circ$ | Function composition (Haskell period, mathematical circle) | Categories |
| `id :: a -> a` | Haskell’s polymorphic identity function | Categories |
| `{l1 = e1, ...}` | Record literal (panproto-expr) | Syntax and semantics |
| `[e1 \| x <- e2]` | List comprehension (panproto-expr) | Syntax and semantics |
| $e \longrightarrow e'$ | Small-step reduction | Syntax and semantics |
Typographic conventions
Inline code and type names appear in monospace: SchemaBuilder, panproto-mig, compile. Italicised terms (category, protolens, complement) mark a definition being introduced on first appearance, and bold terms (Associativity, Identity) are concept labels the reader may want to scan for.
Every first mention of an external tool or library is a hyperlink to its canonical home page: Haskell, Rust, tree-sitter. Every first mention of a panproto module, type, or function is a hyperlink to the corresponding docs.rs page. Later mentions in the same chapter appear as plain prose.
Citations follow Pandoc syntax rendered by mdbook-bib in Chicago author-date style. Inline: @key for textual (“Smith 2010”), [@key] for parenthetical (“(Smith 2010)”), [-@key] for year only. Every cited work has a BibTeX entry in book/src/references.bib.
Glossary
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A glossary of defined terms used in more than one chapter, with back-links to the chapter that introduces each.
arrow. Synonym for morphism. Used more often when drawing diagrams. Categories.
asymmetric lens. A lens whose source is larger than its view, with a put function that takes the old source as an argument to preserve data outside the view. Bidirectional lenses.
auto-migration. A migration inferred automatically by panproto-vcs from the diff between two schema versions, without an explicit user-written declaration. Data versioning.
BibTeX. The citation-entry format used by references.bib. Notation reference.
blob. A content-addressed byte sequence. Used in both git’s and panproto-vcs’s object models. What git already versions and what it does not.
category. The mathematical structure at the centre of the book: a class of objects, hom-sets of morphisms between them, a composition operation, and identity morphisms, subject to associativity and identity laws. Categories.
cocone. A natural transformation from a diagram $D$ to a constant functor. The universal cocone under $D$ is the colimit of $D$. Colimits and pushouts.
colimit. A universal cocone under a diagram. Coproducts, pushouts, and coequalizers are specific cases. Colimits and pushouts.
complement. For a lens, the part of the source that put preserves while modifying the view. Made explicit in Cambria-style lenses. Bidirectional lenses.
composition. The primitive operation of a category: given two morphisms whose ends meet, a third morphism from the start of the first to the end of the second. Categories.
contextual category. The categorical structure of a generalised algebraic theory, equipped with the machinery needed to interpret dependent sorts. Algebraic and generalised algebraic theories.
coproduct. The universal cocone under a two-object discrete diagram. In $\mathbf{Set}$ it is the disjoint union. Universal properties.
diagram. A functor from a small shape category into a target category. Colimits and pushouts.
functor. A structure-preserving mapping between categories, sending objects to objects and morphisms to morphisms while respecting composition and identity. Functors and natural transformations.
GAT. Generalised algebraic theory, in the sense of Cartmell (1986). The expressive theory formalism that panproto uses to specify protocols. Algebraic and generalised algebraic theories.
Hask. The category of Haskell types and functions between them, under the idealisation that every function is total. Categories.
hom-set. The set of morphisms between two fixed objects of a category, written $\mathcal{C}(A, B)$ or $\mathrm{Hom}_{\mathcal{C}}(A, B)$. Categories.
identity. The morphism required to exist for every object of a category and to act as a two-sided unit for composition. Also the identity function in a programming language, the cat command with no arguments in Unix, and the identity migration in panproto. Categories.
instance. A record set under a panproto schema, representing the data that lives under the schema. Protocols as theories, schemas as instances.
isomorphism. A morphism $f$ with a two-sided inverse $f^{-1}$. Objects with an isomorphism between them are isomorphic, written $A \cong B$. Categories.
lens. A pair of functions get and put between two data structures that together behave like a disciplined two-way translation, subject to the round-trip laws. Bidirectional lenses.
Lexicon. The schema language of ATProto, consisting of JSON documents that declare record types and their constraints. ATProto lexicons.
lift. The operation that applies a compiled migration to an instance, producing an instance under the target schema. The restrict/lift pipeline.
migration. A morphism of models in the category $\mathbf{Mod}(\mathbb{T})$ for a protocol $\mathbb{T}$, packaged with a pushforward choice at each extension site. Theory morphisms and instance migration.
model. A structure-preserving functor from the syntactic contextual category of a theory into a target contextual category. A panproto schema is a model of its protocol’s theory. Algebraic and generalised algebraic theories.
morphism. An arrow in a category. Synonym for arrow when the context is formal. Categories.
natural transformation. A morphism between functors that respects the categorical structure on both sides; a collection of component morphisms indexed by objects of the source category, satisfying the naturality square. Functors and natural transformations.
product. The universal cone over a two-object discrete diagram. In $\mathbf{Set}$ it is the Cartesian product. Universal properties.
protocol. A generalised algebraic theory together with a parser, an emitter, and a registered entry in the protocols registry. Defining a protocol.
protolens. A schema-indexed family of lenses. Protolenses.
pullback functor. The functor $\phi^* : \mathbf{Mod}(\mathbb{T}) \to \mathbf{Mod}(\mathbb{S})$ induced by a theory morphism $\phi : \mathbb{S} \to \mathbb{T}$, obtained by reading a model through $\phi$. Theory morphisms and instance migration.
pushforward functor. One of the two adjoints of the pullback, $\phi_!$ (left) or $\phi_*$ (right), going from $\mathbf{Mod}(\mathbb{S})$ to $\mathbf{Mod}(\mathbb{T})$. Theory morphisms and instance migration.
pushout. The colimit of a span, the universal cocone under a three-object diagram with two arrows from a common source. Colimits and pushouts.
schema. A panproto schema, in the sense of this book: a model of a registered protocol’s generalised algebraic theory. Includes protocol schemas (ATProto, Avro, and so on), relational schemas, and programming-language grammars derived from tree-sitter. Protocols as theories, schemas as instances.
serialize / deserialize. The operations that turn a Rust value into bytes and back. In panproto, handled through serde.
symmetric lens. A lens between two structures neither of which is smaller than the other, with a shared complement between them. Bidirectional lenses.
theory morphism. A structure-preserving mapping between generalised algebraic theories, equivalent to a functor between the contextual categories they generate. Theory morphisms and instance migration.
tree-sitter. The incremental parsing library used by panproto to auto-derive protocols for programming languages. Tree-sitter and full-AST parsing.
WASM boundary. The opaque-handle, MessagePack-serialised boundary between panproto’s Rust core and its non-Rust clients. The WebAssembly boundary.
Open problems
Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.
A list of questions where panproto’s implementation and the book’s mathematical account are not yet in full agreement, or where we do not know of prior work on a topic panproto addresses. Each entry names the question, the current state, and what would count as progress.
Where we don’t know of prior work
Protolenses as a first-class abstraction
Panproto’s protolens construction, developed in Protolenses, is a schema-indexed lens family that admits auto-generation, symbolic simplification, and an explicit cost model. We are not aware of a published treatment of this exact construction. The closest analogues (profunctor optics, indexed lenses in the sense of Pacheco, Cunha, and Hu) differ in the indexing structure.
Progress would be a writeup placing protolenses in the optics literature, either as a known construction under another name or as a genuine novelty. A contribution from a reader familiar with recent optics work would be welcome.
The instance functor’s symbolic simplifier
panproto_lens::symbolic performs term-rewriting simplification of lens specifications according to an identity base we have extended incrementally as cases arose. The set of rewrite rules currently in the simplifier is finite, terminating, and confluent on the cases the test suite covers. We do not have a proof that it is confluent on all well-formed inputs, and we do not have a characterisation of the lens specifications that reach a normal form consisting of the identity lens.
Progress would be either a proof of confluence on the full specification language or a counterexample that would motivate a reformulation.
Cross-protocol symmetric lenses at scale
Panproto’s symmetric lenses (Bidirectional lenses) handle cross-protocol translation for the cases the test suite covers. The underlying mathematical framework is that of Hofmann, Pierce, and Wagner, extended to the schema-dependent setting in the obvious way. We do not know of a published treatment of the extension, and the specific lens compositions panproto performs (a chain of symmetric lenses that collectively translate between three or more protocols) would benefit from a more thorough theoretical development.
Where the theory is ahead of the software
The Hofmann-Pierce-Wagner laws on arbitrary symmetric lenses
The symmetric-lens laws of Hofmann, Pierce, and Wagner are known and well-characterised. Panproto verifies them on the cases it handles, through the same property-based approach as for asymmetric lenses. What the panproto implementation does not yet do is symbolically verify the laws on user-constructed symmetric lens specifications, in the way that panproto_lens::symbolic verifies the asymmetric-lens laws on a restricted fragment. The mathematical tools exist; the implementation does not.
Functorial data migration with equations
Spivak’s functorial data migration framework, which Theory morphisms and instance migration cites, is presented primarily in the setting of Lawvere theories. Panproto generalises to GATs, which add dependent sorts. The two pushforward functors $\phi_!$ and $\phi_*$ behave as expected on the dependent-sort-free fragment. On the full GAT fragment, the behaviour of the right adjoint $\phi_*$ in particular becomes more subtle; the current implementation handles the cases we have encountered without a full-generality treatment.
A full generalisation of Spivak’s framework to GATs, possibly drawing on Fiore-Plotkin-Turi’s algebraic theories with binding, would close the gap. We suspect this is a tractable research question and would welcome a contribution.
Where prior work may exist and we do not know it
Schema-version merge as pushout
Panproto’s three-way merge (Merge as pushout) is the pushout of the span of schema morphisms from the most-recent-common-ancestor. The construction is categorically standard, and its application to version control is not original to panproto: the patch-theory tradition of Pijul and Darcs, and the category-theoretic merge work of Mimram, Di Giusto, and others, have treated similar constructions in adjacent settings.
What we do not know is whether the specific combination panproto assembles (pushout in the category of schemas under a protocol, with the pushout’s instances lifted through a computed migration) appears in the published literature under a different name. A pointer from a reader familiar with the version-control-as-category tradition would save us the alternative, which is to write up the construction ourselves and invite critique.
Tree-sitter grammars as GATs
The identification of a tree-sitter grammar’s node-types.json metadata with a GAT is, as far as we know, specific to panproto. The mathematical content is modest (the identification is almost mechanical), but the consequence (auto-derivation of protocols for 248 languages) is practically significant. We would welcome a comparison with any prior work on grammar-as-theory that treats the concrete-syntax-tree case rather than the abstract-syntax-tree case.
Operational open problems
These are operational limitations rather than theoretical gaps, and we hope to close them with engineering work.
Stored-config remote management
Panproto-vcs currently supports programmatic remote operations (push, fetch, clone through the git bridge) but does not yet store remote configuration in the repository itself the way git stores .git/config. A user who clones a panproto-vcs repository through the git bridge and wants to push back to a related remote must re-specify the remote URL on every push. The fix is mechanical but not yet implemented.
Streaming incremental merge
panproto_vcs::merge operates in batch mode: the merge algorithm reads both branches’ schemas in full before producing the pushout. For repositories with very large schemas (hundreds of thousands of sorts), this can exhaust memory. A streaming merge that produces pushout components incrementally is possible in principle and would resolve the memory issue.
Interactive conflict resolution UI
The colimit-based merge algorithm detects conflicts as failures of the pushout’s universal property. It reports them as structured diagnostics. It does not yet include an interactive resolution UI that would let a developer accept, reject, or modify each conflict individually. The diagnostics contain the information a UI would need; the UI itself is not written.
Full LLVM integration
The experimental panproto-llvm crate has the acceptance criteria for promotion named in Experimental and feature-gated subsystems. The work is in scaling the theory’s existence-checker to the full LLVM IR specification rather than the well-typed subset.
How to contribute
An open problem listed above is an invitation to contribute, and panproto accepts contributions on any of them. The Extending panproto chapter describes the mechanics; the panproto-contributing skill gives the development workflow. For research-flavoured contributions (a proof, a counterexample, a pointer to prior work), the main issue tracker is the right venue.