Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Theory morphisms and instance migration

Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.

Every change of schema in a working system is a migration waiting to happen. Add a field and somebody has to decide what to do for the records that did not have it; rename a field and somebody has to decide how to reconcile the old name with the new; merge two schemas and somebody has to decide what the shared structure means. Doing this by hand, as most teams still do, is how production incidents begin.

The central claim of this chapter, due in its categorical form to Spivak (2012), is that every such change of schema is the pullback of a theory morphism — plus, when the change extends rather than restricts, a choice between two universal strategies for filling in what the source did not supply. The chapter unpacks the claim. Panproto’s panproto-mig crate is the implementation of what the claim prescribes, and the remainder of Part II shows what the implementation looks like stage by stage.

Theory morphisms

A theory morphism from a GAT to a GAT is a translation of the first theory’s vocabulary into the second’s that respects structure on both sides. Concretely, a theory morphism assigns each sort of to a sort of ; each operation of to an operation of with matching arity after translation (the argument sorts and result sort of the -operation, translated through the sort assignment, must match the signature of the -operation chosen for it); and each equation of to a consequence of ’s equations under the translation.

Equivalently, and perhaps more pleasantly, a theory morphism is a structure-preserving functor between the contextual categories the GATs generate. The functor laws reappear as the three conditions above, lifted to the dependent-sort setting; the contextual structure — how sorts and operations depend on free variables — must match across the translation as well.

Panproto represents a theory morphism by a Morphism value in panproto-gat. The type-checker verifies each of the three conditions. Every source sort’s image must exist in the target theory with the right dependencies; every source operation’s translated body must type-check in the target’s context; every source equation must be derivable from the target’s equations under the translation. The last of these is the deepest check and is the one that most often rejects a proposed morphism that the two theories do not actually support.

The full statement of the third condition is derivability: the translated equation must follow from the target’s equational theory, not merely appear verbatim in its list of axioms. The implementation presently enforces a conservative approximation: for every source equation lhs = rhs, the translated pair F(lhs) = F(rhs) is matched alpha-equivalently against the target’s declared equations. A morphism whose image is derivable via directed rewrites or via a chain of other equations, but is not already listed literally, is rejected under the current check. This is sound but incomplete; complete preservation via normalization or congruence closure against the target’s full equational theory is a queued follow-up, tracked against check_morphism. Readers designing morphisms between theories whose axioms are equivalent-but-not-literal should be aware of the stricter spelling the checker currently accepts.

Two shapes that recur

Two kinds of theory morphism come up often enough to be worth naming. An inclusion expresses that extends with new sorts, operations, or equations, with every symbol of mapping to itself in . Adding a new field to a record schema, adding a new table to a relational schema, tightening a constraint on an existing field — all of these are inclusions. A quotient expresses that identifies some symbols of or imposes new equations on them; each symbol of maps to its equivalence class under the new identifications. Renaming two fields to the same name, or adding an equation that forces two operations to agree, are quotients.

Most real migrations are neither pure inclusions nor pure quotients but combinations: a morphism that adds a new field (an inclusion) while renaming an old one (a quotient), for instance. Panproto’s migration engine decomposes each migration into its inclusion and quotient components internally for the existence checker’s benefit, but the developer writes one Morphism value covering both.

The three migration functors

A theory morphism does not just translate symbols; it induces three functors between the categories of models, each with a distinct operational meaning.

The three sit in an adjoint relationship:

In words: goes from -models to -models; its two adjoints and go the other way; and each adjoint is pinned down up to unique isomorphism by the universal property of being left or right adjoint to . The three functors take distinct operational shapes at the data level, and we take them one at a time.

The pullback functor

The pullback is the simplest of the three, and the one panproto uses most. Given a -model , the pullback is the -model obtained by reading through : a sort of is interpreted as ’s interpretation of , and an operation of is interpreted as ’s interpretation of its image. Concretely, if sends the sort of to the sort of , then ’s interpretation of is ’s interpretation of . No data is created, no data is thrown away; the pullback is a relabelling.

Functoriality is immediate: a morphism in induces a morphism with the same underlying assignment, read through . The pullback functor is implemented in panproto_mig::lift and is the cheapest of the three at runtime — no data-level computation beyond relabelling. When a migration reduces a richer schema to a smaller one (a forgetful migration), is the functor doing the work, and the next chapter calls this operation the restrict half of its pipeline.

The pushforward functors and

The two pushforwards go the other way: both are functors . They differ in how they handle the new structure demands but -models cannot supply.

, the left adjoint, is obtained by freely adding whatever new structure the target theory asks for. If extends with a new operation, has the new operation interpreted as a free choice at every element, with every possible value admitted; if extends with a new sort, interprets that sort as the set of all possible values. Formally, is the smallest -model from which can be recovered by pullback.

, the right adjoint, is obtained by universal selection rather than free expansion. Given a -model , is a -model whose elements are precisely those elements of that admit a unique extension compatible with the target theory. Where is maximally permissive, is maximally restrictive: it includes only what is forced.

A reader may ask why two pushforwards are needed. The answer lies in the adjointness. answers the question what is the smallest -model that recovers by pullback? and answers what is the largest -model whose pullback equals ? The two coincide only in trivial cases, and diverge whenever ’s new structure admits ambiguity. In practical terms, corresponds to “fill new fields with defaults” and to “only keep rows whose new fields are fully determined”. Having both available is not a luxury: different migrations want different strategies at different sites, and real schema evolution routinely needs both.

This triple of functors is the framework of functorial data migration, developed in the relational setting by Spivak (2012), refined in Spivak & Wisnesky (2015), and worked out in executable form as the CQL system of Wisnesky (2013). Panproto adopts it essentially unchanged, with one generalisation: Spivak’s original work is stated for Lawvere theories, and panproto extends the same three functors to GATs. The extension is mathematically straightforward, because contextual categories admit the same adjoint structure as categories with finite products, but it is what lets panproto handle schema languages with dependent structure that Lawvere theories cannot express directly.

A worked example

The three functors are easier to read in an example than in the abstract. Take the running case from Part I.

Let be the theory of a one-field record: one sort , one operation , no equations. Let extend with a second operation . The theory morphism sends to and to and declares to be new in .

Start with a specific -model : a three-person address book with names and emails,

The pullback is the same three people with only their names:

The email column has been forgotten, because has no operation for it. That is what a pullback does.

Now go the other way. Start from a -model — the three-name address book without emails — and ask what a -model compatible with it should look like.

, the left adjoint, is the smallest such -model. Because has a new operation that knows nothing about, must supply some email for each person, and it does so freely: every possible email assignment is admitted. The population of contains one entry for every pair (person of , possible email string), which is a very large set.

This is almost never what a developer actually wants, and it is worth understanding why the mathematics gives us an answer a developer would reject in practice. The mathematics wants the universal answer, and the universal answer is to admit every possibility, because any commitment to a specific email would impose structure the source model does not justify. Panproto’s migration DSL therefore accepts a restricted form of : the developer supplies a rule that picks a single email for each person — a default, a computed value, an empty string — and the engine compiles the rule into a -style pushforward that uses the rule at every extension site. The underlying category-theoretic construction is ; the practical construction panproto calls “fill with default” is a restriction of it to a specific choice.

, the right adjoint, is the largest -model whose pullback equals . For the schema as stated, it is empty: no person can be extended to a -record unambiguously, because every email value is allowed. becomes operationally useful when the target theory carries constraints that eliminate ambiguity. If imposes an equation saying every unset email is the value "unknown", then extends by inserting the default; if requires email values to match a pattern that determines them from the name, drops every person whose name does not already force a unique email.

In panproto’s vocabulary, the pullback is the restrict of The restrict/lift pipeline, and the combination of and at various sites of a migration is the lift. The engine does not supply naïve and ; it supplies the restricted forms the developer asks for, under the declarations the migration DSL expresses.

Panproto’s packaging

A panproto migration is more than a theory morphism. It is a theory morphism together with a declaration of what to do at each extension site — which strategy among and , with what specific rule — expressed in the migration DSL.

The Rust representation is a Migration value from panproto-mig. Construction goes through the migration DSL, whose surface syntax belongs to Part III. Compilation — the translation from the symbolic declaration to a runtime lift function — goes through panproto_mig::compile. Execution is panproto_mig::lift, which applies a compiled migration to a specific source instance.

When the user does not already hold a migration and wants the engine to propose one, the panproto-mig::align module seeds the candidate pool with anchor correspondences from a family of pluggable strategies. The 0.37 release broadens the family. edge_label_anchors compares the label multisets of incident edges, which catches two vertices that share the same outgoing edge names even when their own names diverge. suffix_anchors keys on the terminal dotted segment of a namespaced identifier, which lines up app.bsky.feed.post#author with com.example.post#author without a hint. description_anchors runs token similarity over the human-readable description metadata, useful when the two schemas have synonymous names. neighborhood_anchors propagates anchors outward from an already-matched seed, so a single high-confidence correspondence spreads through adjacency. wl_anchors runs Weisfeiler-Leman structural refinement and matches vertices whose WL colors agree at a fixed iteration. The existing preservation-note still applies: the engine validates every anchor against the kinds and constraints of both sides before it enters the candidate pool, and an anchor that violates the target theory’s equations is rejected before compilation begins.

Composition of migrations lives in panproto_mig::compose. The functor axioms of Functors and natural transformations reappear here as the crate’s compilation invariants: compiling the composite of two migrations produces the same runtime function as composing the compiled migrations separately; the identity migration on a schema compiles to the identity function on its instances. Both invariants are load-bearing. A migration engine that fails them is producing different answers depending on how it chose to evaluate a migration chain, which is the worst kind of bug — present intermittently, hard to reproduce, impossible to diagnose without understanding the mathematics the engine is supposed to be implementing.

Further reading

The foundational source for the triple in the database setting is Spivak (2012). Spivak & Wisnesky (2015) refines the construction for the relational case; Wisnesky (2013) is the thesis that works out the implementation in CQL, which is the closest existing system to panproto’s migration engine.

For the broader setting of adjoint functors, Mac Lane (1998) chapter IV is canonical, Awodey (2010) chapter 9 (“Adjoints”) gives the material at undergraduate register, and Riehl (2016) chapter 4 (“Adjunctions”) is the modern treatment.

For the relational-database antecedent, Codd (1970) is the founding paper, and Kleppmann (2017) chapter 2 (“Data Models and Query Languages”) gives the working-developer’s view of the same ideas without category theory. Reading the two together is a useful exercise in recognising how a single idea can be stated at very different levels of abstraction.

Closing

The next chapter, The restrict/lift pipeline, takes a Migration value apart into its compilation stages: existence checking, restrict, lift, compose, invert. Each stage performs one operation on the data assembled above, and the decomposition is what lets the engine diagnose failures at the earliest point a migration can be seen to go wrong.