Theory morphisms and instance migration

Disclaimer. The content of this page is largely LM-generated. It was written as a stopgap to make the panproto system legible while we work through the book verifying and editing the content by hand. When a chapter has been verified or edited by a human, the parts that were verified or edited will be noted at the head of the chapter.

Every change of schema in a working system is a migration waiting to happen. Add a field and somebody has to decide what to do for the records that did not have it; rename a field and somebody has to decide how to reconcile the old name with the new; merge two schemas and somebody has to decide what the shared structure means. Doing this by hand, as most teams still do, is how production incidents begin.

The central claim of this chapter, due in its categorical form to Spivak (2012), is that every such change of schema is a pullback along a theory morphism, plus, when the change extends rather than restricts, a choice between two universal strategies for filling in what the source did not supply. The chapter unpacks the claim. Panproto’s panproto-mig crate is the implementation of what the claim prescribes, and the remainder of Part II shows what the implementation looks like stage by stage.

Theory morphisms

A theory morphism from a GAT $T_{1}$ to a GAT $T_{2}$ is a translation of the first theory’s vocabulary into the second’s that respects structure on both sides. Concretely, a theory morphism assigns each sort of $T_{1}$ to a sort of $T_{2}$; each operation of $T_{1}$ to an operation of $T_{2}$ whose signature matches after translation (the argument sorts and result sort of the $T_{1}$-operation, translated through the sort assignment, must match the signature of the $T_{2}$-operation chosen for it); and each equation of $T_{1}$ to a consequence of $T_{2}$’s equations under the translation.
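The first two conditions can be sketched as a small signature check. The types below (`Theory`, `OpSig`, `SigMorphism`) and the function `check_signature` are hypothetical names introduced for illustration, not the panproto-gat API; the sketch ignores dependent sorts and equations and treats `String` as an ordinary sort.

```rust
use std::collections::HashMap;

/// Signature of an operation: argument sorts and result sort, by name.
#[derive(Debug, Clone, PartialEq)]
struct OpSig {
    args: Vec<String>,
    result: String,
}

/// Toy theory (hypothetical type): sorts and operation signatures only.
/// Dependent sorts and equations are deliberately left out.
#[derive(Debug, Default)]
struct Theory {
    sorts: Vec<String>,
    ops: HashMap<String, OpSig>,
}

/// Candidate morphism: the image of each source sort and operation.
#[derive(Debug, Default)]
struct SigMorphism {
    sort_map: HashMap<String, String>,
    op_map: HashMap<String, String>,
}

fn translate_sort(f: &SigMorphism, s: &str) -> Result<String, String> {
    f.sort_map
        .get(s)
        .cloned()
        .ok_or_else(|| format!("sort {} has no image", s))
}

/// Every source sort needs an image among the target sorts, and every source
/// operation's translated signature must equal the signature of its image.
fn check_signature(src: &Theory, tgt: &Theory, f: &SigMorphism) -> Result<(), String> {
    for s in &src.sorts {
        let image = translate_sort(f, s)?;
        if !tgt.sorts.contains(&image) {
            return Err(format!("image {} of sort {} is not a target sort", image, s));
        }
    }
    for (name, sig) in &src.ops {
        let image = f
            .op_map
            .get(name)
            .ok_or_else(|| format!("operation {} has no image", name))?;
        let tgt_sig = tgt
            .ops
            .get(image)
            .ok_or_else(|| format!("{} is not a target operation", image))?;
        let mut args = Vec::new();
        for a in &sig.args {
            args.push(translate_sort(f, a)?);
        }
        let result = translate_sort(f, &sig.result)?;
        if args != tgt_sig.args || result != tgt_sig.result {
            return Err(format!("signature of {} does not match {}", name, image));
        }
    }
    Ok(())
}

/// A one-field record theory, its two-field extension, and the inclusion.
fn running_example() -> (Theory, Theory, SigMorphism) {
    let sig = OpSig { args: vec!["Person".to_string()], result: "String".to_string() };
    let sorts = vec!["Person".to_string(), "String".to_string()];
    let mut t1 = Theory { sorts: sorts.clone(), ops: HashMap::new() };
    t1.ops.insert("name".to_string(), sig.clone());
    let mut t2 = Theory { sorts: sorts.clone(), ops: HashMap::new() };
    t2.ops.insert("name".to_string(), sig.clone());
    t2.ops.insert("email".to_string(), sig);
    let mut f = SigMorphism::default();
    for s in &sorts {
        f.sort_map.insert(s.clone(), s.clone());
    }
    f.op_map.insert("name".to_string(), "name".to_string());
    (t1, t2, f)
}
```

Calling `check_signature` on the three values returned by `running_example` succeeds. Checking the same maps in the opposite direction fails, because `email` has no image.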

Equivalently, and perhaps more pleasantly, a theory morphism is a structure-preserving functor $Th (T_{1}) \to Th (T_{2})$ between the contextual categories the GATs generate. The functor laws reappear as the three conditions above, lifted to the dependent-sort setting; the contextual structure — how sorts and operations depend on free variables — must match across the translation as well.

Panproto represents a theory morphism by a Morphism value in panproto-gat. The type-checker verifies each of the three conditions. Every source sort’s image must exist in the target theory with the right dependencies; every source operation’s translated body must type-check in the target’s context; every source equation must be derivable from the target’s equations under the translation. The last of these is the deepest check and is the one that most often rejects a proposed morphism that the two theories do not actually support.

The full statement of the third condition is derivability: the translated equation must follow from the target’s equational theory, not merely appear verbatim in its list of axioms. The implementation presently enforces a conservative approximation: for every source equation lhs = rhs, the translated pair F(lhs) = F(rhs) is matched alpha-equivalently against the target’s declared equations. A morphism whose image is derivable via directed rewrites or via a chain of other equations, but is not already listed literally, is rejected under the current check. This is sound but incomplete; complete preservation via normalization or congruence closure against the target’s full equational theory is a queued follow-up, tracked against check_morphism. Readers designing morphisms between theories whose axioms are equivalent-but-not-literal should be aware of the stricter spelling the checker currently accepts.
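A minimal sketch of what such a literal, alpha-equivalence-based check could look like is given here. It is not the `check_morphism` implementation: the `Term` type, the helper names, and the untyped treatment of variables are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Untyped first-order terms (hypothetical representation).
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Var(String),
    App(String, Vec<Term>),
}

type Equation = (Term, Term);

fn var(n: &str) -> Term {
    Term::Var(n.to_string())
}

fn app(op: &str, args: Vec<Term>) -> Term {
    Term::App(op.to_string(), args)
}

/// Rename operation symbols through the morphism; variables are untouched.
fn translate(t: &Term, op_map: &HashMap<String, String>) -> Term {
    match t {
        Term::Var(v) => Term::Var(v.clone()),
        Term::App(op, args) => {
            let image = op_map.get(op).cloned().unwrap_or_else(|| op.clone());
            Term::App(image, args.iter().map(|a| translate(a, op_map)).collect())
        }
    }
}

/// Match two terms up to a consistent, injective renaming of variables.
fn alpha_match(
    a: &Term,
    b: &Term,
    fwd: &mut HashMap<String, String>,
    bwd: &mut HashMap<String, String>,
) -> bool {
    match (a, b) {
        (Term::Var(x), Term::Var(y)) => {
            let f = fwd.entry(x.clone()).or_insert_with(|| y.clone());
            let g = bwd.entry(y.clone()).or_insert_with(|| x.clone());
            *f == *y && *g == *x
        }
        (Term::App(f, xs), Term::App(g, ys)) => {
            f == g
                && xs.len() == ys.len()
                && xs.iter().zip(ys.iter()).all(|(x, y)| alpha_match(x, y, fwd, bwd))
        }
        _ => false,
    }
}

/// Conservative check: every translated source equation must appear, up to
/// alpha-renaming, among the target's declared equations. No rewriting or
/// chaining of equations is attempted, so derivable images can be rejected.
fn preserved_literally(
    src: &[Equation],
    tgt: &[Equation],
    op_map: &HashMap<String, String>,
) -> bool {
    src.iter().all(|(l, r)| {
        let (fl, fr) = (translate(l, op_map), translate(r, op_map));
        tgt.iter().any(|(tl, tr)| {
            let (mut fwd, mut bwd) = (HashMap::new(), HashMap::new());
            alpha_match(&fl, tl, &mut fwd, &mut bwd) && alpha_match(&fr, tr, &mut fwd, &mut bwd)
        })
    })
}
```

Under this sketch, a source axiom `combine(x, unit()) = x` sent to `plus` and `zero` is accepted when the target declares `plus(a, zero()) = a`. It is rejected when the target declares only a left-unit law and commutativity, even though the right-unit law follows from them. That is the sound-but-incomplete behaviour the paragraph describes.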

Two shapes that recur

Two kinds of theory morphism come up often enough to be worth naming. An inclusion $T_{1} ↪ T_{2}$ expresses that $T_{2}$ extends $T_{1}$ with new sorts, operations, or equations, with every symbol of $T_{1}$ mapping to itself in $T_{2}$ . Adding a new field to a record schema, adding a new table to a relational schema, tightening a constraint on an existing field — all of these are inclusions. A quotient $T_{1} ↠ T_{2}$ expresses that $T_{2}$ identifies some symbols of $T_{1}$ or imposes new equations on them; each symbol of $T_{1}$ maps to its equivalence class under the new identifications. Renaming two fields to the same name, or adding an equation that forces two operations to agree, are quotients.

Most real migrations are neither pure inclusions nor pure quotients but combinations: a morphism that adds a new field (an inclusion) while renaming an old one (a quotient), for instance. Panproto’s migration engine decomposes each migration into its inclusion and quotient components internally for the existence checker’s benefit, but the developer writes one Morphism value covering both.

The three migration functors

A theory morphism $f : T_{1} \to T_{2}$ does not just translate symbols; it induces three functors between the categories of models, each with a distinct operational meaning.

The three sit in an adjoint relationship:

$Δ_{f} : Mod (T_{2}) \to Mod (T_{1}), Σ_{f} ⊣ Δ_{f} ⊣ Π_{f} .$

In words: $Δ_{f}$ goes from $T_{2}$-models to $T_{1}$-models; its two adjoints $Σ_{f}$ and $Π_{f}$ go the other way; and each adjoint is pinned down up to unique isomorphism by the universal property of being left or right adjoint to $Δ_{f}$. The three functors take distinct operational shapes at the data level, and we take them one at a time.
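Spelled out as natural bijections of hom-sets, for a $T_{1}$-model $M$ and a $T_{2}$-model $N$ (the letter $N$ is introduced here only for this statement), the two adjunctions read:

$Hom_{Mod (T_{2})} (Σ_{f} M, N) ≅ Hom_{Mod (T_{1})} (M, Δ_{f} N), Hom_{Mod (T_{1})} (Δ_{f} N, M) ≅ Hom_{Mod (T_{2})} (N, Π_{f} M) .$

The first says that mapping out of $Σ_{f} M$ is the same as mapping $M$ into a pullback; the second says that mapping into $Π_{f} M$ is the same as mapping a pullback into $M$.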

The pullback functor $Δ_{f}$

The pullback $Δ_{f}$ is the simplest of the three, and the one panproto uses most. Given a $T_{2}$-model $M$, the pullback $Δ_{f} M$ is the $T_{1}$-model obtained by reading $M$ through $f$: a sort $s$ of $T_{1}$ is interpreted as $M$’s interpretation of $f (s)$, and an operation of $T_{1}$ is interpreted as $M$’s interpretation of its image. Concretely, if $f$ sends the sort $Person$ of $T_{1}$ to the sort $Contact$ of $T_{2}$, then $Δ_{f} M$’s interpretation of $Person$ is $M$’s interpretation of $Contact$. No data is created, and nothing in the image of $f$ is altered; whatever $M$ holds outside the image of $f$ is simply not read. The pullback is a relabelling.

Functoriality is immediate: a morphism $α : M \to M^{'}$ in $Mod (T_{2})$ induces a morphism $Δ_{f} α : Δ_{f} M \to Δ_{f} M^{'}$ with the same underlying assignment, read through $f$ . The pullback functor is implemented in panproto_mig::lift and is the cheapest of the three at runtime — no data-level computation beyond relabelling. When a migration reduces a richer schema to a smaller one (a forgetful migration), $Δ_{f}$ is the functor doing the work, and the next chapter calls this operation the restrict half of its pipeline.
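In symbols, and as a sketch that suppresses the dependent-sort bookkeeping, the relabelling and its behaviour under composition can be written as follows. Here $g : T_{2} \to T_{3}$ is a second theory morphism introduced only for illustration:

$(Δ_{f} M)(s) = M (f (s)), (Δ_{f} α)_{s} = α_{f (s)}, Δ_{g ∘ f} = Δ_{f} ∘ Δ_{g} .$

The last identity says that pulling back along a composite is the same as pulling back in stages, in the reverse order.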

The pushforward functors $Σ_{f}$ and $Π_{f}$

The two pushforwards go the other way: both are functors $Mod (T_{1}) \to Mod (T_{2})$. They differ in how they handle the new structure $T_{2}$ demands but $T_{1}$-models cannot supply.

$Σ_{f}$, the left adjoint, is obtained by freely adding whatever new structure the target theory asks for. If $T_{2}$ extends $T_{1}$ with a new operation, $Σ_{f} M$ has the new operation interpreted as a free choice at every element, with every possible value admitted; if $T_{2}$ extends $T_{1}$ with a new sort, $Σ_{f} M$ interprets that sort as the set of all possible values. Informally, $Σ_{f} M$ is the smallest $T_{2}$-model from which $M$ can be recovered by pullback; the precise statement is that $Σ_{f}$ is left adjoint to $Δ_{f}$.

$Π_{f}$, the right adjoint, is obtained by universal selection rather than free expansion. Given a $T_{1}$-model $M$, $Π_{f} M$ is a $T_{2}$-model whose elements are precisely those elements of $M$ that admit a unique extension compatible with the target theory. Where $Σ_{f}$ is maximally permissive, $Π_{f}$ is maximally restrictive: it includes only what is forced.

A reader may ask why two pushforwards are needed. The answer lies in the adjointness. $Σ_{f}$ answers the question “what is the smallest $T_{2}$-model that recovers $M$ by pullback?” and $Π_{f}$ answers “what is the largest $T_{2}$-model whose pullback equals $M$?” The two coincide only in trivial cases, and diverge whenever $T_{2}$’s new structure admits ambiguity. In practical terms, $Σ_{f}$ corresponds to “fill new fields with defaults” and $Π_{f}$ to “only keep rows whose new fields are fully determined”. Having both available is not a luxury: different migrations want different strategies at different sites, and real schema evolution routinely needs both.

This triple of functors is the framework of functorial data migration, developed in the relational setting by Spivak (2012), refined in Spivak & Wisnesky (2015), and worked out in executable form as the CQL system of Wisnesky (2013). Panproto adopts it essentially unchanged, with one generalisation: Spivak’s original work is stated for Lawvere theories, and panproto extends the same three functors to GATs. The extension is mathematically straightforward, because contextual categories admit the same adjoint structure as categories with finite products, but it is what lets panproto handle schema languages with dependent structure that Lawvere theories cannot express directly.

A worked example

The three functors are easier to read in an example than in the abstract. Take the running case from Part I.

Let $T_{1}$ be the theory of a one-field record: one sort $Person$, one operation $name : Person \to String$, no equations. Let $T_{2}$ extend $T_{1}$ with a second operation $email : Person \to String$. The theory morphism $f : T_{1} ↪ T_{2}$ sends $Person$ to $Person$ and $name$ to $name$; $email$ is new in $T_{2}$ and lies outside the image of $f$.

Start with a specific $T_{2}$-model $M$: a three-person address book with names and emails,

$\{ alice \mapsto ("Alice", "a@ex"), bob \mapsto ("Bob", "b@ex"), carol \mapsto ("Carol", "c@ex") \} .$

The pullback $Δ_{f} M$ is the same three people with only their names:

$\{ alice \mapsto "Alice", bob \mapsto "Bob", carol \mapsto "Carol" \} .$

The email column has been forgotten, because $T_{1}$ has no operation for it. That is what a pullback does.
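The same computation can be sketched in a few lines of Rust. The `Model` representation (one table per unary operation) and the `pullback` function are hypothetical, chosen to mirror the address book rather than panproto’s instance types.

```rust
use std::collections::{BTreeMap, HashMap};

/// Toy model (hypothetical representation): each unary operation is a table
/// from element ids to string values.
type Model = BTreeMap<String, BTreeMap<String, String>>;

/// Pull `target` back along `op_map` (source operation -> target operation).
/// Each source operation is interpreted as the table of its image;
/// target operations outside the image are simply never read.
fn pullback(op_map: &HashMap<String, String>, target: &Model) -> Model {
    let mut out = Model::new();
    for (src_op, tgt_op) in op_map {
        if let Some(table) = target.get(tgt_op) {
            out.insert(src_op.clone(), table.clone());
        }
    }
    out
}

/// The three-person address book from the text, as a model with two tables.
fn address_book() -> Model {
    let mut name = BTreeMap::new();
    let mut email = BTreeMap::new();
    let rows = [
        ("alice", "Alice", "a@ex"),
        ("bob", "Bob", "b@ex"),
        ("carol", "Carol", "c@ex"),
    ];
    for &(id, n, e) in rows.iter() {
        name.insert(id.to_string(), n.to_string());
        email.insert(id.to_string(), e.to_string());
    }
    let mut model = Model::new();
    model.insert("name".to_string(), name);
    model.insert("email".to_string(), email);
    model
}
```

Pulling `address_book()` back along a map that sends `name` to `name` yields a model with a single `name` table of three rows. The `email` table is absent because nothing in the source vocabulary points at it.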

Now go the other way. Start from a $T_{1}$ -model $M$ — the three-name address book without emails — and ask what a $T_{2}$ -model compatible with it should look like.

$Σ_{f} M$, the left adjoint, is the smallest such $T_{2}$-model. Because $T_{2}$ has a new operation $email$ that $M$ knows nothing about, $Σ_{f} M$ must supply some email for each person, and it does so freely: every possible email assignment is admitted. The population of $Σ_{f} M$ contains one entry for every pair (person of $M$, possible email string), which is a very large set.

This is almost never what a developer actually wants, and it is worth understanding why the mathematics gives us an answer a developer would reject in practice. The mathematics wants the universal answer, and the universal answer is to admit every possibility, because any commitment to a specific email would impose structure the source model does not justify. Panproto’s migration DSL therefore accepts a restricted form of $Σ_{f}$: the developer supplies a rule that picks a single email for each person (a default, a computed value, an empty string) and the engine compiles the rule into a $Σ_{f}$-style pushforward that uses the rule at every extension site. The underlying category-theoretic construction is $Σ_{f}$; the practical construction panproto calls “fill with default” is a restriction of it to a specific choice.
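A minimal sketch of the “fill with default” reading, assuming a hypothetical `lift_with_rule` function and plain tables in place of panproto’s migration DSL and instance types:

```rust
use std::collections::BTreeMap;

type Table = BTreeMap<String, String>;

/// A model of the one-field theory: names only.
#[derive(Debug, Clone, PartialEq)]
struct NamesOnly {
    name: Table,
}

/// A model of the two-field theory: names and emails over the same people.
#[derive(Debug, Clone, PartialEq)]
struct NamesAndEmails {
    name: Table,
    email: Table,
}

/// Hypothetical "fill with rule" lift: apply `rule(id, name)` at every person
/// to pick the single email the developer asked for.
fn lift_with_rule<F>(src: &NamesOnly, rule: F) -> NamesAndEmails
where
    F: Fn(&str, &str) -> String,
{
    let email: Table = src
        .name
        .iter()
        .map(|(id, n)| (id.clone(), rule(id.as_str(), n.as_str())))
        .collect();
    NamesAndEmails { name: src.name.clone(), email }
}

/// The pullback along the inclusion: forget the email table.
fn restrict(tgt: &NamesAndEmails) -> NamesOnly {
    NamesOnly { name: tgt.name.clone() }
}

/// The three-name address book without emails.
fn three_names() -> NamesOnly {
    let mut name = Table::new();
    for &(id, n) in [("alice", "Alice"), ("bob", "Bob"), ("carol", "Carol")].iter() {
        name.insert(id.to_string(), n.to_string());
    }
    NamesOnly { name }
}
```

Whatever rule is supplied (an empty string, a computed address), restricting the lifted model gives back the original names. That round trip is the property a rule-based lift has to keep.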

$Π_{f} M$, the right adjoint, is the largest $T_{2}$-model whose pullback equals $M$. For the schema as stated, it is empty: no person can be extended to a $T_{2}$-record unambiguously, because every email value is allowed. $Π_{f}$ becomes operationally useful when the target theory carries constraints that eliminate ambiguity. If $T_{2}$ imposes an equation saying every unset email is the value "unknown", then $Π_{f} M$ extends $M$ by inserting the default; if $T_{2}$ requires email values to match a pattern that determines them from the name, $Π_{f} M$ drops every person whose name does not already force a unique email.
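The operational reading given here (“keep only what is forced”) can be sketched as a filter. The function `keep_forced` and the constraint `pattern_from_name` are hypothetical. In particular, the rule that a name containing a space admits two spellings is an assumption made only so that the example has an ambiguous case.

```rust
use std::collections::BTreeMap;

type Table = BTreeMap<String, String>;

/// Hypothetical Pi-style lift: `allowed(id, name)` lists the email values the
/// target's constraints permit for one person. A person is kept only when
/// exactly one value is permitted, i.e. when the email is forced.
fn keep_forced<F>(names: &Table, allowed: F) -> (Table, Table)
where
    F: Fn(&str, &str) -> Vec<String>,
{
    let mut kept_names = Table::new();
    let mut emails = Table::new();
    for (id, n) in names {
        let mut candidates = allowed(id.as_str(), n.as_str());
        candidates.sort();
        candidates.dedup();
        if candidates.len() == 1 {
            kept_names.insert(id.clone(), n.clone());
            emails.insert(id.clone(), candidates.remove(0));
        }
    }
    (kept_names, emails)
}

/// Illustrative constraint (an assumption): the email is the lower-cased name
/// at "ex"; a name containing a space admits two spellings and is ambiguous.
fn pattern_from_name(_id: &str, name: &str) -> Vec<String> {
    let lower = name.to_lowercase();
    if lower.contains(' ') {
        vec![
            format!("{}@ex", lower.replace(' ', ".")),
            format!("{}@ex", lower.replace(' ', "")),
        ]
    } else {
        vec![format!("{}@ex", lower)]
    }
}
```

With this constraint a person named "Alice" is kept with the forced email, and a person named "Mary Ann" is dropped. If the constraint permits more than one value for everybody, the result is empty, matching the unconstrained case in the text.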

In panproto’s vocabulary, the pullback $Δ_{f}$ is the restrict stage of the next chapter, The restrict/lift pipeline, and the combination of $Σ_{f}$ and $Π_{f}$ at various sites of a migration is the lift. The engine does not supply naïve $Σ_{f}$ and $Π_{f}$; it supplies the restricted forms the developer asks for, under the declarations the migration DSL expresses.

Panproto’s packaging

A panproto migration is more than a theory morphism. It is a theory morphism together with a declaration of what to do at each extension site (which strategy among $Σ_{f}$ and $Π_{f}$, with what specific rule), expressed in the migration DSL.

The Rust representation is a Migration value from panproto-mig. Construction goes through the migration DSL, whose surface syntax belongs to Part III. Compilation — the translation from the symbolic declaration to a runtime lift function — goes through panproto_mig::compile. Execution is panproto_mig::lift, which applies a compiled migration to a specific source instance.

When the user does not already hold a migration and wants the engine to propose one, the panproto_mig::align module seeds the candidate pool with anchor correspondences from a family of pluggable strategies. The 0.37 release broadens the family. edge_label_anchors compares the label multisets of incident edges, which matches two vertices that share the same outgoing edge names even when their own names diverge. suffix_anchors keys on the terminal dotted segment of a namespaced identifier, which lines up app.bsky.feed.post#author with com.example.post#author without a hint. description_anchors runs token similarity over the human-readable description metadata, which is useful when the two schemas use synonymous names. neighborhood_anchors propagates anchors outward from an already-matched seed, so a single high-confidence correspondence spreads through adjacency. wl_anchors runs Weisfeiler-Leman structural refinement and matches vertices whose WL colors agree at a fixed iteration. The preservation requirement still applies: the engine validates every anchor against the kinds and constraints of both sides before it enters the candidate pool, and an anchor that violates the target theory’s equations is rejected before compilation begins.
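To make the suffix strategy concrete, here is a small sketch of the idea only. The functions `suffix_key` and `suffix_anchor_candidates` are hypothetical and do not reproduce the signatures or behaviour of the align module. In particular, the sketch does none of the validation against kinds, constraints, or equations that the engine performs.

```rust
use std::collections::HashMap;

/// Key for suffix matching: the text after the last '.' of an identifier,
/// or the whole identifier when it contains no dot.
fn suffix_key(id: &str) -> &str {
    match id.rfind('.') {
        Some(i) => &id[i + 1..],
        None => id,
    }
}

/// Propose (source, target) pairs whose suffix keys agree. A source is paired
/// only when exactly one target shares its key, so ambiguous keys yield nothing.
fn suffix_anchor_candidates<'a>(src: &[&'a str], tgt: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut by_key: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for &t in tgt {
        by_key.entry(suffix_key(t)).or_insert_with(Vec::new).push(t);
    }
    let mut out = Vec::new();
    for &s in src {
        if let Some(ts) = by_key.get(suffix_key(s)) {
            if ts.len() == 1 {
                out.push((s, ts[0]));
            }
        }
    }
    out
}
```

On the identifiers from the text, both `app.bsky.feed.post#author` and `com.example.post#author` reduce to the key `post#author`, so the pair is proposed as a candidate.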

Composition of migrations lives in panproto_mig::compose. The functor axioms from the chapter Functors and natural transformations reappear here as the crate’s compilation invariants: compiling the composite of two migrations produces the same runtime function as composing the compiled migrations separately; the identity migration on a schema compiles to the identity function on its instances. Both invariants are load-bearing. A migration engine that fails them produces different answers depending on how it chose to evaluate a migration chain, which is the worst kind of bug: present intermittently, hard to reproduce, and impossible to diagnose without understanding the mathematics the engine is supposed to be implementing.
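The two invariants can be stated as executable properties on a toy model. Everything below (`Step`, `Migration`, `compose`, `compile`, `identity`) is a hypothetical stand-in that shares only its names with the crate’s functions. In this toy, composition is list concatenation, so the invariants hold by construction; the point is the shape of the property a real engine has to satisfy.

```rust
use std::collections::BTreeMap;

type Record = BTreeMap<String, String>;

/// One symbolic step of a toy migration.
#[derive(Debug, Clone, PartialEq)]
enum Step {
    Rename { from: String, to: String },
    AddDefault { field: String, value: String },
    Remove { field: String },
}

/// A toy migration: a sequence of steps applied in order.
#[derive(Debug, Clone, PartialEq, Default)]
struct Migration {
    steps: Vec<Step>,
}

/// The identity migration: no steps.
fn identity() -> Migration {
    Migration::default()
}

/// Symbolic composition: run `first`, then `second`.
fn compose(first: &Migration, second: &Migration) -> Migration {
    let mut steps = first.steps.clone();
    steps.extend(second.steps.iter().cloned());
    Migration { steps }
}

/// Compile a symbolic migration into a runtime function on records.
fn compile(m: &Migration) -> Box<dyn Fn(Record) -> Record> {
    let steps = m.steps.clone();
    Box::new(move |mut rec: Record| {
        for step in &steps {
            match step {
                Step::Rename { from, to } => {
                    if let Some(v) = rec.remove(from) {
                        rec.insert(to.clone(), v);
                    }
                }
                Step::AddDefault { field, value } => {
                    rec.entry(field.clone()).or_insert_with(|| value.clone());
                }
                Step::Remove { field } => {
                    rec.remove(field);
                }
            }
        }
        rec
    })
}
```

The properties to test are that `compile(&compose(&a, &b))` agrees with applying `compile(&a)` and then `compile(&b)` on every record, and that `compile(&identity())` returns its input unchanged.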

Closing

The next chapter, The restrict/lift pipeline, takes a Migration value apart into its compilation stages: existence checking, restrict, lift, compose, invert. Each stage performs one operation on the data assembled above, and the decomposition is what lets the engine diagnose failures at the earliest point a migration can be seen to go wrong.
