28 Formal Foundations

This appendix collects the mathematical foundations underlying panproto: category theory, algebraic theories, functorial data migration, lenses, graph transformation, W-types, and correctness proofs. When reading the main tutorial chapters, you can consult individual sections here whenever you want the precise formalism behind a concept introduced informally.

We assume basic set theory and abstract algebra (groups, rings). Category theory is developed from scratch, though it rewards prior exposure. For each major concept, we note where it appears in panproto’s code and which tutorial chapter relies on it.

28.1 Categories, functors, and natural transformations

28.1.1 Categories

A category \(\mathcal{C}\) consists of four components:

A collection \(\mathrm{Ob}(\mathcal{C})\) of objects.
For each pair of objects \(A, B\), a set \(\mathrm{Hom}_{\mathcal{C}}(A, B)\) of morphisms (arrows) from \(A\) to \(B\).
For each object \(A\), an identity morphism \(\mathrm{id}_A \in \mathrm{Hom}(A, A)\).
A composition operation: given \(f: A \to B\) and \(g: B \to C\), we form \(g \circ f: A \to C\).

These satisfy two laws: associativity \((h \circ g) \circ f = h \circ (g \circ f)\) and unit laws \(f \circ \mathrm{id}_A = f\) and \(\mathrm{id}_B \circ f = f\).

In panproto. The category \(\mathbf{GAT}\) has generalized algebraic theories as objects and theory morphisms as arrows (Section 28.3). The category \(\mathbf{Mod}(T)\) has models of a theory \(T\) as objects and model homomorphisms as arrows. Schema mappings become morphisms in \(\mathbf{Mod}(T)\), which is where data migration begins (Chapter 5). Standard references: (Mac Lane 1998, Ch. 1).

28.1.2 Functors

A functor \(F: \mathcal{C} \to \mathcal{D}\) maps categories structurally. It assigns each object \(A \in \mathcal{C}\) to an object \(F(A) \in \mathcal{D}\) and each morphism \(f: A \to B\) to a morphism \(F(f): F(A) \to F(B)\), preserving composition and identities:

\[ F(g \circ f) = F(g) \circ F(f), \qquad F(\mathrm{id}_A) = \mathrm{id}_{F(A)}. \]

In panproto. A theory morphism \(F: T_1 \to T_2\) induces a pullback functor \(F^*: \mathbf{Mod}(T_2) \to \mathbf{Mod}(T_1)\), which becomes the restrict operation in panproto. This reverses the direction of the theory morphism and is the core of data migration (Chapter 9 and Section 9.4). Reference: (Mac Lane 1998, Ch. 1).

28.1.3 Natural transformations

When we have two functors \(F, G: \mathcal{C} \to \mathcal{D}\), a natural transformation \(\eta: F \Rightarrow G\) is a family of morphisms \(\eta_A: F(A) \to G(A)\) (one per object \(A \in \mathcal{C}\)) such that every morphism \(f: A \to B\) in \(\mathcal{C}\) makes this diagram commute:

\[ \begin{array}{ccc} F(A) & \xrightarrow{\eta_A} & G(A) \\ \downarrow{F(f)} & & \downarrow{G(f)} \\ F(B) & \xrightarrow{\eta_B} & G(B) \end{array} \]

That is, \(G(f) \circ \eta_A = \eta_B \circ F(f)\).

In panproto. Natural transformations between migration functors capture systematic ways to transform one migration strategy into another. The naturality condition ensures different migration paths yield compatible results—the mathematical content of the lens round-trip laws (Chapter 7 and Section 16.5). Reference: (Mac Lane 1998, Ch. 2).

28.1.4 Limits and colimits

A limit of a diagram \(D: \mathcal{J} \to \mathcal{C}\) is a universal cone over it: an object \(L\) with morphisms to each object in the diagram, satisfying a universal property. A colimit is the dual—a universal cocone.

Key special cases:

Products (limits of discrete diagrams): the universal object with projections to \(A\) and \(B\).
Equalizers (limits of parallel pairs): the universal subobject where two morphisms agree.
Pullbacks (limits of cospans \(A \to C \leftarrow B\)): the universal object mapping to both \(A\) and \(B\) over \(C\).
Coproducts (colimits of discrete diagrams): the disjoint union, universally.
Coequalizers (colimits of parallel pairs): the universal quotient identifying images of two morphisms.
Pushouts (colimits of spans \(A \leftarrow S \rightarrow B\)): the universal gluing of \(A\) and \(B\) along \(S\).

In panproto. New protocols are built via colimits of theories. The pushout of a span \(T_1 \leftarrow T_{\mathrm{shared}} \rightarrow T_2\) merges two theories along a shared base—panproto’s main mechanism for composing component theories into protocols (Chapter 4 and Chapter 9). References: (Mac Lane 1998, Ch. 5); (Adámek and Rosický 1994, Ch. 11).

28.1.5 Adjunctions

An adjunction \(F \dashv G\) between categories \(\mathcal{C}\) and \(\mathcal{D}\) consists of functors \(F: \mathcal{C} \to \mathcal{D}\) and \(G: \mathcal{D} \to \mathcal{C}\) with a natural isomorphism:

\[ \mathrm{Hom}_{\mathcal{D}}(F(A), B) \cong \mathrm{Hom}_{\mathcal{C}}(A, G(B)) \]

The left adjoint \(F\) freely constructs; the right adjoint \(G\) forgets structure.

In panproto. The migration functors form an adjoint triple \(\Sigma_F \dashv \Delta_F \dashv \Pi_F\). The restrict operation \(\Delta_F\) (pullback) sits in the middle, flanked by its left adjoint \(\Sigma_F\) (left push) and right adjoint \(\Pi_F\) (right push). These adjunctions guarantee the round-trip properties of lenses and well-behaved data migration (Section 28.4). References: (Spivak 2012); (Mac Lane 1998, Ch. 4).

28.1.6 Kan extensions

The left Kan extension of a functor \(F: \mathcal{C} \to \mathcal{E}\) along a functor \(K: \mathcal{C} \to \mathcal{D}\) is a functor \(\mathrm{Lan}_K F: \mathcal{D} \to \mathcal{E}\) together with a natural transformation \(\eta: F \Rightarrow (\mathrm{Lan}_K F) \circ K\) that is universal among such pairs. The right Kan extension \(\mathrm{Ran}_K F\) is dual. They generalize limits and colimits:

\[ \Sigma_F = \mathrm{Lan}_F, \qquad \Pi_F = \mathrm{Ran}_F. \]

In panproto. The left and right pushforward operations are computed as left and right Kan extensions along the theory morphism \(F\) (Section 28.4). References: (Mac Lane 1998, Ch. 10); (Spivak 2012).

28.1.7 Fibrations

A functor \(p: \mathcal{E} \to \mathcal{B}\) is a fibration (Grothendieck fibration) when: for every morphism \(f: B' \to B\) in \(\mathcal{B}\) and every object \(E\) in the fiber \(p^{-1}(B)\), there exists a cartesian lift—a morphism \(\bar{f}: E' \to E\) in \(\mathcal{E}\) with \(p(\bar{f}) = f\), universal among lifts.

In panproto. The projection from the total category of all schemas (across all protocols) down to the category of theories is a fibration. Lifting along it is what allows panproto to migrate schemas when the underlying protocol theory changes. The lens structure itself is induced by the cleavage of this fibration (Section 28.5). References: (Barr and Wells 1990, Ch. 12); (Johnson and Rosebrugh 2013).

28.2 Algebraic theories and Lawvere’s framework

Lawvere (1963) introduced a revolutionary idea: algebraic theories—like monoids, groups, rings—can be understood as categories with finite products, and models are product-preserving functors into sets.

28.2.1 Lawvere theories

A Lawvere theory is a category \(\mathcal{L}\) with a distinguished object \(1\) such that every object is isomorphic to a finite power \(1^n\) for some \(n \in \mathbb{N}\). A model of \(\mathcal{L}\) in the category of sets is a product-preserving functor:

\[ M: \mathcal{L} \to \mathbf{Set}. \]

Models form their own category: \(\mathbf{Mod}(\mathcal{L}) = \mathrm{Fun}^{\times}(\mathcal{L}, \mathbf{Set})\), with natural transformations as morphisms.

Consider the Lawvere theory of monoids \(\mathcal{L}_{\mathrm{Mon}}\):

Objects: \(1^0 = 1, \; 1^1, \; 1^2, \; 1^3, \ldots\)
Morphisms: generated by \(\mu: 1^2 \to 1^1\) (multiplication) and \(\eta: 1^0 \to 1^1\) (unit), subject to associativity and unit equations.

A product-preserving functor \(M: \mathcal{L}_{\mathrm{Mon}} \to \mathbf{Set}\) sends \(1^1\) to a set \(S\), \(\mu\) to a binary operation \(S \times S \to S\), and \(\eta\) to a distinguished element of \(S\)—precisely the data of a monoid.

28.2.2 The functorial insight

Lawvere’s key insight: “satisfying axioms” becomes “preserving structure.” Instead of checking axioms against a candidate algebra, we define a model as a functor that preserves categorical structure (finite products). This perspective achieves three things:

The collection of all models becomes a category with well-defined morphisms.
New models arise from old ones via categorical operations (limits, colimits, Kan extensions).
A precise notion of theory morphism \(F: \mathcal{L}_1 \to \mathcal{L}_2\) emerges—a product-preserving functor inducing a model reduct functor \(F^*: \mathbf{Mod}(\mathcal{L}_2) \to \mathbf{Mod}(\mathcal{L}_1)\).

In panproto. The theory-model-morphism architecture comes directly from this work. panproto’s theories are generalized algebraic theories (dependent-type extension of Lawvere theories), but the core insight—theories are objects, models are functors, theory morphisms induce migration functors—originates with Lawvere (1963). See Section 28.3 for how this unfolds in practice.

28.2.3 Limitations of ordinary algebraic theories

Lawvere theories require all sorts to be independent: the sorts \(\mathrm{Vertex}\) and \(\mathrm{Edge}\) in graph theory do not depend on each other. This works for many algebraic structures but cannot express type-indexed families like “the sort of constraints applicable to vertex \(v\)”—essential for capturing schema language structure. This limitation motivates the generalization to GATs.

28.3 Generalized algebraic theories

28.3.1 Cartmell’s Framework

Cartmell (1986) extended Lawvere to handle dependent sorts: sorts that depend on terms of other sorts. The resulting Generalized Algebraic Theories (GATs) form panproto’s mathematical foundation.

A GAT consists of three parts:

Sorts, which may depend on previously declared sorts. A sort declaration: \[ S(x_1 : A_1, \; x_2 : A_2(x_1), \; \ldots, \; x_n : A_n(x_1, \ldots, x_{n-1})) \] Each \(A_i\) is a previously declared sort; parameters \(x_1, \ldots, x_{i-1}\) are terms of appropriate sorts. Non-dependent sorts have empty parameter lists.
Operations, typed in a dependent context: \[ f(x_1 : A_1, \ldots, x_n : A_n(\ldots)) : B(t_1, \ldots, t_m) \] The output sort \(B\) may depend on input terms.
Equations between terms of the same sort, universally quantified over appropriately typed variables.

The semantic counterpart is the contextual category—also introduced by Cartmell (1986). He established that the theory of contextual categories is itself a contextual category, the self-description property that panproto exploits for its meta-level architecture.

Example: Constrained schemas. A schema language with type-indexed constraints might have:

\(\mathrm{TypeNode}\) — an ordinary sort (no dependencies)
\(\mathrm{PropEdge}\) — an ordinary sort
\(\mathrm{Constraint}(v : \mathrm{TypeNode})\) — a dependent sort (indexed by vertices)
\(\mathrm{src}(p : \mathrm{PropEdge}) : \mathrm{TypeNode}\)
\(\mathrm{maxLength}(c : \mathrm{Constraint}(\mathrm{stringNode})) : \mathrm{Nat}\)

The dependent sort \(\mathrm{Constraint}(v : \mathrm{TypeNode})\) is a family indexed by vertices. String nodes admit \(\mathrm{maxLength}\) and \(\mathrm{pattern}\); integer nodes admit \(\mathrm{minimum}\) and \(\mathrm{maximum}\). This is the expressiveness panproto needs.

28.3.2 The GAT-EAT Equivalence

GATs are expressive but implementing a full dependent type checker is substantial engineering work. panproto sidesteps this using a classical categorical-logic result.

An Essentially Algebraic Theory (EAT) is like an ordinary algebraic theory except operations may be partial: defined only when equational conditions hold. Category theory itself is an EAT: composition \(g \circ f\) is defined only when \(\mathrm{cod}(f) = \mathrm{dom}(g)\).

Theorem (Adámek–Rosický). For a GAT \(T\) whose dependent sorts satisfy a finiteness condition (finitely presentable contexts), the category \(\mathbf{Mod}(T)\) of models is equivalent to the category of models of an EAT \(T'\). Both categories are locally finitely presentable.

The key reference (Adámek and Rosický 1994) establishes locally presentable categories. In this framework, model categories of GATs and EATs coincide for a broad class of theories—precisely those arising in schema descriptions. Practically: panproto implements dependent sorts using partial operations (only defined when a condition holds) rather than a full dependent type system.

For the \(\mathrm{Constraint}(t : \mathrm{TypeNode})\) example, the EAT version has:

An ordinary sort \(\mathrm{Constraint}\) (not dependent)
A total operation \(\mathrm{appliesTo}(c : \mathrm{Constraint}) : \mathrm{TypeNode}\)
A partial operation \(\mathrm{maxLength}(c : \mathrm{Constraint}) : \mathrm{Nat}\), defined when \(\mathrm{appliesTo}(c) = \mathrm{stringNode}\)

The partiality encodes the dependency. Mathematically, identical model categories. Computationally, much simpler—you need partial functions, not dependent types.

In panproto. This architectural choice comes from GATlab (Lynch et al. 2024), a Julia framework for computational category theory. GATlab demonstrated the GAT-EAT equivalence is practical: real software can exploit it to get dependent-sort expressiveness without dependent type-checking complexity. panproto adapts this to Rust, gaining static typing, zero-cost abstractions, and Wasm compilation while retaining the mathematical foundations (Section 28.3).

28.3.3 Locally finitely presentable categories

The locally presentable framework (Adámek and Rosický 1994) provides theoretical guarantees that panproto’s constructions are sound:

Colimits exist. Every diagram of GAT presentations has a colimit, guaranteeing theory composition via pushout always succeeds (Chapter 4).
Initial algebras exist. For endofunctors satisfying a boundedness condition, initial algebras exist, guaranteeing W-type constructions terminate (Section 28.7).
Free models exist. Every locally finitely presentable category has free objects, used in panproto’s model construction algorithms.

These are not formalities. Without locally presentable structure, no guarantee exists that computing a pushout of theories produces a valid theory, or that recursive data types have well-defined semantics. The framework provides these guarantees uniformly.

28.4 Functorial data migration

28.4.1 Spivak’s Framework

Spivak (2012) developed the functorial approach to data migration. The key observation: a functor \(F: \mathcal{C} \to \mathcal{D}\) between database schemas (viewed as categories) induces three data migration functors forming an adjoint triple:

\[ \Sigma_F \dashv \Delta_F \dashv \Pi_F \]

where:

\(\Delta_F: \mathbf{Inst}(\mathcal{D}) \to \mathbf{Inst}(\mathcal{C})\) is the pullback (restriction/reindexing). Given an instance \(I: \mathcal{D} \to \mathbf{Set}\), the restricted instance is \(\Delta_F(I) = I \circ F: \mathcal{C} \to \mathbf{Set}\).
\(\Sigma_F: \mathbf{Inst}(\mathcal{C}) \to \mathbf{Inst}(\mathcal{D})\) is the left pushforward (left Kan extension along \(F\)), computing \((\Sigma_F J)(d) = \mathrm{colim}_{(c, f: F(c) \to d)} J(c)\).
\(\Pi_F: \mathbf{Inst}(\mathcal{C}) \to \mathbf{Inst}(\mathcal{D})\) is the right pushforward (right Kan extension along \(F\)), computing \((\Pi_F J)(d) = \lim_{(c, f: d \to F(c))} J(c)\).

The adjunction \(\Sigma_F \dashv \Delta_F\) means left pushforward and pullback satisfy a universal property: for every instance \(J\) over \(\mathcal{C}\) and \(I\) over \(\mathcal{D}\),

\[ \mathrm{Hom}(\Sigma_F J, I) \cong \mathrm{Hom}(J, \Delta_F I). \]

This is the abstract version of the round-trip laws that lenses satisfy.

In panproto. restrict is \(\Delta_F\) (Chapter 9 and Section 9.4). The lift operation computes \(\Sigma_F\) or \(\Pi_F\) depending on context (Section 28.4). The adjoint triple is panproto’s migration engine.

28.4.2 Schemas as categories, instances as functors

Spivak’s framework treats:

A schema as a finitely presented category \(\mathcal{C}\) (objects are tables/types, morphisms are foreign keys/relationships, path equations encode constraints).
An instance of schema \(\mathcal{C}\) as a functor \(I: \mathcal{C} \to \mathbf{Set}\) (each table maps to its row set, each foreign key maps to the reference-following function).
A schema mapping as a functor \(F: \mathcal{C} \to \mathcal{D}\) (tables map to tables, structure is preserved).

panproto works at higher abstraction: instead of fixing schemas as categories, we parameterize over the theory that schemas instantiate. This handles schema languages not naturally categorical—like Protobuf, where field numbers are essential structural data plain categories do not capture (Section 28.3 and Chapter 4).

28.4.3 Algebraic databases

Schultz et al. (2017) extended functorial data migration to algebraic databases, where schemas carry equations, not just graph structure. The CQL system is a direct ancestor of panproto’s design, introducing the “typeside” concept—a fixed base theory analogous to panproto’s Level 0—but not achieving full self-description. panproto generalizes by replacing single-sorted algebraic theories with GATs, enabling dependent sorts and self-description.

28.5 Lenses and bidirectional transformations

28.5.1 Asymmetric lenses

Foster et al. (2007) introduced lenses as function pairs satisfying round-trip laws. An asymmetric lens between source \(S\) and view \(V\) consists of:

\[ \mathrm{get}: S \to V, \qquad \mathrm{put}: S \times V \to S \]

subject to two well-behavedness laws:

\[ \mathrm{get}(\mathrm{put}(s, v)) = v \qquad \text{(PutGet)} \] \[ \mathrm{put}(s, \mathrm{get}(s)) = s \qquad \text{(GetPut)} \]

PutGet: putting a view back and getting it returns the same view. GetPut: putting back what you got is a no-op.

In panproto. panproto’s lens combinators (renameField, addField, wrapInObject, etc.) follow the combinator pattern from Foster et al. (2007) (Section 7.4).

28.5.2 Symmetric lenses

Hofmann et al. (2011) generalized asymmetric lenses to symmetric lenses, treating both sides equally. A symmetric lens between \(S_1\) and \(S_2\) is a span:

\[ S_1 \xleftarrow{\mathrm{get}_1} C \xrightarrow{\mathrm{get}_2} S_2 \]

where \(C\) is the complement—information one side has but the other doesn’t. As synchronization proceeds, the complement updates, and round-trip laws reformulate in complement-passing terms:

\[ \mathrm{put}_R: S_1 \times C \to S_2 \times C, \qquad \mathrm{put}_L: S_2 \times C \to S_1 \times C \]

These satisfy coherence conditions ensuring round-tripping through both directions is the identity (up to complement).

In panproto. panproto’s complement-based get/put API follows the symmetric lens framework (Chapter 18 and Section 7.2). The complement stores information the forward migration discards, enabling exact round-trips.

28.5.3 Delta lenses

Johnson and Rosebrugh (2013) connected lenses to fibrations, showing well-behaved lenses correspond to certain cleavages of Grothendieck fibrations. A delta lens from \(\mathcal{C}\) to \(\mathcal{D}\) is a functor \(F: \mathcal{C} \to \mathcal{D}\) equipped with a chosen lifting operation for morphisms in \(\mathcal{D}\), satisfying lens-like conditions.

The fibration perspective shows panproto’s lens decomposition is canonical, not heuristic. Migration functors factor through a fibration, and lens structure is induced by the cleavage.

In panproto. Delta lenses appear in migrations that are not fully invertible—cases where the complement carries essential information (Chapter 7).

28.5.4 Internal lenses

Clarke (2020) introduced delta lenses in an internal setting, characterizing them as functors and cofunctors. This provides a more general framework than the original get/put formulation, applicable in any category with sufficient structure.

In panproto. Clarke’s framework is relevant to panproto’s protolens architecture, where lenses are parameterized over schemas rather than operating on fixed types (Chapter 16).

28.5.5 Lens categories

Lenses organize into categories. Objects are schemas (or types), morphisms are lenses between them. Composition of lenses is well-defined when round-trip laws hold. Different lens classes yield different categories:

Asymmetric lenses form a category with composition as functional composition of get/put pairs.
Symmetric lenses form a category using the complement-passing structure.
Delta lenses form a subcategory of the arrow category of \(\mathbf{Cat}\).

The relationship between lens categories and the Grothendieck construction is developed in (Johnson and Rosebrugh 2013) and (Clarke 2020). See Section 28.5 for panproto’s use of this categorical structure.

28.5.6 Sequoia and comonadic lenses

Litt et al. (2022) proposed a comonadic foundation for bidirectional transformations, addressing limitations of the classical lens framework for recursive data structures. panproto’s internal lens representation draws on ideas from this work, particularly for handling recursive data where classical get/put laws need careful reformulation (Chapter 16).

28.6 Graph and hypergraph transformation

28.6.1 The algebraic approach

Ehrig et al. (1973) originated the algebraic approach to graph transformation, defining rewriting via pushouts in the category of graphs. The same categorical construction that composes panproto theories—the pushout—also governs schema modifications.

A graph transformation rule is a span of graph morphisms \(L \xleftarrow{l} K \xrightarrow{r} R\): \(L\) is the left-hand side (pattern to match), \(R\) is the right-hand side (replacement), and \(K\) is the interface (what is preserved). Applying the rule to a host graph \(G\) at a match \(m: L \to G\) proceeds via the double-pushout (DPO) construction:

\[ \begin{array}{ccccc} L & \xleftarrow{l} & K & \xrightarrow{r} & R \\ \downarrow{m} & & \downarrow{} & & \downarrow{} \\ G & \xleftarrow{} & D & \xrightarrow{} & H \end{array} \]

where \(D\) is the pushout complement (the host graph with the matched pattern removed) and \(H\) is the result (the pushout of \(D\) and \(R\) along \(K\)).

In panproto. DPO is the mathematical basis for panproto’s schema diff and patch operations. Adding a field, removing a vertex, or restructuring a type hierarchy are all graph transformation steps (Section 8.1 and Section 28.6).

28.6.2 Adhesive categories

Lack and Sobociński (2005) identified adhesive categories as the abstract setting where pushout-based rewriting works well. A category is adhesive if pushouts along monomorphisms exist and satisfy exactness conditions (Van Kampen squares). The key consequence: in adhesive categories, DPO rewriting is well-behaved—pushout complements are unique when they exist, and rewriting steps compose correctly.

In panproto. The category of schemas-as-graphs is adhesive, ensuring panproto’s DPO-based diff/patch is well-behaved. Pushout complement uniqueness guarantees schema diffs are deterministic (Section 28.6).

28.6.3 Single-pushout rewriting

Löwe (1993) introduced the single-pushout (SPO) approach, handling partial morphisms and simpler than DPO for certain operations. In SPO rewriting, applying a rule \(L \to R\) at a match \(m: L \to G\) computes a single pushout in the category of graphs with partial morphisms.

In panproto. SPO rewriting applies in migration scenarios involving deletions (e.g., removing a field), where DPO would require computing a pushout complement that may not exist.

28.6.4 Hypergraph rewriting

Drewes et al. (1997) extended algebraic graph rewriting to hypergraphs, where edges connect arbitrary numbers of vertices. This matters for schema languages with multi-way relationships (e.g., SQL foreign keys referencing composite keys).

In panproto. panproto’s ThHypergraph theory models schemas with multi-way relationships. Hypergraph rewriting governs migrations within such schemas (Section 28.6). Comprehensive reference: (Ehrig et al. 2006).

28.7 W-types and polynomial functors

28.7.1 Polynomial functors

A polynomial functor on \(\mathbf{Set}\) is an endofunctor of the form:

\[ P(X) = \sum_{a \in A} X^{B(a)} \]

where \(A\) is a set of “shapes” (node types) and \(B: A \to \mathbf{Set}\) assigns to each shape a set of “directions” (arities). Polynomial functors classify branching structures: each element of \(A\) is a node type, and \(B(a)\) is the set of children a node of type \(a\) has.

Example. Binary trees with leaf data from set \(D\):

\[ P(X) = D + X^2 \]

Here \(A = D \sqcup \{*\}\) (leaves labeled by \(D\), plus one branch node type), and \(B(d) = \emptyset\) for leaves, \(B(*) = \{L, R\}\) for branches.

28.7.2 W-types

A W-type (wellfounded tree type) is the initial algebra of a polynomial functor. Given \(P(X) = \sum_{a \in A} X^{B(a)}\), the W-type \(W_P\) is the smallest set satisfying:

\[ W_P \cong P(W_P) = \sum_{a \in A} W_P^{B(a)} \]

Elements of \(W_P\) are wellfounded trees: each node is labeled by some \(a \in A\) and has exactly \(|B(a)|\) children, each itself a wellfounded tree.

The initial algebra’s universal property gives a recursion principle: for any other \(P\)-algebra \(\alpha: P(X) \to X\), there is a unique homomorphism \(\mathrm{fold}_\alpha: W_P \to X\).

In panproto. Recursive data structures (trees, nested records, lists) are modeled as W-types. A schema determines a polynomial functor whose initial algebra is the type of well-founded data conforming to that schema. Migrations between schemas lift to maps between polynomial functors, and the recursion principle provides the migration function (Section 4.2.2 and Chapter 9).

28.7.3 Dependent polynomial functors

Gambino and Hyland (2004) extended polynomial functors and W-types to the dependent setting, working in the internal language of a locally cartesian closed category. A dependent polynomial functor is determined by a diagram:

\[ A \xleftarrow{s} B \xrightarrow{p} C \xrightarrow{t} D \]

The W-type of a dependent polynomial functor is the initial algebra in the slice category over \(D\).

In panproto. panproto’s instance trees are W-types of dependent polynomial functors: the “shape” depends on which schema vertex a node anchors to, and the “arity” (set of children) depends on outgoing edges from that vertex in the schema. This is the mathematical content of the instance tree representation (Section 4.2.2). References: (Gambino and Hyland 2004); (Abbott et al. 2005).

28.7.4 Containers

Abbott et al. (2005) introduced containers as a presentation of strictly positive type constructors. A container \((A, B)\) consists of a set of shapes \(A\) and a family of positions \(B: A \to \mathbf{Set}\). The extension of a container is the polynomial functor \(\sum_{a \in A} X^{B(a)}\).

Containers make the correspondence between data types and polynomial functors constructive and algorithmic. Morphisms of containers (pairs of maps on shapes and positions) correspond exactly to natural transformations between associated polynomial functors.

In panproto. panproto’s internal representation of schema-as-polynomial-functor draws on the container perspective, making it possible to compute migrations of recursive data algorithmically: a migration is a container morphism, and container morphisms are pairs of functions computed from theory morphisms (Chapter 9).

28.8 Correctness proofs

panproto applies several optimizations to the restrict pipeline and supporting data structures. Each optimization must preserve the input–output semantics of the unoptimized version exactly. This section states the correctness theorems and provides proofs.

We write \(\mathcal{T} = (V, E, r, \mathrm{anchor}, \mathrm{parent})\) for an instance tree with vertex set \(V\), edge set \(E\), root \(r\), anchor function \(\mathrm{anchor}: V \to \mathrm{Vert}(S)\), and parent function \(\mathrm{parent}: V \setminus \{r\} \to V\). We write \(m = (\Sigma, \rho)\) for a compiled migration with surviving set \(\Sigma \subseteq \mathrm{Vert}(S)\) and resolver \(\rho\).

28.8.1 Index cache correctness

The Theory struct maintains FxHashMap indices from names to vector positions, built once at construction time. The unoptimized version performs a linear scan.

Implementation

Theory::new() and Theory::extending() in crates/panproto-gat/src/theory.rs construct the indices sort_idx, op_idx, eq_idx.

Theorem 1 (Index Cache Correctness). Let \(T\) be a theory with sort vector \(\mathbf{s} = (s_0, \ldots, s_{k-1})\), let \(\mathcal{I}: \mathrm{Name} \rightharpoonup \{0, \ldots, k-1\}\) be the index map, and let \(n\) be a name. Then:

\[ \texttt{find\_sort}_{\mathrm{idx}}(n) = \texttt{find\_sort}_{\mathrm{lin}}(n) \]

Proof. We proceed by cases on whether \(n\) appears in \(\mathbf{s}\).

Case 1: \(\exists!\, j.\; s_j.\mathrm{name} = n\). The Theory constructor enforces name uniqueness: if two sorts share a name, construction fails. Therefore the existential is unique. The linear scan returns \(s_j\) where \(j\) is the unique index with \(s_j.\mathrm{name} = n\). The index map \(\mathcal{I}\) is built by \(\mathcal{I}(s_i.\mathrm{name}) \mapsto i\) for each \(i \in \{0, \ldots, k-1\}\), so \(\mathcal{I}(n) = j\). The indexed lookup returns \(s_{\mathcal{I}(n)} = s_j\). Both agree.

Case 2: \(\nexists\, j.\; s_j.\mathrm{name} = n\). The linear scan returns None. Since no sort has name \(n\), the construction loop never inserted \(n\) into \(\mathcal{I}\), so \(n \notin \mathrm{dom}(\mathcal{I})\) and the indexed lookup also returns None.

Invariant preservation. Theory is immutable after construction: the sorts vector and the index map \(\mathcal{I}\) are never modified. Therefore the bijection between names and positions holds for the lifetime of the object. The Deserialize implementation reconstructs the index from the vector, preserving the invariant across serialization boundaries. \(\square\)

28.8.2 Path compression correctness

Step 3 of the restrict pipeline (Section 9.2.3) computes, for each surviving node, its nearest surviving ancestor. The optimized version caches results along the parent chain so subsequent queries complete in \(O(1)\).

Definition. For a node \(n \in R \setminus \{r\}\) (where \(R\) is the reachable surviving set), the nearest surviving ancestor is:

\[ \alpha(n) = \mathrm{parent}^{j}(n) \quad\text{where}\quad j = \min\{k \geq 1 \mid \mathrm{parent}^{k}(n) \in R\} \]

The minimum exists because \(r \in R\) and \(\mathrm{parent}^{k}(n) = r\) for sufficiently large \(k\) (trees are finite and acyclic).

Theorem 2 (Path Compression Correctness). The path-compressed function \(\alpha_c\) satisfies \(\alpha_c(n) = \alpha(n)\) for all \(n \in R \setminus \{r\}\).

Proof. The path-compressed algorithm maintains a cache \(C: V \rightharpoonup R\) and processes nodes in \(R\) in arbitrary order. For each query \(\alpha_c(n)\):

If \(n \in \mathrm{dom}(C)\), return \(C(n)\).
Otherwise, walk the chain \(n, p_1, p_2, \ldots\) where \(p_i = \mathrm{parent}^{i}(n)\), collecting visited nodes in a path list \(P\), until encountering either:
- 1. a node \(p_j \in \mathrm{dom}(C)\): set \(a \leftarrow C(p_j)\), or
- 1. a node \(p_j \in R\): set \(a \leftarrow p_j\).
For every node \(q \in P\), set \(C(q) \leftarrow a\).
Return \(a\).

We prove correctness by establishing the cache invariant \(\mathcal{C}\):

\[ \mathcal{C}: \quad \forall\, v \in \mathrm{dom}(C).\; C(v) = \alpha(v) \]

Base case. Before any queries, \(C = \emptyset\) and \(\mathcal{C}\) holds vacuously.

Inductive step. Assume \(\mathcal{C}\) holds before query \(\alpha_c(n)\). We show it holds after.

Step 1 (cache hit): If \(n \in \mathrm{dom}(C)\), the algorithm returns \(C(n) = \alpha(n)\) by \(\mathcal{C}\). No cache mutations occur, so \(\mathcal{C}\) is preserved.
Step 2a (walk hits cached node): The walk reaches \(p_j \in \mathrm{dom}(C)\) without encountering any node in \(R\) among \(p_1, \ldots, p_{j-1}\). By \(\mathcal{C}\), \(C(p_j) = \alpha(p_j)\). Since no \(p_i\) for \(i < j\) is in \(R\), the parent chain from \(n\) to \(\alpha(n)\) must pass through \(p_j\) without encountering a surviving node before \(p_j\). The nearest surviving ancestor above \(p_j\) is \(\alpha(p_j)\). But because \(p_j\) might itself not be in \(R\), we have \(\alpha(n) = \alpha(p_j) = C(p_j) = a\).

For intermediate nodes \(q \in P\): each \(q = p_i\) for some \(i < j\), and by the same argument \(\alpha(q) = \alpha(p_j) = a\). Setting \(C(q) \leftarrow a\) preserves \(\mathcal{C}\).
Step 2b (walk hits surviving node): The walk reaches \(p_j \in R\) without encountering any element of \(R\) among \(p_1, \ldots, p_{j-1}\). By definition of \(\alpha\), \(\alpha(n) = p_j = a\) (the first surviving node strictly above \(n\)). For intermediate nodes \(q = p_i\) with \(i < j\): since no \(p_i\) is in \(R\), the first surviving ancestor of \(q\) is also \(p_j\), so \(\alpha(q) = p_j = a\). Setting \(C(q) \leftarrow a\) preserves \(\mathcal{C}\).

In all cases, \(\mathcal{C}\) is preserved and the returned value equals \(\alpha(n)\).

Immutability assumption. The proof relies on both \(\mathrm{parent}\) and \(R\) being immutable throughout the ancestor contraction phase. This is guaranteed by the pipeline architecture: the surviving set \(R\) is computed in steps 1–2 before ancestor contraction begins, and the instance tree’s parent function is never modified (Tarjan 1975). \(\square\)

Complexity. Each node in \(V\) has its cache entry written at most once (in step 3). Each walk traverses only uncached nodes. Therefore the total work across all queries is \(O(|V|)\), regardless of query order.

28.8.3 Fused pipeline correctness

The production implementation fuses steps 1–4 of the restrict pipeline into a single BFS pass (Section 9.5.4). Rather than proving equivalence with the idealized sequential pipeline (which uses a more conservative reachability filter), we prove that the fused pipeline produces a valid restricted instance directly.

Definition (Fused pipeline). The fused version runs a single BFS from \(r\). Each queue entry carries a node \(v\) and an ancestor annotation \(\hat{a}(v) \in R_f \cup \{\bot\}\), representing \(v\)’s nearest surviving ancestor in the traversal so far. Processing node \(v\):

Surviving case (\(\mathrm{anchor}(v) \in \Sigma\)): add \(v\) to \(R_f\), emit arc \((\hat{a}(v), v, \rho(\mathrm{anchor}'(\hat{a}(v)), \mathrm{anchor}'(v)))\). For each child \(c\) of \(v\), enqueue \((c, v)\).
Non-surviving case (\(\mathrm{anchor}(v) \notin \Sigma\)): do not add \(v\) to \(R_f\). For each child \(c\) of \(v\), enqueue \((c, \hat{a}(v))\).

The root \(r\) is processed first; if \(\mathrm{anchor}(r) \notin \Sigma\), the pipeline returns RootPruned. Fan reconstruction runs as a separate pass on \(R_f\).

Theorem 3 (Fused Pipeline Validity). The fused pipeline produces a valid restricted instance. Specifically:

For every \(v \in R_f\): \(\mathrm{anchor}'(v) \in \mathrm{Vert}(S_{\mathrm{tgt}})\).
For every arc \((a, v, e)\) in the output: \(e\) is a valid edge in \(S_{\mathrm{tgt}}\) between \(\mathrm{anchor}'(a)\) and \(\mathrm{anchor}'(v)\).
The output \((R_f, E_f, r)\) is a tree rooted at \(r\).
The ancestor annotation is correct: \(\hat{a}(v) = \alpha_{R_f}(v)\) for all \(v \in R_f \setminus \{r\}\), where \(\alpha_{R_f}\) is the nearest-surviving-ancestor function with respect to \(R_f\).

Proof. We prove (4) first, as (1)–(3) follow from it.

Proof of (4) by strong induction on BFS depth.

Let \(d(v)\) denote the BFS depth of node \(v\) (distance from \(r\) in the original tree).

Base case (\(d = 0\)). The root \(r\) is the only node at depth 0. It has no ancestor annotation (it is the root of the output tree). Vacuously, (4) holds for \(r\).

Inductive hypothesis. For all nodes \(u\) with \(d(u) \leq d\): if \(u \in R_f\), then \(\hat{a}(u) = \alpha_{R_f}(u)\).

Inductive step (\(d + 1\)). Let \(v\) be a node at depth \(d + 1\), and let \(p = \mathrm{parent}(v)\) be its parent in the original tree (\(d(p) = d\)). The BFS dequeues \(v\) with ancestor annotation \(\hat{a}(v)\), which was set when \(p\) was processed:

If \(p \in R_f\): then \(p\) passed \(\hat{a}(v) = p\) to all its children. The nearest surviving ancestor of \(v\) in the original tree (with respect to \(R_f\)) is \(p\) itself—because \(p\) is the immediate parent and \(p \in R_f\). So \(\hat{a}(v) = p = \alpha_{R_f}(v)\).
If \(p \notin R_f\): then \(p\) passed \(\hat{a}(v) = \hat{a}(p)\) to all its children. Since \(p \notin R_f\), the nearest surviving ancestor of \(v\) is the nearest surviving ancestor of \(p\). Now, \(p\) was dequeued at depth \(d\). Even though \(p \notin R_f\), \(p\)’s own ancestor annotation \(\hat{a}(p)\) was set by \(p\)’s parent \(\mathrm{parent}(p)\) at depth \(d - 1\).

We need: \(\hat{a}(p) = \alpha_{R_f}(p)\) even though \(p \notin R_f\). This requires extending the inductive hypothesis to cover non-surviving nodes.

Extended inductive hypothesis. For all nodes \(u\) with \(d(u) \leq d\): the ancestor annotation \(\hat{a}(u)\) equals the nearest node in \(R_f\) that is a proper ancestor of \(u\) in the original tree, if such a node exists.

The base case and surviving-parent case proceed identically. For the non-surviving case: \(\hat{a}(v) = \hat{a}(p)\), and by the extended hypothesis at depth \(d\), \(\hat{a}(p)\) is the nearest member of \(R_f\) strictly above \(p\). Since \(p \notin R_f\), this is also the nearest member of \(R_f\) strictly above \(v\) (as \(v\)’s path to any ancestor above \(p\) passes through \(p\)). Therefore \(\hat{a}(v) = \alpha_{R_f}(v)\).

This completes the induction.

Proof of (1). A node enters \(R_f\) only if \(\mathrm{anchor}(v) \in \Sigma\). The compiled migration’s remap sends surviving source vertices to target vertices, so \(\mathrm{anchor}'(v) = m.\mathrm{remap}(\mathrm{anchor}(v)) \in \mathrm{Vert}(S_{\mathrm{tgt}})\).

Proof of (2). By (4), \(\hat{a}(v) = \alpha_{R_f}(v) \in R_f\), so \(\mathrm{anchor}'(\hat{a}(v))\) is well-defined. The edge \(e = \rho(\mathrm{anchor}'(\hat{a}(v)), \mathrm{anchor}'(v))\) is produced by the resolver, whose correctness is guaranteed by the migration’s well-formedness (check_existence verifies that \(\rho\) covers all required ancestor–descendant pairs). If the resolver has no entry, the edge is found by unique-edge lookup in the target schema; if neither exists, the pipeline returns an error.

Proof of (3). Each \(v \in R_f \setminus \{r\}\) has exactly one arc, to \(\hat{a}(v)\). By (4), \(\hat{a}(v)\) is a proper ancestor of \(v\) in the original tree, so \(\hat{a}(v) \neq v\). Since the original tree is acyclic and \(\hat{a}\) maps each node to a strict ancestor, the output graph has no cycles. The root \(r\) has no incoming arc. Every \(v \in R_f\) has a path to \(r\) (follow \(\hat{a}\) repeatedly; each step moves to a strict ancestor, which terminates at \(r\) in finitely many steps). Therefore \((R_f, E_f, r)\) is a tree. \(\square\)

Relationship to the sequential pipeline

The fused pipeline and the sequential pipeline (Chapter 9) may produce different surviving sets when non-surviving intermediate nodes have surviving descendants. The sequential step 2 BFS only traverses through anchor-surviving nodes, potentially missing surviving descendants of pruned intermediates. The fused BFS traverses all children, reaching such descendants and connecting them to their nearest surviving ancestor via contraction.

The fused behavior is strictly more complete: every node in the sequential output is also in the fused output, but the fused output may contain additional nodes that the sequential step 2 would have missed. Both outputs are valid restricted instances; the fused version simply preserves more data.

The individual step functions (anchor_surviving, reachable_from_root, ancestor_contraction, resolve_edge) are retained for unit testing and debugging. Integration tests run both the fused pipeline and the step functions on every test case, comparing outputs on cases where the surviving set is connected (the common case, where both produce identical results).

28.8.4 `FxHash` substitutability

panproto replaces the default SipHash hasher with FxHash from the rustc-hash crate in performance-critical internal hash maps.

Theorem 4 (FxHash Substitutability). Let \(H_S\) and \(H_F\) denote HashMap instances using SipHash and FxHash respectively, both operating on key type \(K\) implementing Eq + Hash. For any sequence of operations \(\mathrm{ops} = [o_1, \ldots, o_n]\) drawn from \(\{\mathrm{insert}, \mathrm{get}, \mathrm{remove}\}\):

\[ \forall\, k \in K.\quad H_S.\mathrm{get}(k)\;\text{after}\;\mathrm{ops} \;=\; H_F.\mathrm{get}(k)\;\text{after}\;\mathrm{ops} \]

Proof. Rust’s HashMap<K, V, S> maintains the following correctness invariant for any hasher \(S\) satisfying the BuildHasher trait:

\[ H.\mathrm{get}(k) = \begin{cases} \mathrm{Some}(v) & \text{if the most recent operation on } k \text{ was } \mathrm{insert}(k, v) \\ \mathrm{None} & \text{otherwise} \end{cases} \]

This invariant is maintained by the hash map’s collision resolution mechanism, which depends on two properties of the key type:

Hash consistency: \(a = b \implies \mathrm{hash}(a) = \mathrm{hash}(b)\), as required by the Hash trait.
Equality correctness: Eq is a true equivalence relation.

Both SipHash and FxHash satisfy property (1) for all types implementing Hash, since both invoke the same Hash::hash methods and differ only in how they mix the output bytes. Property (2) is independent of the hasher.

The hash function determines only the internal bucket assignment. The correctness invariant above depends on bucket assignment only to the extent that collisions are resolved correctly, which both hashers guarantee (they produce the same u64 hash value for equal keys, and HashMap’s probing logic handles collisions deterministically).

The only behavioral difference is iteration order, which is explicitly unspecified by HashMap’s API contract and is not observable through get/insert/remove. panproto never relies on hash map iteration order for correctness (where deterministic order is needed, BTreeMap is used instead). \(\square\)

28.8.5 `Arc<str>` representation equivalence

All GAT identifiers (sort names, operation names, equation names, variable names in terms) use Arc<str> instead of String, enabling \(O(1)\) clone via reference-count increment.

Theorem 5 (Arc<str> Semantic Equivalence). For any values \(a, b: \texttt{Arc<str>}\), let \(\bar{a} = a.\mathrm{to\_string}()\) and \(\bar{b} = b.\mathrm{to\_string}()\) denote the corresponding String values. Then:

Equality: \(a = b \iff \bar{a} = \bar{b}\)
Hashing: \(\mathrm{hash}(a) = \mathrm{hash}(\bar{a})\)
Ordering: \(a \leq b \iff \bar{a} \leq \bar{b}\)
Serialization: \(\mathrm{deserialize}(\mathrm{serialize}(a)) = a\)

Proof. Arc<str> implements Deref<Target = str>. Both String and Arc<str> delegate their trait implementations to the inner str slice via this deref chain:

(1) PartialEq for Arc<str> compares via <str as PartialEq>::eq, performing byte-wise comparison of the underlying UTF-8 data. PartialEq for String does the same (via Deref<Target = str>). Since \(a.\mathrm{deref}()\) and \(\bar{a}.\mathrm{deref}()\) yield &str values with identical byte content by construction of \(\bar{a}\), the comparisons agree.

(2) Hash for Arc<str> delegates to <str as Hash>::hash, feeding the byte content and length into the hasher state. Hash for String also delegates to <str as Hash>::hash. Since the byte content is identical, the same sequence of bytes is fed to the hasher, producing identical hash values for any Hasher implementation.

(3) Ord for Arc<str> delegates to <str as Ord>::cmp, which performs lexicographic byte comparison. Ord for String follows the same delegation chain. Identical byte content produces identical comparison results.

(4) serde’s Serialize for Arc<str> (enabled by the rc crate feature) serializes the inner &str as a string value. Deserialize for Arc<str> deserializes a string and wraps it in a new Arc allocation. The round-trip preserves byte content: \(\mathrm{deserialize}(\mathrm{serialize}(a)).\mathrm{deref}() = a.\mathrm{deref}()\), so the result equals \(a\) by property (1).

The only behavioral difference between Arc<str> and String is the cost of clone(): Arc<str>::clone atomically increments a reference count (\(O(1)\), though with a memory fence), while String::clone allocates heap memory and copies bytes (\(O(n)\) in string length). This is a performance property, not a semantic one. \(\square\)

28.9 Guide to the Literature

This section provides context for papers and books cited throughout the tutorial. For each work, we note what it contributes, how it connects to panproto, what background it assumes, and how accessible it is. Full bibliographic details are in the bibliography.

28.9.1 Foundations: GATs, algebraic theories, and categories

Lawvere (1963) — Functorial Semantics of Algebraic Theories. Lawvere’s doctoral dissertation introduces treating algebraic theories as categories and models as functors. This is the origin of the “theories as mathematical objects” perspective that panproto inherits. The dissertation is short but assumes comfort with basic category theory. It is worth reading for the conceptual clarity of its central idea: a theory is not a syntactic object but a structured category, and a model is a structure-preserving functor out of it. Accessible with one semester of abstract algebra and basic category theory.

Cartmell (1986) — Generalised Algebraic Theories and Contextual Categories. Cartmell extends Lawvere’s framework to handle dependent sorts. The resulting formalism, GATs, is the mathematical foundation of panproto’s theory engine. The paper also introduces contextual categories, the semantic counterpart of GATs, and establishes that the theory of contextual categories is itself a contextual category—the self-description property. Technically demanding; assumes familiarity with basic category theory and some exposure to type theory.

Adámek and Rosický (1994) — Locally Presentable and Accessible Categories. A comprehensive treatment of the categorical foundations underlying algebraic theories, including initial algebras, colimits, and presentability. The chapters on initial algebras (Ch. 10) and locally finitely presentable categories (Ch. 1–3) are most relevant. Graduate-level category theory; reference material rather than introduction.

Mac Lane (1998) — Categories for the Working Mathematician. The standard reference for category theory. When you need the precise definition of an adjunction, Kan extension, or natural transformation, this is the book. Well-written and self-contained, assuming only basic algebra. Accessible with undergraduate mathematics.

Barr and Wells (1990) — Category Theory for Computing Science. A more applied introduction, oriented toward computer scientists. It covers the same core concepts as Mac Lane (1998) but with more attention to computational examples. The treatment of fibrations and limits/colimits is particularly clear. A good entry point if Mac Lane (1998) feels too abstract. Accessible with undergraduate computer science and some abstract algebra.

28.9.2 Functorial data migration

Spivak (2012) — Functorial Data Migration. The paper that launched the functorial approach to data migration. Spivak shows that a functor between database schemas induces three data migration functors forming an adjoint triple. The paper is accessible and includes many examples. Accessible with basic category theory (functors, adjunctions, Kan extensions).

Schultz et al. (2017) — Algebraic Databases. Schultz, Spivak, Vasilakopoulou, and Wisnesky extend functorial data migration to algebraic databases, where schemas carry equations. The CQL system is a direct ancestor of panproto’s design. It introduces the “typeside” concept—a fixed base theory analogous to panproto’s Level 0—but does not achieve full self-description. Assumes familiarity with Spivak (2012) and basic universal algebra.

Spivak and Wisnesky (2015) — Relational Foundations for Functorial Data Migration. A refinement of Spivak (2012) that grounds the migration functors in relational algebra, making the connection to SQL more explicit. Assumes familiarity with both category theory and database theory.

Uotila and Lu (2022) — A Formal Category Theoretical Framework for Multi-Model Data Transformations. Extends functorial data migration to multi-model databases (relational, document, graph, key-value). The multi-model perspective validates panproto’s protocol-parametric design. Assumes familiarity with Spivak (2012).

28.9.3 Lenses and bidirectional transformations

Foster et al. (2007) — Combinators for Bidirectional Tree Transformations. The foundational paper on lenses. Foster et al. define lenses as pairs of get/put functions satisfying round-trip laws. Well-written; assumes only basic functional programming knowledge. Accessible to any working developer.

Hofmann et al. (2011) — Symmetric Lenses. Hofmann, Pierce, and Wagner generalize asymmetric lenses to symmetric lenses, where both sides are treated equally. The complement plays a central role. Assumes familiarity with Foster et al. (2007). Accessible with basic PL theory background.

Johnson and Rosebrugh (2013) — Lenses, Fibrations and Universal Translations. Johnson and Rosebrugh connect lenses to fibrations, providing deep insight into why lenses work. Requires solid category theory background (fibrations, Grothendieck construction).

Clarke (2020) — Internal Lenses as Functors and Cofunctors. Clarke introduces delta lenses in an internal setting, providing a more general framework than the original get/put formulation. Assumes graduate-level category theory.

Litt et al. (2022) — Sequoia: A New Foundation for Bidirectional Transformations. Litt, Kleppmann, and Shapiro propose a comonadic foundation for bidirectional transformations. Assumes familiarity with the lens literature and basic comonad theory.

28.9.4 Graph and hypergraph transformation

Ehrig et al. (2006) — Fundamentals of Algebraic Graph Transformation. A comprehensive textbook covering the algebraic approach to graph transformation, including SPO and DPO rewriting, adhesive categories, and applications. Assumes basic category theory. The primary reference for panproto’s schema diff and patch operations.

Ehrig et al. (1973) — Graph Grammars: An Algebraic Approach. The origin of the algebraic approach to graph transformation. Historical interest; the ideas are developed more fully in Ehrig et al. (2006).

Löwe (1993) — Algebraic Approach to Single-Pushout Graph Transformation. Introduces SPO graph rewriting, which handles partial morphisms and deletions. Technical but clearly written. Assumes basic category theory.

Lack and Sobociński (2005) — Adhesive and Quasiadhesive Categories. Lack and Sobocinski identify adhesive categories as the abstract setting where pushout-based rewriting is well-behaved. Assumes graduate-level category theory. Key result: the category of schemas-as-graphs is adhesive.

Drewes et al. (1997) — Hyperedge Replacement Graph Grammars. Extends algebraic graph rewriting to hypergraphs. Assumes familiarity with Ehrig et al. (2006).

28.9.5 W-types and polynomial functors

Gambino and Hyland (2004) — Wellfounded Trees and Dependent Polynomial Functors. Gambino and Hyland develop the theory of W-types in dependent type theory, showing they are initial algebras of dependent polynomial functors. Technically demanding; assumes familiarity with both type theory and category theory.

Abbott et al. (2005) — Containers: Constructing Strictly Positive Types. Abbott, Altenkirch, and Ghani introduce containers as a constructive presentation of strictly positive type constructors. More accessible than Gambino and Hyland (2004); assumes basic type theory.

28.9.6 Software systems

Lynch et al. (2024) — GATlab: A Julia Package for Generalized Algebraic Theories. Lynch, Brown, Fairbanks, and Patterson describe GATlab, a Julia implementation of GATs with support for computational colimits, theory morphisms, and model checking. GATlab is the most direct intellectual predecessor of panproto’s design.

panproto’s Rust engine reimplements GATlab’s core algorithms (colimits of GAT presentations, theory morphism composition, model validation) and extends them with the migration, lens, and instance layers. If you want to understand the computational GAT machinery that panproto builds on, GATlab’s codebase and this paper are the place to start. Accessible with basic abstract algebra; the paper assumes no category theory beyond what it introduces.

Abbott, Michael, Thorsten Altenkirch, and Neil Ghani. 2005. “Containers: Constructing Strictly Positive Types.” Theoretical Computer Science 342 (1): 3–27. https://doi.org/10.1016/j.tcs.2005.06.002.

Adámek, Jiří, and Jiří Rosický. 1994. Locally Presentable and Accessible Categories. Vol. 189. London Mathematical Society Lecture Note Series. Cambridge University Press. https://doi.org/10.1017/CBO9780511600579.

Barr, Michael, and Charles Wells. 1990. Category Theory for Computing Science. Prentice Hall.

Cartmell, John. 1986. “Generalised Algebraic Theories and Contextual Categories.” Annals of Pure and Applied Logic 32: 209–43. https://doi.org/10.1016/0168-0072(86)90053-9.

Clarke, Bryce. 2020. “Internal Lenses as Functors and Cofunctors.” Applied Category Theory 2019, Electronic proceedings in theoretical computer science, vol. 323: 183–95. https://doi.org/10.4204/EPTCS.323.13.

Drewes, Frank, Hans-Jörg Kreowski, and Annegret Habel. 1997. “Hyperedge Replacement Graph Grammars.” In Handbook of Graph Grammars and Computing by Graph Transformation, edited by Gert Rozenberg, vol. 1. World Scientific. https://doi.org/10.1142/9789812384720_0002.

Ehrig, Hartmut, Karsten Ehrig, Ulrike Prange, and Gabriele Taentzer. 2006. Fundamentals of Algebraic Graph Transformation. Monographs in Theoretical Computer Science. Springer. https://doi.org/10.1007/3-540-31188-2.

Ehrig, Hartmut, Michael Pfender, and Hans Jürgen Schneider. 1973. “Graph-Grammars: An Algebraic Approach.” 14th Annual IEEE Symposium on Switching and Automata Theory (FOCS 1973), 167–80. https://doi.org/10.1109/SWAT.1973.11.

Foster, J. Nathan, Michael B. Greenwald, Jonathan T. Moore, Benjamin C. Pierce, and Alan Schmitt. 2007. “Combinators for Bidirectional Tree Transformations: A Linguistic Approach to the View-Update Problem.” ACM Transactions on Programming Languages and Systems 29 (3): Article 17. https://doi.org/10.1145/1232420.1232424.

Gambino, Nicola, and J. Martin E. Hyland. 2004. “Wellfounded Trees and Dependent Polynomial Functors.” Types for Proofs and Programs (TYPES 2003), Lecture notes in computer science, vol. 3085: 210–25. https://doi.org/10.1007/978-3-540-24849-1_14.

Hofmann, Martin, Benjamin C. Pierce, and Daniel Wagner. 2011. “Symmetric Lenses.” Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2011), 371–84. https://doi.org/10.1145/1926385.1926428.

Johnson, Michael, and Robert Rosebrugh. 2013. “Delta Lenses and Opfibrations.” Proceedings of the 2nd International Workshop on Bidirectional Transformations (BX 2013), Electronic communications of the EASST, vol. 57. https://doi.org/10.14279/tuj.eceasst.57.875.

Lack, Stephen, and Paweł Sobociński. 2005. “Adhesive and Quasiadhesive Categories.” RAIRO – Theoretical Informatics and Applications 39 (3): 511–45. https://doi.org/10.1051/ita:2005028.

Lawvere, F. William. 1963. “Functorial Semantics of Algebraic Theories.” PhD thesis, Columbia University.

Litt, Geoffrey, Martin Kleppmann, and Marc Shapiro. 2022. Project Cambria: Schema Evolution for CRDTs. Ink & Switch research essay. https://www.inkandswitch.com/cambria/.

Löwe, Michael. 1993. “Algebraic Approach to Single-Pushout Graph Transformation.” Theoretical Computer Science 109 (1–2): 181–224. https://doi.org/10.1016/0304-3975(93)90068-5.

Lynch, Owen, Kris Brown, James Fairbanks, and Evan Patterson. 2024. “GATlab: Modeling and Programming with Generalized Algebraic Theories.” arXiv Preprint arXiv:2404.04837, ahead of print. https://doi.org/10.48550/arXiv.2404.04837.

Mac Lane, Saunders. 1998. Categories for the Working Mathematician. 2nd ed. Vol. 5. Graduate Texts in Mathematics. Springer. https://doi.org/10.1007/978-1-4757-4721-8.

Schultz, Patrick, David I. Spivak, Christina Vasilakopoulou, and Ryan Wisnesky. 2017. “Algebraic Databases.” Theory and Applications of Categories 32 (16): 547–619. http://www.tac.mta.ca/tac/volumes/32/16/32-16abs.html.

Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217: 31–51. https://arxiv.org/abs/1009.1166.

Spivak, David I., and Ryan Wisnesky. 2015. “Relational Foundations for Functorial Data Migration.” Proceedings of the 15th Symposium on Database Programming Languages (DBPL 2015), 21–28. https://doi.org/10.1145/2815072.2815075.

Tarjan, Robert Endre. 1975. “Efficiency of a Good but Not Linear Set Union Algorithm.” Journal of the ACM 22 (2): 215–25. https://doi.org/10.1145/321879.321884.

Uotila, Valter, and Jiaheng Lu. 2022. “A Formal Category Theoretical Framework for Multi-Model Data Transformations.” arXiv Preprint arXiv:2201.04905, ahead of print. https://doi.org/10.48550/arXiv.2201.04905.

28.1 Categories, functors, and natural transformations

28.1.1 Categories

28.1.2 Functors

28.1.3 Natural transformations

28.1.4 Limits and colimits

28.1.5 Adjunctions

28.1.6 Kan extensions

28.1.7 Fibrations

28.2 Algebraic theories and Lawvere’s framework

28.2.1 Lawvere theories

28.2.2 The functorial insight

28.2.3 Limitations of ordinary algebraic theories

28.3 Generalized algebraic theories

28.3.1 Cartmell’s Framework

28.3.2 The GAT-EAT Equivalence

28.3.3 Locally finitely presentable categories

28.4 Functorial data migration

28.4.1 Spivak’s Framework

28.4.2 Schemas as categories, instances as functors

28.4.3 Algebraic databases

28.5 Lenses and bidirectional transformations

28.5.1 Asymmetric lenses

28.5.2 Symmetric lenses

28.5.3 Delta lenses

28.5.4 Internal lenses

28.5.5 Lens categories

28.5.6 Sequoia and comonadic lenses

28.6 Graph and hypergraph transformation

28.6.1 The algebraic approach

28.6.2 Adhesive categories

28.6.3 Single-pushout rewriting

28.6.4 Hypergraph rewriting

28.7 W-types and polynomial functors

28.7.1 Polynomial functors

28.7.2 W-types

28.7.3 Dependent polynomial functors

28.7.4 Containers

28.8 Correctness proofs

28.8.1 Index cache correctness

28.8.2 Path compression correctness

28.8.3 Fused pipeline correctness

28.8.4 FxHash substitutability

28.8.5 Arc<str> representation equivalence

28.9 Guide to the Literature

28.9.1 Foundations: GATs, algebraic theories, and categories

28.9.2 Functorial data migration

28.9.3 Lenses and bidirectional transformations

28.9.4 Graph and hypergraph transformation

28.9.5 W-types and polynomial functors

28.9.6 Software systems

28.8.4 `FxHash` substitutability

28.8.5 `Arc<str>` representation equivalence