Related work - panproto

In plain terms

panproto sits at the intersection of five older research lines: schemas as algebraic theories, bidirectional lenses, format-preserving parsing, structured version control, and cross-protocol data exchange. Each of those lines has its own canonical papers, its own working systems, and its own vocabulary. This chapter walks each one in turn, names the works that shaped panproto’s design, and says where panproto’s contribution differs.

The chapter is a capstone. Each individual idea is treated in depth in the topical explanation chapters; the goal here is the relationship between panproto and the literature, in one place.

1. Schemas as theories

The idea that a database schema is an algebraic object, and that data conforming to a schema is a model of that object, is mature. The line panproto follows most closely begins with Cartmell’s generalised algebraic theories in the 1970s and continues through David Spivak’s functorial data model in the 2010s, the CQL line developed by Schultz and Wisnesky, the ACSets framework built by Patterson, Lynch, and Fairbanks, and a recent thread of multi-model unification work led by Jiaheng Lu. panproto’s panproto-gat and panproto-protocols crates are an instance of this lineage rather than a departure from it.

Cartmell (1986) introduced generalised algebraic theories in a thesis (1978) and the closely related 1986 paper in the Annals of Pure and Applied Logic. GATs extend many-sorted equational logic with dependent sorts: the sort of a term may depend on the value of another term, so a theory can express constructions like “an edge with vertex parameters” without leaving the algebraic-theory framework. This is the formal apparatus panproto uses; the comment block at the top of crates/panproto-gat/src/lib.rs cites Cartmell directly. The Sort, Operation, Equation, and Theory types in panproto-gat cover Cartmell’s four ingredient classes (sorts/types, operator symbols, axioms, theories). The feature that makes GATs generalised rather than ordinary many-sorted equational logic is dependent sorts, the sort of a term may itself depend on terms; the type-equality / term-equality axiom split (which Cartmell treats as two separate axiom kinds) is a consequence of dependent sorts, since type-equality becomes nontrivial in their presence. panproto collapses both axiom kinds into one Equation constructor with a discriminating tag. GATs sit one rung above many-sorted equational logic and one rung below full dependent type theory, with enough expressive power to handle dependent record-style schemas without inheriting decidability problems from the higher rung.

Spivak (2012) is the headline categorical-database paper. A schema is a finitely-presented category, an instance is a Set-valued functor on it, and a translation between schemas is a functor of categories. The contribution is the equivalence between schemas-as-categories and the category of small categories: every theorem about small categories transfers to schemas. From a translation $F : C \to D$ , Spivak constructs three migration functors. The pullback $Δ_{F}$ transports D-instances back to C-instances contravariantly; the right pushforward $Π_{F}$ performs joins; the left pushforward $Σ_{F}$ performs Skolemized unions. The triple $(Σ_{F} ⊣ Δ_{F} ⊣ Π_{F})$ is an adjoint chain. panproto’s migrate operation in panproto-mig::lift and the three Σ/Δ/Π modes of CQL-style migration are direct descendants of these constructions, although panproto works with GAT presentations rather than free categories with path equivalences.

Schultz & Wisnesky (2017) is the more directly load-bearing precedent for panproto’s design. Where Spivak treats schemas as categories, Schultz and Wisnesky treat schemas as multi-sorted equational theories extending a fixed type-side: sorts are entities, function symbols are foreign keys and attributes, equations are integrity constraints. An instance is itself a theory that extends the schema by adjoining 0-ary constant symbols and ground equations; its denotation is the term model, which is initial among models of the extended theory. A schema mapping is a morphism of theories that is the identity on the type side. The same Σ⊣Δ⊣Π adjoint triple appears, but now derived from substitution along a theory morphism rather than precomposition with a functor. The TAC version of the algebraic-databases programme (Schultz et al. 2017) develops the mathematical theory at length; the JFP article is the implementer-facing presentation. Schultz and Wisnesky also push the design pattern further than the older paper does, using pushouts of schemas and instances as a global-as-view design pattern for data integration, with the integrated schema literally the pushout of source schemas. This is the construction that panproto’s Composing protocols by colimit imports wholesale and turns into the colimit and pushout_by_name functions in panproto-gat::colimit. The choice to make schemas theories rather than categories, with the type-side fixed by a meta-theory, is also panproto’s choice; the ThWType, ThGraph, ThConstraint, ThMulti, and ThMeta building blocks in crates/panproto-protocols/src/theories.rs play the role Schultz and Wisnesky’s type-side plays.

Patterson et al. (2022) sit in the same family but optimise for a different setting. An attributed C-set is a functor $C \to Set$ from a fixed finitely-presented category $C$ together with attribute data attached via a slice over a chosen C-set. The paper unifies graphs, data frames, Petri nets, and wiring diagrams in one framework and gives a Julia implementation in Catlab.jl that is performance-competitive with hand-written domain-specific data structures. Limits and colimits are computed pointwise, which makes pullback, pushout, and limit-based queries cheap. panproto and the ACSets work agree about the goal (one categorical framework, many concrete data shapes) but differ in mechanics: Patterson, Lynch, and Fairbanks define a schema as a small category $∣ S ∣$ equipped with a partition map $S : ∣ S ∣ \to 2$ separating combinatorial sorts from attribute-type sorts, then build the attributed C-set category as a slice in the C-set topos (Theorem 2 of the paper) and prove functoriality in both the schema $S$ and the attribute-type structure $K$ . panproto fixes a different meta-theory (the category of GAT presentations rather than the category of finite-product categories) and takes colimits there. The two approaches are most easily compared as instances of the same idea (a categorical schema with a parameterised instance), with ACSets giving an unusually clean implementation story for scientific computing and panproto giving an unusually direct path to cross-protocol schema engineering.

Lu (2025) is, as of 2025, the most direct adjacent claim of “categorical framework that covers several data models at once.” Lu unifies relational, XML, and property-graph data via a thin set category: a thin category with three object kinds (entity, attribute, relationship), where pullback realises the join operator and is also used to characterise multivalued dependencies, pushout realises the connected-component construction of graph data, and two reduced representations (1RR and 2RR) recover Boyce-Codd/XML normal form and fourth normal form respectively. The work is competitive with panproto on the unification axis but covers a different set of formats (relational, XML, and property graph rather than the wire-format protocols panproto targets) and stays close to the relational-DB tradition with its emphasis on normal forms and dependency closures. The honest framing for panproto is that the unification ambition is now common [Lu (2025); Patterson et al. (2022); Schultz & Wisnesky (2017)]; what panproto adds is coverage of the schema languages developers actually ship against (Avro, OpenAPI, ATProto, FHIR, the eighteen linguistic-annotation formats listed in crates/panproto-protocols/src/) and a colimit-of-presentations methodology rather than a fixed meta-theory.

Two older lines deserve named mentions even though they do not directly drive panproto’s code. The theory-colimit construction itself is older than the institutions framework: Burstall and Goguen introduced “putting theories together” via pushouts in (1977) (IJCAI 1977), developed the construction as the semantics of the Clear specification language (Burstall and Goguen 1980), and later abstracted it institution-theoretically (Goguen and Burstall 1992), with the modern textbook treatment in Sannella & Tarlecki (2012). An institution packages signatures, sentences, models, and a satisfaction relation; theory presentations in any institution form a category with colimits, computable in the signature category, and pushouts of theories then merge specifications. Concrete institutions have been worked out for equational logic, conditional equational logic, Horn-clause logic, first-order logic, and (more recently) dependent type theories; a GAT institution has not been published under that name to our knowledge, though organising GAT presentations as an institution is straightforward (signatures = GAT signatures, sentences = type and term equations, models = standard GAT models, satisfaction = the usual semantic relation). panproto’s pushout_by_name is the Burstall & Goguen (1977) move at the GAT level: pushout of GAT presentations along a shared sub-presentation. The relationship is that panproto is one instance of the theory-colimit construction, organising GAT presentations of wire-format schema languages into a category and pushing them out there. Spivak and Kent’s (2012) olog paper is the categorical-knowledge-representation cousin. An olog is a finite-limit, finite-colimit sketch with English-labelled boxes and arrows, intended as a user-facing way of authoring categorical schemas. The intuition that a schema in any of these formalisms is a category-shaped object with labels in natural language is the olog intuition; the difference is that panproto enforces the labels through the GAT type system and uses tree-sitter for the layout layer rather than freehand authoring.

Against this background, panproto’s specific contribution comes from three moves. First, presenting each industrial schema language as its own GAT and taking a colimit in the category of GAT presentations to obtain a polyglot schema theory carries the Burstall & Goguen (1977) theory-colimit construction (later abstracted institution-theoretically in Goguen & Burstall (1992) and given textbook treatment in Sannella & Tarlecki (2012)) over to wire formats. Spivak’s (2012) framework, the Schultz & Wisnesky (2017) line (CQL), the ACSets work of Patterson et al. (2022) (Catlab.jl), and Lu (2025) each fix a single meta-theory (finitely-presented categories, multi-sorted equational logic, attributed C-sets, thin set categories) and vary schemas inside it; panproto varies the theory presentation and composes presentations. Second, panproto’s crates/panproto-protocols/src/atproto.rs is the first formal categorical treatment of ATProto Lexicon. Third, the combination of GAT-presented schemas with format-preserving CST complements (Section 3) and four-artifact pushout merge (Section 4) runs the categorical machinery against the byte-level representation that the precedents do not address.

2. Bidirectional lenses and the round-trip laws

Lenses are pairs of functions $get : S \to A$ and $put : A \times S \to S$ satisfying laws that make updating the small piece $A$ and propagating back to $S$ behave coherently. The modern form is from Foster et al. (2007), and the family has since grown to cover relational lenses (Bohannon et al. 2006), symmetric and edit lenses [Hofmann et al. (2011); Hofmann et al. (2012)], profunctor optics [Pickering et al. (2017); Clarke et al. (2024)], and generic-deriving libraries that automatically dispatch optic kind (Lens, Prism, Traversal) from a structural classification of focus. panproto’s panproto-lens crate, and especially its Protolens type, sit inside this family.

Foster et al. (2007) set the asymmetric template: a lens is a pair $(get, put)$ between source $C$ and view $A$ (partial functions $C \to A$ and $A \times C \to C$ ), with totality, well-behavedness (GetPut, PutGet), and the stronger PutPut law for the “very well-behaved” subclass. Creation of a fresh source from a view alone is handled by extending the universe with a missing-element marker $Ω$ so that put a Ω defines the creation case. The Foster paper does not stop at the algebra: it builds a combinator library (id, composition, xfork, filter, prune, focus, rename, map, wmap, copy, conditional combinators). Bohannon and colleagues’ (2006) relational lenses then port the discipline from trees to relations. Their select, join_dl, and drop primitives have typing rules that require the schema’s functional dependencies to be in tree form (the attributes appearing in the FDs split into pairwise disjoint sets whose dependency graph is a forest), so that an internal record revision operation (used to define the putback of each combinator) can re-write tuples to restore the FDs after a view update. The combination gives a small calculus in which projection-, join-, and union-views admit lawful updaters, which is essentially what panproto’s ThConstraint building block in crates/panproto-protocols/src/theories.rs captures at the schema-language level: foreign-key constraints become equations in the schema theory, and any lens that respects those equations propagates them.

The closest direct precedent for “optic kind determined by edge type” is the profunctor-optics work that licenses it. Pickering and colleagues (2017) represent each optic as a polymorphic function $\forall p . C p \Rightarrow p a b \to p s t$ , where the constraint class $C$ on the profunctor $p$ selects the optic kind: $C = Cartesian$ gives lenses (because first lifts a transformer on the component to a transformer on the pair), $C = Cocartesian$ gives prisms (because left lifts to the sum), and $C = Cartesian + Cocartesian + Monoidal$ gives traversals. Profunctor optics compose by ordinary function composition; concrete representations do not. The paper’s Figure 5 is a diamond of inclusions (Adapter into Lens and into Prism, Lens and Prism into Traversal) that drops out of which constraints are conjoined. The categorical update by Clarke et al. (2024) unifies the framework further: an optic is the coend $Optic ((A, B), (S, T)) := \int^{M} C (S, M ⊙ A) \otimes D (M ⊛ B, T)$ , varying the two monoidal actions $⊙$ and $⊛$ recovers fifteen optic families in a single table (adapter, lens, monoidal lens, algebraic lens, monadic lens, linear lens, prism, coalgebraic prism, grate, glass, affine traversal, traversal, kaleidoscope, setter, fold), and the construction extends to mixed and $V$ -enriched settings. panproto’s OpticKind enumeration in crates/panproto-lens/src/optic.rs and the prop / variant / item edge classification in panproto-gat::sort implement the dispatch as a runtime classifier over a cross-protocol IR; the contribution is not the dispatch itself but that the indexing object is panproto’s schema graph rather than a single host-language ADT or a single-protocol schema.

Hofmann et al. (2011) introduced the symmetric variant of the lens laws. A symmetric lens carries a complement object $C$ that captures information about $S$ not present in $A$ , plus a pair $(putl, putr)$ that propagates in both directions through the complement. The symmetric category $LENS$ has a symmetric monoidal tensor product but Theorem 5.1 of the paper proves it has neither categorical products nor categorical sums. Hofmann et al. (2012) pushes the framework further by replacing whole-state transformations with edits: each object is a module $⟨ X, init_{X}, \partial X, ⊙_{X} ⟩$ pairing a set with a monoid of edits and a partial monoid action, and a lens is a pair of stateful monoid homomorphisms threading a complement set $C$ under a consistency relation $K$ . Theorem 7.1 of the edit-lens paper proves a one-to-one correspondence between equivalence classes of edit lenses over canonical “differ” modules (whose edit monoid is overwrite-only) and the state-based symmetric lenses of Hofmann et al. (2011). This is what licenses panproto treating the complement as a first-class versioned artifact: an edit-lens-style complement is exactly the thing the schematic VCS commits alongside data and schema. Pre-flight check Complement::is_compatible, the Object::Complement and Object::CstComplement variants in panproto-vcs::object, and the partial-monoid merge rule on the complement (see Lenses and round-trip laws) all sit on this foundation.

Litt and colleagues’ (2020) Cambria essay is the engineering precedent for “schema migration as a graph of composable lenses” and was directly influential on the panproto migration story. The PaPoC 2021 paper that grew out of the essay (Schema Evolution in Distributed Systems with Edit Lenses) makes the connection to Hofmann et al. (2012) explicit: a schema migration is an edit lens, replays of past edits compose through the lens, and divergent branches reconcile through the symmetric structure. panproto’s panproto-mig::compose function composes morphisms in essentially the Cambria way; the difference is that the migration is decided in the GAT category rather than over JSON-schema lenses alone, and the four-artifact bundle (protocol theory, schema, instance data, lens complement) is versioned together in panproto-vcs.

Pacheco & Cunha (2011) give the algebraic counterpart: an equational calculus over point-free lenses in which the standard point-free laws (id, $\circ$ -assoc, $\times$ -functor, $+$ -fusion, distl-nat, $η$ -nat) lift to the bidirectional setting, and bidirectional fold and unfold satisfy uniqueness so that fold-fusion can be applied directly at the lens level. They show that the lens category is not as well-behaved as $Set$ (no categorical products, sums, or exponentials, only the corresponding bi-functors and projections-with-defaults) and mechanise the calculus in the pointless-lenses and pointless-rewrite Haskell libraries. Pacheco et al. (2012) carries the same idea over to delta lenses, where the change-propagation account replaces whole-state updates. panproto’s symbolic simplification of ProtolensChain in crates/panproto-lens/src/protolens.rs extends the same move: where the Pacheco line operates on XML schemas and inductive data types in Haskell, panproto’s chains span protocol theories joined by coercion morphisms across the 50-protocol catalogue.

The categorical foundations of lenses have been deepened considerably since Foster et al. (2007). Diskin and colleagues’ (2011) delta-lens line gives the change-propagation account that panproto’s data-migration story implicitly relies on. The dependent-optics and categorical-cybernetics line (Riley’s Categories of Optics; Spivak’s Generalized Lens Categories via Functors; Capucci, Gavranović, Hedges, and Rischel on cybernetic systems; Niu and Spivak on polynomial functors) generalises the framework to indexed and parameterised optic families and to dependent types. The contrast with the recent dependent-optics literature is that panproto ships a runtime for prop/variant/item dispatch over a cross-protocol IR, applying the existing categorical machinery to the 50-protocol catalogue.

3. Format-preserving parse and emit

The byte-equal round-trip law $emit (parse (b)) = b$ has been the goal of a long line of lossless-CST work: rust-analyzer’s rowan, Microsoft Roslyn, Instagram’s LibCST, the Prettier and Biome formatter trees, and a research line around Spoofax and Stratego. Each of these targets one language. Two threads from the bidirectional-transformations literature give the more general design: lens-based config editors (Augeas, Boomerang) and grammar-based bidirectional pretty-printers (BiYacc and its cousins). panproto’s ParseEmitLens (in crates/panproto-parse/src/parse_emit_lens.rs) is positioned between the two.

Lutterkort (2008) is the closest analogue to panproto’s approach at a framework level. Augeas treats every config file as an XPath-like tree backed by the file bytes, and ships hand-written lenses (in a small ML-style DSL) that map file ↔ tree. The lens primitives (key, label, seq, store, del) act on regular-expression matches; the combinators (subtree, concatenation, union, iteration) compose them with a Boomerang-derived typechecker that enforces unambiguous concatenation, union, and iteration over regular languages. Augeas’s grammar formalism is restricted to regular languages (the paper’s Section 6 lists removing this restriction as future work, since it precludes context-free formats like httpd.conf), so it handles /etc/hosts, /etc/passwd, and the like, but not nested-context formats. panproto’s ParseEmitLens reuses the same lens factorisation but lifts the parser layer to tree-sitter (or one of the structured format codecs in crates/panproto-io), so the regular-language ceiling is gone.

The Boomerang line is the algebraic core that Augeas builds on. Foster et al. (2007) give the original combinators for bidirectional tree transformations; Bohannon et al. (2008) adds resourceful lenses with key and match primitives so that the put direction can align reorderable chunks by key rather than by position. They prove that resourceful lenses are quasi-oblivious: a lens is quasi-oblivious with respect to an equivalence $\sim$ if $c \sim c^{'} ⟹ put a c = put a c^{'}$ (the EquivPut law), which is the condition that lets the put direction respect reorderings as if they were equalities. Foster et al. (2008) then generalises to quotient lenses: lenses well-behaved modulo programmer-specified equivalences $\sim_{C}$ and $\sim_{A}$ , with canonizers (canonize / choose) inserted anywhere in a chain via lquot and rquot operators. The quotient-lens typechecker uses a coarse Identity / Any approximation that scales to the 4200-line Boomerang lens for the UniProtKB/SwissProt protein-sequence format reported in Section 8 of the quotient-lens paper. panproto’s complement carries the same role as the dictionary skeleton in resourceful lenses and the canonizer state in quotient lenses: a sidecar that records what get discarded so that put can put it back in a position-and-key-preserving way. The ParseEmitLens does not yet expose lquot/rquot to users (the canonizer story for parse/emit is implicit in the layout-policy machinery), but the schema-level constraint vocabulary already carries the equivalence relations the canonizers operate on.

The grammar-based thread is the other ancestor. Zhu et al. (2015)’s BiYacc shows that a single DSL program can denote both a parser and a reflective printer: a putback that takes both the abstract syntax tree and the original concrete string and produces an updated concrete string that preserves syntactic characteristics (extra parentheses, whitespace, encoded negations) of the original. BiYacc programs satisfy GetPut and PutGet, and compile to BiFluX for execution. panproto does not compile from a bidirectional grammar; it inherits the grammar from tree-sitter and supplies the bidirectional layer separately, but the contract is the same one BiYacc proves: parse ∘ emit is the identity on the canonical schema, and emit ∘ parse preserves layout where the lens complement records it. Brunsfeld (2018) is the underlying parser-generator citation; the panproto contribution is the lens factorisation on top of 261 grammars, not the parser technology itself.

Two adjacent threads belong in the same neighbourhood. de & Visser (2012) give a token-stream-and-origin-tracking algorithm for layout preservation under AST refactorings: each AST node points into the token stream, the ConstructText reconstruction algorithm (abbreviated CT in proofs) reuses original text for unchanged subterms and pretty-prints only newly constructed ones, and a DIFF over the new list against origin tags handles inserted and deleted children. They prove correctness ( $PARSE (CT (t)) = t$ ) and a maximal-layout-preservation theorem, and round out the system with heuristic comment-binding patterns (Preceding(1), Preceding(2), Succeeding(1..3)). Their Section 7.5 notes that the lens framework loses the fine-grained relation between subterms and text fragments needed for layout preservation, and singles this out as the obstacle the reflective-printer trick in BiYacc responds to. panproto sits closer to the de Jonge–Visser side: the complement records exactly the byte-position and interstitial information the AST omits, and emit_pretty reuses it when it is present and falls back to a layout policy when it is not. Pombrio & Krishnamurthi (2014) takes the same lens framing in a different domain: their CONFECTION system desugars and resugars syntactic-sugar pattern rules, Section 6.1 frames the desugar/unexpand pair explicitly as a lens, and Theorem 1 in Section 6.1.2 proves PutGet iff a disjointness condition on the rule list holds. The resugaring properties are Emulation and Abstraction (Theorems 3, 4, 5) plus Coverage (informal; the paper explicitly says Coverage is not formalised and can only be strived for in practice), and they translate directly to what panproto’s layout-enrichment fibration is doing one level up: stripping a constraint sort is the desugaring step, and re-attaching it via a registered synthesis driver is the resugaring step.

4. Schema versioning as structured merge

Two distinct precedents shape panproto’s VCS layer. The categorical-VCS line owns the merge-as-pushout semantics. Mimram & Giusto (2013) gave the formal account; Pijul is the working CLI implementation; Angiuli and colleagues’ (2014) homotopical patch theory recasts the same construction inside homotopy type theory. In all three, files (or patches between files) are objects and morphisms in a category, and a three-way merge is the pushout of the patches against the common ancestor. Conflicts are failures of the pushout to be defined, or of the universal property to factor. Roundy’s (2005) Darcs is the engineering ancestor that motivated this thread.

Mimram & Giusto (2013) fix the simplest interesting case: a file is a function $A : [n] \to L$ from a length- $n$ index set into an alphabet of lines, and a patch $f : A \to B$ is an injective increasing partial function on the index sets that preserves labels wherever it is defined (undefined positions are deletions; positions in the codomain outside the image are insertions). The category $L$ of files and patches is the free monoidal category on insertion generators $η_{a}$ and deletion generators $ε_{a}$ subject to $ε_{a} \circ η_{a} = id_{I}$ (deleting an inserted line is identity); the subcategory $L^{+}$ of total injective patches is the free monoidal category on the insertion generators alone. Concurrent edits give a span $A_{1} f_{1} A f_{2} A_{2}$ , and the natural definition of “merging” is the pushout of this span. The central observation is that $L$ is not finitely cocomplete: the patches $ab c ab a d b$ that insert different lines at the same position have no pushout in $L$ , which is the categorical face of a textual conflict. Theorem 15 of the paper describes the free finite conservative cocompletion of the insertion-only subcategory $L^{+}$ as the category of finite sets equipped with a transitive relation $<$ and morphisms that are functions respecting the relation; the labelled and partial-function generalisation (Theorem 20, p. 12) adds labels $ℓ : A \to L$ and replaces functions with partial functions, giving the full cocompletion $P$ of $L$ . Pushouts in $P$ are computed as disjoint unions quotiented by the span equivalence and equipped with the transitive closure of the inherited relations; cycles can arise during cocompletion (Example 18 of the paper exhibits one explicitly), which is how the framework records “the user must pick between two orderings of a conflicting line.” The single-file insertion-only construction generalises to insertion-and-deletion (Theorem 20, partial functions instead of total ones) and is left in the paper as a target for richer base categories (multi-file, XML trees, event-structure repositories). panproto’s merge.rs does not implement the partial-function version literally, but the move it makes is the same one: detect that the naive pushout fails in the schema-theory category $Th_{GAT}$ , cocomplete by adjoining a conflict descriptor that records the divergent additions as parallel arrows, and resolve by choosing a section.

Angiuli, Morehouse, Licata, and Harper (2014) make the same construction but inside homotopy type theory, where the answers come out cleaner because the groupoid laws are free. A patch theory in their setting is a higher inductive type whose points are repository contexts, whose 1-paths are patches, and whose 2-paths are patch laws. Identity, inverse, and composition come from the path operations refl, !, and ∘; the groupoid laws (associativity, units, inverses cancel) are automatic because the identity type is already a $\infty$ -groupoid. A model of a patch theory is then a function out of the HIT into a universe of sets and bijections, and functorial semantics makes the model preserve patch laws automatically: every program that consumes the HIT must respect the equations between paths. The paper’s three running examples (an integer counter that is literally the circle; a fixed-length document with a swap-lines patch and an indep / noop law saying edits to independent lines commute; an add/remove patch theory indexed by histories of compositions) culminate in a merge operation that takes a span of diverging patches and returns a cospan completing it into a commuting square (Section 3.2). Their merge is more flexible than the Mimram–Di Giusto pushout in one direction (it accepts arbitrary divergent patches, with a “merge undoes both” fallback when nothing better exists) and more restrictive in another (it requires inverses to live inside the theory, which the authors flag in Section 7 as “conceptually questionable and practically problematic”). Their workaround is to index contexts by patch histories and define merge over History rather than over the underlying file (Section 6.4), which is structurally what panproto-vcs already does when it commits the lens complement alongside the schema: the complement carries the history information that a pure file-level patch category cannot. The homotopical framing also makes one of panproto’s design choices explicit. The Mimram–Di Giusto $P$ allows “cyclic” objects to represent conflicts; the Angiuli et al. patch-with-laws example uses a 2-path to equate the two orderings rather than equipping the merge result with a cyclic order. panproto sits between the two: a vertex-set conflict is reified as a conflict object the user resolves, but a permutation of independent additions does not produce a new object because the underlying GAT theory already quotients by the appropriate equations.

The schema-and-data VCS line owns the engineering of “a Merkle DAG of commits with schema + data inside each commit.” TerminusDB, Dolt, and the Noms prolly-tree lineage are working examples; Merkle (1988) is the hash-tree primitive they all rely on, and the operational git tradition that Chacon & Straub (2014) document is the immediate engineering precedent. None of these systems frames merge categorically; they treat schema conflicts operationally (cell-by-cell, table-by-table) and have no notion of a lens complement attached to a commit.

Curino, Moon, and Zaniolo’s PRISM workbench (2008) is the load-bearing precedent at the operator level. PRISM defines eleven Schema Modification Operators (SMOs): CREATE TABLE, DROP TABLE, RENAME TABLE, COPY TABLE, MERGE TABLE, PARTITION TABLE, DECOMPOSE TABLE, JOIN TABLE, ADD COLUMN, DROP COLUMN, RENAME COLUMN (plus NOP), each translated to a pair of Disjunctive Embedded Dependencies (DEDs): forward DEDs that describe the migration of data from the old to the new schema, and backward DEDs that describe the reverse. The DED pairs feed a chase-and-backchase query rewriter (Deutsch and Tannen’s MARS engine) so that legacy queries written against an old schema version are automatically translated to run against the new one. Section 4.2 of the paper lists six operators with a perfect unique inverse (RENAME COLUMN, RENAME TABLE, PARTITION TABLE, CREATE TABLE, ADD COLUMN, plus NOP) and Table 3 marks JOIN TABLE and DECOMPOSE TABLE as perfect-inverse-only under additional conditions (“yes/no”); the remaining operators (DROP TABLE, COPY TABLE, MERGE TABLE, DROP COLUMN) require a quasi-inverse in the sense of Fagin, Kolaitis, Popa, and Tan: the inverse is not the strict mapping inverse but the “best one can do to recover ground instances.” Figure 3 places each operator on a redundancy / information-preservation grid that tells the database administrator what they are getting up front. PRISM was stress-tested against the full 170-version, 4.5-year schema-evolution history of MediaWiki/Wikipedia and reported 97.2% of evolution steps fully automated; on the 66 most common query templates from the Wikipedia query log, 74% of queries (worst-case across the version timeline) were automatically supported, compared to 84% under manual rewriting. A separate 27-query subset of those templates was used for the runtime-performance comparison in Figure 6 of the paper. panproto inherits two diagnostic ideas directly. First, the PRISM observation that schema change without a record-level lift story is incomplete survives intact: panproto’s panproto-mig::existence check is the equivalent gate, refusing to certify a migration whose data side is not constructible. Second, the quasi-inverse / perfect-inverse partition is the precedent for the CoercionClass lattice (Iso / Retraction / Projection / Opaque) in crates/panproto-gat/src/sort.rs: each class is a precise statement about which kind of inverse a lens carries, the same diagnostic move PRISM makes at the SQL-operator level. Where panproto goes elsewhere is the embedding into a Merkle DAG with categorical merge: PRISM treats each schema-evolution step in isolation and rewrites queries through a chain of DEDs; panproto commits each step into a content-addressed DAG and merges divergent branches by pushout, with the SMO-style information-preservation report as a side output rather than the central artifact.

Litt and colleagues’ (2020) Project Cambria essay, and the PaPoC 2021 paper it grew into (Litt et al. 2021), are the immediate lens-based precedent. Cambria’s data model is a directed graph in which nodes are schema versions and edges are bidirectional lenses; a document tagged with its writing schema is translated to a reading schema by composing the lenses along the shortest path through the graph. The essay credits Hofmann, Pierce, and Wagner’s edit lenses (Hofmann et al. 2012) as the direct inspiration; the PaPoC paper extends the construction with the framing that JSON-patch operations are the edits a Cambria lens propagates. The system was integrated with the Automerge CRDT to demonstrate read-time, per-replica translation in a peer-to-peer setting. panproto’s panproto-vcs::merge builds on the same idea but takes a step the Cambria graph does not: the schema-version graph is promoted from a navigation aid (find a path of lenses) to a content-addressed DAG with pushout-based three-way merge, and the lens complement, the schema, the protocol theory, and the data instances are committed together as one Merkle object rather than living in separate artifacts. The Cambria authors’ own list of open problems (recursive schemas, cross-document migrations, data augmentation, lens repositories) reads as a roadmap that panproto’s GAT-and-CST machinery is set up to answer, although none of those answers should be claimed without further work.

panproto’s contribution is the four-artifact unification of these two lines. A panproto-vcs::CommitObject (in crates/panproto-vcs/src/object.rs) carries a protocol theory ID, schema ID, data set IDs, lens complement IDs, and CST complement IDs together. Three-way structural merge (crates/panproto-vcs/src/merge.rs) detects overlap via pullback and promotes it to pushout in the underlying theory category, with verify_pushout_universal checking the real universal property at the vertex level. Vertex-level factorisation is verified; edge-level factorisation is deferred. No prior system places protocol theory, schema, instance data, and lens complement into one DAG with pushout-based merge semantics; the combination is what panproto-vcs contributes on top of the two precedents.

A second contrast worth recording is with Shapiro and colleagues’ (2011) conflict-free replicated data types. CRDTs and structured-merge VCS solve nearby problems by very different means: CRDTs guarantee strong eventual convergence by restricting the algebra of allowed operations; panproto-vcs allows arbitrary schema operations and decides merge by category-theoretic universal property. The two are complementary rather than competing; cf. Shapiro et al. (2011) for the canonical statement of the CRDT side, and the Cambria integration with Automerge (Litt et al. 2021) for one way the two layers can compose.

5. Cross-protocol translation and data exchange

Translation between schema languages has two well-developed precedents: the data-exchange tradition that runs from Fagin, Kolaitis, Miller, and Popa’s ICDT 2003 paper through Fagin, Kolaitis, Popa, and Tan and the Clio project, and the engineering-IR tradition that runs from Apache Calcite through Substrait, Apache Arrow, MLIR, and ONNX. panproto sits between them: it inherits the universal-solution and second-order-tgd machinery from the data-exchange line and the hub-and-spoke adapter architecture from the IR line, and turns both into one pipeline over GAT-presented schemas.

Fagin et al. (2005) establish the core of a universal solution as the canonical materialised target instance. A data-exchange setting is a quadruple $(S, T, Σ_{s t}, Σ_{t})$ of source schema, target schema, source-to-target tuple-generating dependencies (tgds) $\forall x (ϕ_{S} (x) \to \exists y ψ_{T} (x, y))$ , and target tgds and equality-generating dependencies (egds). A universal solution $J$ for a source instance $I$ is a target instance from which every other solution receives a homomorphism; universal solutions exist iff solutions exist, and one can be computed by the chase when $Σ_{t}$ is weakly acyclic (Theorem 2.6, originally from the 2003 paper). Universal solutions are not unique up to isomorphism: Example 3.1 exhibits an infinite family $J_{m}$ of nonisomorphic universal solutions of different sizes for one source instance. The paper picks out the core of $J$ (the smallest substructure that is also a homomorphic image of $J$ ) as the unique smallest universal solution up to isomorphism and proves that it is preserved under chase steps for tgds and egds (Proposition 3.4). Computing the core of an arbitrary structure is intractable: Theorem 4.2 shows CORE IDENTIFICATION is DP-complete, even on undirected graphs. Inside data exchange the picture is cleaner: Theorem 5.2 (greedy algorithm) gives a polynomial-time algorithm for computing the core when $Σ_{s t}$ is a set of tgds and $Σ_{t}$ is a set of egds, Theorem 5.9 (blocks algorithm, $Σ_{t} = \emptyset$ ) extends to the case where the source is unavailable (relevant when Clio has materialised a universal solution and discarded the source), and Theorem 5.15 (Algorithm 5.14) carries the blocks algorithm to the egd case. The core is also the right answer for certain-answer semantics: Proposition 6.4 proves that for existential queries $q$ , $q (core (J))_{↓}$ equals the certain answers on universal solutions, and the core is the unique universal solution (up to isomorphism) with this property for every conjunctive query with inequalities $\neq =$ . The CoercionClass lattice (Iso / Retraction / Projection / Opaque) in crates/panproto-gat/src/sort.rs is the Fagin-style invariant transplanted to the schema-presentation level: each class records what kind of homomorphism a cross-protocol coercion carries, which directly determines whether the lifted instance is universal, sub-universal, or projection-only at the target.

Fagin et al. (2005) answer the next question: what is the right language for composing schema mappings? A composition $M_{12} \circ M_{23}$ takes the source-to-target relation defined by two successive mappings and asks whether it can be defined by a single mapping in the same logic. Proposition 3.4 of the paper exhibits a composition definable by an infinite set of source-to-target tgds but by no finite set; Proposition 3.5 strengthens this to a composition undefinable by any (finite or infinite) set of source-to-target tgds; Theorem 3.6 sharpens further with an explicit construction whose composition query is NP-complete (by reduction from 3-COLORABILITY) and is not definable in $L_{\infty ω}^{ω}$ , hence not in least-fixed-point logic. The paper introduces second-order tgds (SO tgds): existential second-order formulas $\exists f_{1} \dots \exists f_{m} (⋀_{i} \forall x_{i} (ϕ_{i} \to ψ_{i}))$ with Skolem function symbols and equalities between terms, satisfying a safety condition that makes them domain-independent. Theorem 4.6 proves that SO tgds are closed under composition with an explicit composition algorithm (introduce a second-order existential $\exists f$ for each existential first-order $\exists y$ in $M_{12}$ ; substitute into $M_{23}$ ; collect implications). Section 5 extends the chase to SO tgds and proves the universal-solution / polynomial-time-certain-answers results carry through (Proposition 5.2, Corollary 5.3). The takeaway for panproto: the right object to compose across protocols is not the underlying schema mapping but a Skolemised second-order generalisation of it, with explicit function symbols standing for the choices the existential quantifiers make. panproto’s ProtolensChain in crates/panproto-lens/src/protolens.rs carries exactly this structure: the lens chain is a sequence of mappings with explicit complement objects standing in for the Skolem choices, composed by the chain combinator with the same closure property the SO tgd composition theorem provides.

The categorical version of the same construction is in the CQL line. Schultz & Wisnesky (2017) (CQL; previously called AQL) presents a schema mapping as a functor between schemas-as-theories, defines the $Σ ⊣ Δ ⊣ Π$ adjoint triple of data migration functors, computes universal solutions as colimits in the category of instances, and gives a uniform pushout-based design pattern for data integration. Schultz et al. (2017) is the long-form mathematical treatment. The CQL implementation is the closest existing analogue of what panproto’s cross-protocol pipeline does, and the construction panproto uses for panproto-gat::colimit and pushout_by_name is structurally the same one CQL implements for relational schemas; panproto carries the construction to GAT-presented wire-format schemas and to instance data that lives natively in those formats (JSON, CBOR, Protobuf, Avro, FHIR) rather than relational tuples.

Apache Calcite (Begoli et al. 2018) is the engineering precedent for hub-and-spoke query-language IRs. Calcite’s architecture is a modular query optimiser (rules, metadata providers, two planner engines: a cost-based Volcano-style dynamic-programming engine and an exhaustive rule-application engine), a relational-algebra core, and an adapter architecture in which each backend (Cassandra, Druid, Elasticsearch, MongoDB, Pig, Spark, Splunk, JDBC dialects) supplies a SchemaFactory, a Schema, and a calling convention trait that the optimiser uses to push down operators into the backend’s native query language. The paper reports Calcite is embedded in Apache Drill, Hive, Solr, Phoenix, Kylin, Apex, Flink, Samza, Storm, MapD, Lingual, and Qubole Quark (Table 1), with several hundred optimisation rules in the rule set. panproto’s protocol-colimit machinery is the same hub-and-spoke shape applied to schemas rather than queries: each protocol crate in crates/panproto-protocols/ plays the role of a Calcite adapter, the universal protocol-graph plays the role of Calcite’s logical relational algebra, and panproto-mig::hom_search is the analogue of Calcite’s rule-application engine. The substantive difference is that Calcite’s IR is SQL-typed relational algebra (with full column types including complex ARRAY, MAP, and MULTISET constructors), augmented with pluggable traits that record physical properties such as ordering, partitioning, and the calling convention used to drive the optimiser, and the adapter contracts are written by hand; panproto’s IR is a GAT-presented schema theory, and the adapter contracts are derived by colimit from the per-protocol theory presentations. Substrait and MLIR are the same hub-and-spoke pattern in other domains: Substrait defines a cross-engine query-plan IR with producers and consumers across DuckDB, Velox, Apache Arrow, and others; MLIR provides a dialect framework for compiler IRs spanning machine learning, parallel computation, and hardware synthesis. Apache Arrow plays the role of the in-memory columnar data layout that several of these systems use as the physical exchange format.

panproto’s cross-protocol pipeline is an instance of the data-exchange/IR construction with one substantive addition: the schema, the instance data, and the lens complement are all transported together along the chain. The pipeline composes pushouts in panproto-gat::colimit, hom-search migrations in panproto-mig::hom_search, and coverage reports in panproto-mig::coverage; the CoercionClass lattice records translation loss at the schema level the way Fagin’s universal-solution / core distinction records it at the instance level; the four-artifact bundle (protocol theory, schema, instance, complement) carries the data exchange forward in a single step that a Calcite adapter or a Substrait plan cannot represent because they have no notion of a complement object.

panproto ships 50 protocols today, drawn from wire formats (Protobuf, Avro, Thrift, FlatBuffers, Cap’n Proto), JSON family (JSON Schema, OpenAPI, AsyncAPI, CloudEvents, JSON-LD, GeoJSON), graph and RDF (RDF/Turtle, SPARQL, GraphQL), columnar (Parquet, Arrow, ORC), document and configuration (XML, YAML, TOML, CUE, Dhall, Nickel), domain-specific (FHIR, ATProto Lexicon, GTFS, OpenAPI Routes), and linguistic-annotation (CoNLL-U, Universal Dependencies, PropBank, FrameNet, TimeML, etc.). Each is presented as a GAT inside crates/panproto-protocols/, and the polyglot schema theory is the colimit of those presentations in $Th_{GAT}$ . The compositional construction (pushout of two presentations $\to$ derived migration $\to$ coverage report) is what Calcite gives you for SQL dialects and what CQL gives you for relational schemas. Doing it across 50 industrial wire-format schema languages with one categorical engine is, as an engineering artifact, what panproto’s cross-protocol layer adds to the literature.

What panproto adds

The five sections above place panproto in five mature research lines. Across those lines, panproto delivers the following:

A polyglot schema theory constructed as a colimit in the category of GAT presentations, with each of 50 industrial wire-format schema languages (Protobuf, Avro, Thrift, FlatBuffers, Cap’n Proto, JSON Schema, OpenAPI, AsyncAPI, CloudEvents, JSON-LD, GeoJSON, GraphQL, SPARQL, RDF/Turtle, Parquet, Arrow, ORC, XML, YAML, TOML, CUE, Dhall, Nickel, FHIR, ATProto Lexicon, GTFS, and the linguistic-annotation family) presented as its own GAT and composed with the rest. This is the Goguen & Burstall (1992) theory-colimit move transplanted from logics to wire formats, run at industrial scale rather than as a single worked example.
A universal CST factorisation across 261 tree-sitter grammars plus the major data-format codecs, with the formatting residue committed as an explicit lens complement. No prior system carries layout as a first-class versioned artifact across this breadth of languages.
A four-artifact pushout merge over a content-addressed DAG: protocol theory, schema, instance data, and lens complement are merged together in one operation that verifies the categorical universal property at the vertex level. The Mimram–Di Giusto, Pijul, and homotopical-patch-theory line gives the pushout-as-merge semantics for files; the PRISM line gives schema-modification operators; the Cambria line gives schema-version graphs with edit lenses; panproto unifies the three with shared content-addressing and shared merge semantics.
The first formal categorical semantics for ATProto Lexicon, produced as one entry in the protocol catalogue rather than as a standalone study.
An end-to-end pipeline (parse → decorate → lens → migrate → emit) that runs the categorical machinery against real bytes: GAT colimits for protocol composition, hom-search for migration derivation, profunctor-optics dispatch for runtime lens application, CST complements for layout preservation, and pushout merge for version control. Each stage is one named precedent from the literature; running them as one composable engine against a 50-protocol catalogue is the engineering contribution.

The topical explanation chapters elsewhere in this book work out each stage in detail.

Keyboard shortcuts

panproto