3 Protocols as Parameters

4 Protocols as Parameters

A theory tells panproto what a schema looks like. But knowing the shape of schemas is only half the story. To migrate data, the engine also needs to know what data looks like: is it rows in a table? Nested JSON documents? Flat key-value records? The answer varies by protocol, and it determines how migration actually works.

panproto captures this with a simple idea: a protocol is a pair of theories. One theory for the shape of schemas, one for the shape of data. From that pair, the engine derives everything else—migration operators, existence checks, breaking-change classifiers, bidirectional lenses—without any protocol-specific code.

4.1 The two-theory architecture

Every supported protocol is described by exactly two parameters:

A schema theory $\mathbb{T}_S$, whose models are schemas. This determines what “a schema” looks like: vertices and edges, hyperedges, constraints, interfaces, or some combination.
An instance theory $\mathbb{T}_I$, whose models are data conforming to a schema. This determines what “a record” or “a row” looks like once you have a schema in hand.

A protocol is thus a pair $\mathcal{P} = (\mathbb{T}_S, \mathbb{T}_I)$. The following table shows representative protocols from each of the six theory groups that panproto identifies. (The full catalog of all 76 supported protocols is in Appendix C.)

A note on the naming convention: the Th prefix means “theory of.” ThGraph is the theory of directed graphs. ThConstraint is the theory of constraint-decorated structures. And $\mathrm{colimit}(\text{ThGraph}, \text{ThConstraint}, \text{ThMulti})$ means “merge these three theories into one, identifying their shared parts.” The colimit construction is how panproto assembles composite schema theories; we will come back to it later in this chapter.

Table 4.1: Representative protocols from each theory group.

Protocol	Schema Theory	Instance Theory	Restriction Operator
Group A: Constrained Multigraph + W-Type
ATProto	colimit(ThGraph, ThConstraint, ThMulti)	ThWType + ThMeta	Signature restriction + BFS + contraction
JSON Schema	colimit(ThGraph, ThConstraint, ThMulti)	ThWType	Signature restriction + BFS + contraction
Group B: Hypergraph + Set-Valued Map
SQL	colimit(ThHypergraph, ThConstraint)	ThFunctor	`restrict` (precomposition)
Parquet	colimit(ThHypergraph, ThConstraint)	ThFunctor	`restrict` (precomposition)
Group C: Simple Graph + Flat
Protobuf	ThSimpleGraph + ThConstraint	ThFlat	Field projection
Avro	ThSimpleGraph + ThConstraint	ThFlat	Field projection
Group D: Typed Graph + W-Type with Interfaces
GraphQL	colimit(ThGraph, ThConstraint, ThMulti, ThInterface)	ThWType	Signature restriction + BFS + contraction
TypeScript	colimit(ThGraph, ThConstraint, ThMulti, ThInterface)	ThWType	Signature restriction + BFS + contraction
Group E: Constrained Multigraph + W-Type + Meta
XML/XSD	colimit(ThGraph, ThConstraint, ThMulti, ThMeta)	ThWType	Signature restriction + BFS + contraction
Neo4j	colimit(ThGraph, ThConstraint, ThMulti, ThMeta)	ThWType	Signature restriction + BFS + contraction

Three patterns jump out of this table.

First, schema theories are assembled from building blocks. Real-world protocols combine orthogonal concerns—graph structure, constraints, multiplicity, interfaces—and panproto assembles the theory by merging smaller component theories.¹

Second, instance theories cluster into just three families: set-valued maps ($\text{ThFunctor}$), tree-shaped data ($\text{ThWType}$), and flat records ($\text{ThFlat}$). This is a surprisingly small number, given the diversity of formats.

Third, the restriction operator—the thing that actually migrates data—is determined by the instance theory. You do not choose a migration strategy. It follows from the theory. Set-valued map instances migrate via restrict (precomposition). W-type instances migrate via a tree surgery pipeline. Flat instances use field projection. The same migration morphism can be applied to any of these, but the data-level operation is different in each case.

4.2 Three families of instances

The instance theory determines what “data” means for a given schema. Each of the three families has a different data shape, a different migration operator, and different performance characteristics.

4.2.1 Set-valued maps ($\text{ThFunctor}$)

Think of a set-valued map instance as a filing cabinet. Each table gets a drawer. Each drawer holds a set of rows. Foreign keys are cross-references between drawers: “for this row in the posts drawer, the author is this row in the authors drawer.”

Formally, a schema is a collection of tables connected by foreign keys, and an instance assigns to each table a set of rows and to each foreign key a lookup function that maps rows of the source table to rows of the target table.² Referential integrity is built into the structure: following two foreign keys in sequence gives the same result whether you compose them first or chase them one at a time.

Migration in this family is restrict: given a schema morphism $f\colon S_1 \to S_2$ and an instance $I$ over $S_2$, you get an instance $f^* I$ over $S_1$ by looking up each old table’s corresponding new table and pulling back its rows. Foreign keys compose automatically.³

SQL databases, Confluent Schema Registry with Avro, Parquet files, and Arrow tables all fall into this family.

4.2.2 W-Types: tree-shaped data ($\text{ThWType}$)

Most protocols—ATProto, GraphQL, JSON Schema, TypeScript, and the majority of the 76 panproto supports—traffic in tree-shaped data. A Bluesky post is not a row in a table. It is a JSON document with nested objects, arrays, and references.

A W-type (wellfounded tree type) instance over a schema is a finitely-branching tree where each node is anchored to a schema vertex, each branch is labeled by a schema edge, and every path from root to leaf is finite.⁴ The root might be anchored to Post, a child to Text, a sibling to CreatedAt.

Migration here runs a five-step pipeline: signature restriction, reachability BFS, ancestor contraction, edge resolution, and fan reconstruction. The chapter on data lifting covers each step in detail.

4.2.3 Flat instances ($\text{ThFlat}$)

$\text{ThFlat}$ is a disciplined restriction of $\text{ThWType}$. When the schema is a simple graph (no parallel edges, no recursive types) and instances are constrained to depth-1 trees, the full W-type machinery reduces to field projection: keep the fields whose schema edges appear in the image of the migration map, discard the rest.

This captures a real semantic boundary. Protocols like Protobuf, Avro, and Thrift are deliberately flat. They are designed for high-throughput serialization, where recursive nesting would impose unacceptable overhead. Treating $\text{ThFlat}$ as a proper theory (rather than a runtime flag on $\text{ThWType}$) lets panproto enforce the depth-1 constraint structurally.

Every flat record is trivially a depth-1 tree, so any $\text{ThFlat}$ instance is automatically a valid $\text{ThWType}$ instance. The converse does not hold.

Exercise

Protobuf uses $\text{ThFlat}$ for its instance theory, but Protobuf messages can contain nested messages. Why is $\text{ThFlat}$ sufficient? What structural property of Protobuf makes nesting different from, say, JSON nesting?

Solution

In Protobuf, each nested message is a separately addressable, independently encoded unit with its own field numbers. The nesting is a reference, not a recursive tree structure. A Protobuf message is fully described by its immediate fields, which makes it depth-1 by design. Migration reduces to field projection, which is cheaper than the full tree surgery pipeline.

4.2.4 Comparing the three families

Table 4.2: The three instance theory families.

Aspect	Set-Valued Map ($\text{ThFunctor}$)	W-Type ($\text{ThWType}$)	Flat ($\text{ThFlat}$)
Data shape	Tables of rows	Nested trees	Depth-1 key-value records
Migration	`restrict` (precomposition)	Tree surgery pipeline	Field projection
Canonical users	SQL, Parquet, Arrow	ATProto, GraphQL, TypeScript	Protobuf, Avro, Thrift
Compositionality	Automatic	Requires reachability analysis	Trivial

No family is universally better. Set-valued maps have the best compositional properties (migration maps compose, and extend/complete give you round-tripping for free), but they require tabular data. W-types handle trees natively, at the cost of more complex migration logic. $\text{ThFlat}$ trades away arbitrary nesting for a migration operator so cheap it is nearly free to compute.

4.3 Assembling schema theories from building blocks

Look at the schema theory column in Table 4.1 again. Almost every entry is a colimit of simpler theories. This is not a coincidence: real-world protocols are assembled from orthogonal concerns, and the colimit construction lets panproto mirror that assembly mathematically.

The building blocks are:

ThGraph: vertices and directed edges.
ThConstraint: invariants on vertex/edge data.
ThMulti: arrays, optional fields, repeated edges.
ThInterface: abstract types implemented by concrete ones.
ThHyperEdge: edges connecting more than two vertices simultaneously.
ThMeta: out-of-band annotation layers.

ATProto needs graphs, constraints, and multiplicity: $\mathrm{colimit}(\text{ThGraph}, \text{ThConstraint}, \text{ThMulti})$. GraphQL adds interfaces: $\mathrm{colimit}(\text{ThGraph}, \text{ThConstraint}, \text{ThMulti}, \text{ThInterface})$. SQL replaces graphs with hypergraphs and drops multiplicity: $\mathrm{colimit}(\text{ThHypergraph}, \text{ThConstraint})$. Each protocol gets exactly the combination it needs.

The colimit itself is a category-theoretic construction that merges two theories while identifying their shared sub-theory.⁵ In practice, it works like a type-safe merge: sorts with the same name in both theories are unified, operations with the same signature are unified, and conflicts (same name, different structure) are reported as errors. The result is a single theory that simultaneously satisfies both components.

The pattern is extensible. If a new protocol needs graph structure, constraints, and some novel “versioning” concern, you write $\text{ThVersioning}$ as a new theory and take its colimit with the existing pieces. The engine derives migration, existence checks, and breaking-change detection for the new combination automatically.

4.4 How the engine uses the pair

When you instantiate panproto for a particular protocol, the engine reads the two theories and derives everything it needs:

From $\mathbb{T}_S$ it determines what sorts exist (vertices, edges, constraints, hyperedges, interfaces, …) and what operations connect them. This gives the engine the migration type: a migration must carry a map for each sort.

From $\mathbb{T}_I$ it determines the data representation (set-valued map, W-type, or flat) and therefore the restriction operator: how to apply a migration to an instance.

From the axioms in $\mathbb{T}_S$ it derives the existence checks: which conditions a proposed migration must satisfy to be valid (the next chapter covers these in detail).

From the interaction of $\mathbb{T}_S$ and $\mathbb{T}_I$ it derives the breaking-change classifier: which migrations are safe to deploy and which are not.

None of this requires protocol-specific code. The protocol author provides two theories; the engine provides the rest.

4.5 Defining a protocol in code

Here is how a protocol is defined using panproto’s TypeScript API:

import { Panproto } from '@panproto/core';

const panproto = await Panproto.init();

// start snippet define-protocol
const tomlConfig = panproto.defineProtocol({
  name: 'toml-config',
  schemaTheory: 'ThConstrainedGraph',
  instanceTheory: 'ThWType',
  edgeRules: [
    { edgeKind: 'key', srcKinds: ['section'], tgtKinds: [] },
    { edgeKind: 'subsection', srcKinds: ['section'], tgtKinds: ['section'] },
  ],
  objKinds: ['section'],
  constraintSorts: ['required', 'deprecated'],
});
// end snippet define-protocol

// start snippet use-protocol
const configSchema = tomlConfig.schema()
  .vertex('root', 'section')
  .vertex('root.database', 'section')
  .vertex('root.database.host', 'string')
  .vertex('root.database.port', 'integer')
  .edge('root', 'root.database', 'subsection', { name: 'database' })
  .edge('root.database', 'root.database.host', 'key', { name: 'host' })
  .edge('root.database', 'root.database.port', 'key', { name: 'port' })
  .constraint('root.database.host', 'required', 'true')
  .build();
// end snippet use-protocol

A protocol is a record with two fields. The engine reads the theory structure and derives migration types, restriction operators, existence checks, and breaking-change classifiers from it.

Exercise

If you designed a protocol with $\text{ThHypergraph}$ for schemas and $\text{ThWType}$ for instances, what kind of data format would it describe? Can you think of a real-world format that fits?

Solution

A format with multi-ary structural relationships (like SQL’s compound foreign keys) but tree-shaped data instances (like JSON documents). This would describe something like a document database with compound keys and nested values. MongoDB comes close: its schemas involve multi-field indexes and compound shard keys (hyperedge structure) while its documents are nested JSON trees (W-type instances).

4.6 The protocol design space

The two-parameter architecture organizes all 76 supported protocols into a design space. Each point is a $(\mathbb{T}_S, \mathbb{T}_I)$ pair; the six theory groups in Table 4.1 are six clusters in that space.

graph LR
    subgraph "Set-Valued Map Instances"
        B["Group B<br/>Hypergraph + Set-Valued Map<br/>(11 protocols)<br/>SQL, Parquet, Arrow, ..."]
    end
    subgraph "W-Type Instances"
        A["Group A<br/>Constrained Multigraph + W-Type<br/>(30 protocols)<br/>ATProto, JSON Schema, OpenAPI, ..."]
        D["Group D<br/>Typed Graph + W-Type + Interfaces<br/>(16 protocols)<br/>GraphQL, TypeScript, Rust Serde, ..."]
        E["Group E<br/>Constrained Multigraph + W-Type + Meta<br/>(9 protocols)<br/>XML/XSD, HTML, Neo4j, ..."]
    end
    subgraph "Flat Instances"
        C["Group C<br/>Simple Graph + Flat<br/>(9 protocols)<br/>Protobuf, Avro, Thrift, ..."]
    end

Figure 4.1: The protocol design space. The horizontal axis is schema theory complexity; the vertical axis is instance theory family.

Some questions this framing makes natural:

What separates Group A from Group D? Both use W-type instances over constrained multigraph schemas. The difference is that Group D adds $\text{ThInterface}$: abstract types (interfaces, traits, type classes) as first-class schema citizens.

What makes Group E distinct from Group A? Group E adds $\text{ThMeta}$ to the schema colimit and lifts the instance theory to $\mathrm{colimit}(\text{ThWType}, \text{ThMeta})$. The metadata layer participates in migration. This captures the fact that formats like XML, HTML, and Neo4j carry an annotation layer (XML attributes, HTML metadata, Neo4j property maps) alongside the tree structure.

What lives between Group B and Group A? A protocol with hypergraph schemas but W-type instances—perhaps a document database with compound keys and nested values. panproto can accommodate this by combining $\text{ThHypergraph}$, $\text{ThConstraint}$, and $\text{ThWType}$.

The design space is open. Adding a new protocol means adding a new point.

4.7 Historical context

The idea that data migration can be parameterized by algebraic structure has roots in several traditions.

Spivak (2012) showed that migration via structure-preserving maps between schema categories gives a clean, compositional account of SQL-style migration. The CQL project (Schultz et al. 2017) turned this into a practical tool for relational databases.

The W-type framework comes from type theory, where Gambino and Hyland (2004) showed that tree-shaped data structures arise naturally as initial algebras of polynomial functors. Abbott et al. (2005) connected this to “containers,” a way of describing data shapes that panproto uses to determine how a schema’s structure constrains the tree shape of its instances.

The insight that these traditions can be unified—that set-valued map instances, W-type instances, and flat instances all arise as models of different instance theories, with migration derived from the theory structure, and that 76 real-world protocols fall cleanly into six theory groups under this classification—is, as far as we know, original to panproto.

We now have the mathematical framework. The next chapter puts it to work: we will walk through a concrete migration, step by step, and see how the engine uses the theory pair to transform data.

Abbott, Michael, Thorsten Altenkirch, and Neil Ghani. 2005. “Containers: Constructing Strictly Positive Types.” Theoretical Computer Science 342 (1): 3–27. https://doi.org/10.1016/j.tcs.2005.06.002.

Barr, Michael, and Charles Wells. 1990. Category Theory for Computing Science. Prentice Hall.

Gambino, Nicola, and J. Martin E. Hyland. 2004. “Wellfounded Trees and Dependent Polynomial Functors.” Types for Proofs and Programs (TYPES 2003), Lecture notes in computer science, vol. 3085: 210–25. https://doi.org/10.1007/978-3-540-24849-1_14.

Schultz, Patrick, David I. Spivak, Christina Vasilakopoulou, and Ryan Wisnesky. 2017. “Algebraic Databases.” Theory and Applications of Categories 32 (16): 547–619. http://www.tac.mta.ca/tac/volumes/32/16/32-16abs.html.

Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217: 31–51. https://arxiv.org/abs/1009.1166.

The merge operation is formally a colimit (pushout) in the category of theories. Appendix A has the details.↩︎
An instance is a functor $I\colon \mathcal{C}^{\mathrm{op}} \to \mathrm{Set}$, where $\mathcal{C}$ is the schema viewed as a category. See Spivak (2012) and Appendix A.↩︎
In categorical language, $f^*$ is the pullback functor $\Delta_f$, with left adjoint extend ($\Sigma_f$) and right adjoint complete ($\Pi_f$).↩︎
The mathematical model is a W-type from dependent type theory (Gambino and Hyland 2004), related to the containers of Abbott et al. (2005). Appendix A spells this out.↩︎
See Barr and Wells (1990) for the general construction.↩︎

--- title: "Protocols as Parameters" id: sec-protocols-as-parameters --- # Protocols as Parameters {#sec-protocols-as-parameters} A theory tells panproto what a schema looks like. But knowing the shape of schemas is only half the story. To migrate data, the engine also needs to know what *data* looks like: is it rows in a table? Nested JSON documents? Flat key-value records? The answer varies by protocol, and it determines how migration actually works. panproto captures this with a simple idea: a protocol is a *pair* of theories. One theory for the shape of schemas, one for the shape of data. From that pair, the engine derives everything else---migration operators, existence checks, breaking-change classifiers, bidirectional lenses---without any protocol-specific code. ## The two-theory architecture Every supported protocol is described by exactly two parameters: 1. A **schema theory** $\mathbb{T}_S$, whose models are schemas. This determines what "a schema" looks like: vertices and edges, hyperedges, constraints, interfaces, or some combination. 2. An **instance theory** $\mathbb{T}_I$, whose models are data conforming to a schema. This determines what "a record" or "a row" looks like once you have a schema in hand. A protocol is thus a pair $\mathcal{P} = (\mathbb{T}_S, \mathbb{T}_I)$. The following table shows representative protocols from each of the six theory groups that panproto identifies. (The full catalog of all 76 supported protocols is in [Appendix C](../appendices/C-protocol-catalog.qmd).) A note on the naming convention: the `Th` prefix means "theory of." `ThGraph` is the theory of directed graphs. `ThConstraint` is the theory of constraint-decorated structures. And $\mathrm{colimit}(\text{ThGraph}, \text{ThConstraint}, \text{ThMulti})$ means "merge these three theories into one, identifying their shared parts." The colimit construction is how panproto assembles composite schema theories; we will come back to it later in this chapter. | Protocol | Schema Theory | Instance Theory | Restriction Operator | |----------|--------------|----------------|----------| | **Group A: Constrained Multigraph + W-Type** | | | | | ATProto | colimit(ThGraph, ThConstraint, ThMulti) | ThWType + ThMeta | Signature restriction + BFS + contraction | | JSON Schema | colimit(ThGraph, ThConstraint, ThMulti) | ThWType | Signature restriction + BFS + contraction | | **Group B: Hypergraph + Set-Valued Map** | | | | | SQL | colimit(ThHypergraph, ThConstraint) | ThFunctor | `restrict` (precomposition) | | Parquet | colimit(ThHypergraph, ThConstraint) | ThFunctor | `restrict` (precomposition) | | **Group C: Simple Graph + Flat** | | | | | Protobuf | ThSimpleGraph + ThConstraint | ThFlat | Field projection | | Avro | ThSimpleGraph + ThConstraint | ThFlat | Field projection | | **Group D: Typed Graph + W-Type with Interfaces** | | | | | GraphQL | colimit(ThGraph, ThConstraint, ThMulti, ThInterface) | ThWType | Signature restriction + BFS + contraction | | TypeScript | colimit(ThGraph, ThConstraint, ThMulti, ThInterface) | ThWType | Signature restriction + BFS + contraction | | **Group E: Constrained Multigraph + W-Type + Meta** | | | | | XML/XSD | colimit(ThGraph, ThConstraint, ThMulti, ThMeta) | ThWType | Signature restriction + BFS + contraction | | Neo4j | colimit(ThGraph, ThConstraint, ThMulti, ThMeta) | ThWType | Signature restriction + BFS + contraction | : Representative protocols from each theory group. {#tbl-protocol-table} Three patterns jump out of this table. First, schema theories are *assembled from building blocks*. Real-world protocols combine orthogonal concerns---graph structure, constraints, multiplicity, interfaces---and panproto assembles the theory by merging smaller component theories.^[The merge operation is formally a colimit (pushout) in the category of theories. [Appendix A](../appendices/A-formal-foundations.qmd) has the details.] Second, instance theories cluster into just *three families*: set-valued maps ($\text{ThFunctor}$), tree-shaped data ($\text{ThWType}$), and flat records ($\text{ThFlat}$). This is a surprisingly small number, given the diversity of formats. Third, the restriction operator---the thing that actually migrates data---is *determined by the instance theory*. You do not choose a migration strategy. It follows from the theory. Set-valued map instances migrate via `restrict` (precomposition). W-type instances migrate via a tree surgery pipeline. Flat instances use field projection. The same migration morphism can be applied to any of these, but the data-level operation is different in each case. ## Three families of instances The instance theory determines what "data" means for a given schema. Each of the three families has a different data shape, a different migration operator, and different performance characteristics. ### Set-valued maps ($\text{ThFunctor}$) {#sec-functor-instances} Think of a set-valued map instance as a filing cabinet. Each table gets a drawer. Each drawer holds a set of rows. Foreign keys are cross-references between drawers: "for this row in the `posts` drawer, the author is this row in the `authors` drawer." Formally, a schema is a collection of tables connected by foreign keys, and an instance assigns to each table a set of rows and to each foreign key a lookup function that maps rows of the source table to rows of the target table.^[An instance is a functor $I\colon \mathcal{C}^{\mathrm{op}} \to \mathrm{Set}$, where $\mathcal{C}$ is the schema viewed as a category. See @spivak2012 and [Appendix A](../appendices/A-formal-foundations.qmd).] Referential integrity is built into the structure: following two foreign keys in sequence gives the same result whether you compose them first or chase them one at a time. Migration in this family is `restrict`: given a schema morphism $f\colon S_1 \to S_2$ and an instance $I$ over $S_2$, you get an instance $f^* I$ over $S_1$ by looking up each old table's corresponding new table and pulling back its rows. Foreign keys compose automatically.^[In categorical language, $f^*$ is the pullback functor $\Delta_f$, with left adjoint `extend` ($\Sigma_f$) and right adjoint `complete` ($\Pi_f$).] SQL databases, Confluent Schema Registry with Avro, Parquet files, and Arrow tables all fall into this family. ### W-Types: tree-shaped data ($\text{ThWType}$) {#sec-wtype-instances} Most protocols---ATProto, GraphQL, JSON Schema, TypeScript, and the majority of the 76 panproto supports---traffic in *tree-shaped* data. A Bluesky post is not a row in a table. It is a JSON document with nested objects, arrays, and references. A W-type (wellfounded tree type) instance over a schema is a finitely-branching tree where each node is anchored to a schema vertex, each branch is labeled by a schema edge, and every path from root to leaf is finite.^[The mathematical model is a W-type from dependent type theory [@gambino2004], related to the containers of @abbott2005. [Appendix A](../appendices/A-formal-foundations.qmd) spells this out.] The root might be anchored to `Post`, a child to `Text`, a sibling to `CreatedAt`. Migration here runs a five-step pipeline: signature restriction, reachability BFS, ancestor contraction, edge resolution, and fan reconstruction. The chapter on data lifting covers each step in detail. ### Flat instances ($\text{ThFlat}$) {#sec-flat-instances} $\text{ThFlat}$ is a disciplined restriction of $\text{ThWType}$. When the schema is a simple graph (no parallel edges, no recursive types) and instances are constrained to depth-1 trees, the full W-type machinery reduces to *field projection*: keep the fields whose schema edges appear in the image of the migration map, discard the rest. This captures a real semantic boundary. Protocols like Protobuf, Avro, and Thrift are deliberately flat. They are designed for high-throughput serialization, where recursive nesting would impose unacceptable overhead. Treating $\text{ThFlat}$ as a proper theory (rather than a runtime flag on $\text{ThWType}$) lets panproto enforce the depth-1 constraint structurally. Every flat record is trivially a depth-1 tree, so any $\text{ThFlat}$ instance is automatically a valid $\text{ThWType}$ instance. The converse does not hold. ::: {.callout-caution} ## Exercise Protobuf uses $\text{ThFlat}$ for its instance theory, but Protobuf messages *can* contain nested messages. Why is $\text{ThFlat}$ sufficient? What structural property of Protobuf makes nesting different from, say, JSON nesting? ::: ::: {.callout-tip collapse=true} ## Solution In Protobuf, each nested message is a separately addressable, independently encoded unit with its own field numbers. The nesting is a reference, not a recursive tree structure. A Protobuf message is fully described by its immediate fields, which makes it depth-1 by design. Migration reduces to field projection, which is cheaper than the full tree surgery pipeline. ::: ### Comparing the three families | Aspect | Set-Valued Map ($\text{ThFunctor}$) | W-Type ($\text{ThWType}$) | Flat ($\text{ThFlat}$) | |--------|---------------------|-------------------|----------------| | Data shape | Tables of rows | Nested trees | Depth-1 key-value records | | Migration | `restrict` (precomposition) | Tree surgery pipeline | Field projection | | Canonical users | SQL, Parquet, Arrow | ATProto, GraphQL, TypeScript | Protobuf, Avro, Thrift | | Compositionality | Automatic | Requires reachability analysis | Trivial | : The three instance theory families. {#tbl-instance-comparison} No family is universally better. Set-valued maps have the best compositional properties (migration maps compose, and `extend`/`complete` give you round-tripping for free), but they require tabular data. W-types handle trees natively, at the cost of more complex migration logic. $\text{ThFlat}$ trades away arbitrary nesting for a migration operator so cheap it is nearly free to compute. ## Assembling schema theories from building blocks Look at the schema theory column in @tbl-protocol-table again. Almost every entry is a colimit of simpler theories. This is not a coincidence: real-world protocols are assembled from orthogonal concerns, and the colimit construction lets panproto mirror that assembly mathematically. The building blocks are: - **ThGraph**: vertices and directed edges. - **ThConstraint**: invariants on vertex/edge data. - **ThMulti**: arrays, optional fields, repeated edges. - **ThInterface**: abstract types implemented by concrete ones. - **ThHyperEdge**: edges connecting more than two vertices simultaneously. - **ThMeta**: out-of-band annotation layers. ATProto needs graphs, constraints, and multiplicity: $\mathrm{colimit}(\text{ThGraph}, \text{ThConstraint}, \text{ThMulti})$. GraphQL adds interfaces: $\mathrm{colimit}(\text{ThGraph}, \text{ThConstraint}, \text{ThMulti}, \text{ThInterface})$. SQL replaces graphs with hypergraphs and drops multiplicity: $\mathrm{colimit}(\text{ThHypergraph}, \text{ThConstraint})$. Each protocol gets exactly the combination it needs. The colimit itself is a category-theoretic construction that merges two theories while identifying their shared sub-theory.^[See @barr1990 for the general construction.] In practice, it works like a type-safe merge: sorts with the same name in both theories are unified, operations with the same signature are unified, and conflicts (same name, different structure) are reported as errors. The result is a single theory that simultaneously satisfies both components. The pattern is extensible. If a new protocol needs graph structure, constraints, and some novel "versioning" concern, you write $\text{ThVersioning}$ as a new theory and take its colimit with the existing pieces. The engine derives migration, existence checks, and breaking-change detection for the new combination automatically. ## How the engine uses the pair When you instantiate panproto for a particular protocol, the engine reads the two theories and derives everything it needs: From $\mathbb{T}_S$ it determines what sorts exist (vertices, edges, constraints, hyperedges, interfaces, ...) and what operations connect them. This gives the engine the *migration type*: a migration must carry a map for each sort. From $\mathbb{T}_I$ it determines the data representation (set-valued map, W-type, or flat) and therefore the *restriction operator*: how to apply a migration to an instance. From the axioms in $\mathbb{T}_S$ it derives the *existence checks*: which conditions a proposed migration must satisfy to be valid (the next chapter covers these in detail). From the interaction of $\mathbb{T}_S$ and $\mathbb{T}_I$ it derives the *breaking-change classifier*: which migrations are safe to deploy and which are not. None of this requires protocol-specific code. The protocol author provides two theories; the engine provides the rest. ## Defining a protocol in code Here is how a protocol is defined using panproto's TypeScript API: ```{.typescript include="../code/ts/10-custom-protocol.ts" snippet="define-protocol"} ``` A protocol is a record with two fields. The engine reads the theory structure and derives migration types, restriction operators, existence checks, and breaking-change classifiers from it. ::: {.callout-caution} ## Exercise If you designed a protocol with $\text{ThHypergraph}$ for schemas and $\text{ThWType}$ for instances, what kind of data format would it describe? Can you think of a real-world format that fits? ::: ::: {.callout-tip collapse=true} ## Solution A format with multi-ary structural relationships (like SQL's compound foreign keys) but tree-shaped data instances (like JSON documents). This would describe something like a document database with compound keys and nested values. MongoDB comes close: its schemas involve multi-field indexes and compound shard keys (hyperedge structure) while its documents are nested JSON trees (W-type instances). ::: ## The protocol design space The two-parameter architecture organizes all 76 supported protocols into a design space. Each point is a $(\mathbb{T}_S, \mathbb{T}_I)$ pair; the six theory groups in @tbl-protocol-table are six clusters in that space. ```{mermaid} %%| label: fig-design-space %%| fig-cap: "The protocol design space. The horizontal axis is schema theory complexity; the vertical axis is instance theory family." graph LR subgraph "Set-Valued Map Instances" B["Group B Hypergraph + Set-Valued Map (11 protocols) SQL, Parquet, Arrow, ..."] end subgraph "W-Type Instances" A["Group A Constrained Multigraph + W-Type (30 protocols) ATProto, JSON Schema, OpenAPI, ..."] D["Group D Typed Graph + W-Type + Interfaces (16 protocols) GraphQL, TypeScript, Rust Serde, ..."] E["Group E Constrained Multigraph + W-Type + Meta (9 protocols) XML/XSD, HTML, Neo4j, ..."] end subgraph "Flat Instances" C["Group C Simple Graph + Flat (9 protocols) Protobuf, Avro, Thrift, ..."] end ``` Some questions this framing makes natural: What separates Group A from Group D? Both use W-type instances over constrained multigraph schemas. The difference is that Group D adds $\text{ThInterface}$: abstract types (interfaces, traits, type classes) as first-class schema citizens. What makes Group E distinct from Group A? Group E adds $\text{ThMeta}$ to the schema colimit and lifts the instance theory to $\mathrm{colimit}(\text{ThWType}, \text{ThMeta})$. The metadata layer participates in migration. This captures the fact that formats like XML, HTML, and Neo4j carry an annotation layer (XML attributes, HTML metadata, Neo4j property maps) alongside the tree structure. What lives between Group B and Group A? A protocol with hypergraph schemas but W-type instances---perhaps a document database with compound keys and nested values. panproto can accommodate this by combining $\text{ThHypergraph}$, $\text{ThConstraint}$, and $\text{ThWType}$. The design space is open. Adding a new protocol means adding a new point. ## Historical context The idea that data migration can be parameterized by algebraic structure has roots in several traditions. @spivak2012 showed that migration via structure-preserving maps between schema categories gives a clean, compositional account of SQL-style migration. The CQL project [@schultz2017] turned this into a practical tool for relational databases. The W-type framework comes from type theory, where @gambino2004 showed that tree-shaped data structures arise naturally as initial algebras of polynomial functors. @abbott2005 connected this to "containers," a way of describing data shapes that panproto uses to determine how a schema's structure constrains the tree shape of its instances. The insight that these traditions can be unified---that set-valued map instances, W-type instances, and flat instances all arise as models of different instance theories, with migration derived from the theory structure, and that 76 real-world protocols fall cleanly into six theory groups under this classification---is, as far as we know, original to panproto. We now have the mathematical framework. The next chapter puts it to work: we will walk through a concrete migration, step by step, and see how the engine uses the theory pair to transform data.

3 Protocols as Parameters

4 Protocols as Parameters

4.1 The two-theory architecture

4.2 Three families of instances

4.2.1 Set-valued maps (\(\text{ThFunctor}\))

4.2.2 W-Types: tree-shaped data (\(\text{ThWType}\))

4.2.3 Flat instances (\(\text{ThFlat}\))

4.2.4 Comparing the three families

4.3 Assembling schema theories from building blocks

4.4 How the engine uses the pair

4.5 Defining a protocol in code

4.6 The protocol design space

4.7 Historical context