12 panproto-protocols: Protocol Definitions

JSON Schema, Protobuf, SQL DDL, GraphQL SDL—these look nothing alike on the surface. But each describes a shape (a schema) and a way to fill that shape with data (an instance), and each can be modeled as a pair of GATs. The panproto-protocols crate is where all of those pairs live. It defines 60+ protocols across 9 categories, each assembled from the same small set of reusable building-block theories via colimit (the “structured merge” from Chapter 6).

This chapter covers the protocol catalog, the building blocks, and how to wire up a new protocol from scratch.

12.1 Protocol categories

Category	Examples	Count
Serialization	Protobuf, Avro, Thrift, Cap’n Proto, FlatBuffers, ASN.1, Bond, MsgPack	8
Data Schema	JSON Schema, XML/XSD, CSV/Table Schema, YAML, TOML, CDDL, INI, BSON	8
API	GraphQL, OpenAPI, AsyncAPI, RAML, JSON:API	5
Database	SQL, MongoDB, Cassandra, DynamoDB, Neo4j, Redis	6
Type System	TypeScript, Python, Rust, Java, Go, Swift, Kotlin, C#	8
Web/Document	ATProto, HTML, CSS, DOCX, ODF, Markdown, JSX, Vue, Svelte	9
Data Science	Parquet, Arrow, DataFrame	3
Domain	GeoJSON, FHIR, RSS/Atom, vCard/iCal, EDI X12, SWIFT MT	6
Config	HCL, K8s CRD, Docker Compose, CloudFormation, Ansible	5

Despite this diversity, every protocol follows the same pattern: compose building-block theories via structured merge, define edge rules and vertex kinds, and implement a parser/emitter pair. This factorization avoids combinatorial explosion.

12.2 Building-block theories

The theories module defines reusable GATs that protocols compose. Each function returns a standalone Theory value that can be registered in a theory registry and referenced by name.

12.2.1 Graph theories

The three graph theories form the foundation for schema representation:

pub fn th_constraint() -> Theory {
    Theory::new(
        "ThConstraint",
        vec![
            Sort::simple("Vertex"),
            Sort::dependent("Constraint", vec![SortParam::new("v", "Vertex")]),
        ],
        vec![Operation::unary("target", "c", "Constraint", "Vertex")],
        vec![],
    )
}

/// `ThMulti`: multigraph (parallel edges).
///
/// Sorts: `Vertex`, `Edge`, `EdgeLabel`.
/// Ops: `edge_label : Edge → EdgeLabel`.
#[must_use]
pub fn th_multi() -> Theory {
    Theory::new(
        "ThMulti",
        vec![
            Sort::simple("Vertex"),
            Sort::simple("Edge"),
            Sort::simple("EdgeLabel"),
        ],
        vec![Operation::unary("edge_label", "e", "Edge", "EdgeLabel")],
        vec![],
    )
}

/// `ThWType`: W-type instance theory (tree-shaped data).
///
/// Sorts: `Node`, `Arc`, `Value`.
/// Ops: `anchor`, `arc_src`, `arc_tgt`, `arc_edge`, `node_value`.
///
/// Note: `anchor : Node → Vertex` and `arc_edge : Arc → Edge` reference
/// schema sorts. These are identified via colimit when the instance
/// theory is composed with the schema theory.
#[must_use]
pub fn th_wtype() -> Theory {
    Theory::new(
        "ThWType",
        vec![
            Sort::simple("Node"),
            Sort::simple("Arc"),
            Sort::simple("Value"),
        ],
        vec![
            Operation::unary("anchor", "n", "Node", "Vertex"),
            Operation::unary("arc_src", "a", "Arc", "Node"),
            Operation::unary("arc_tgt", "a", "Arc", "Node"),
            Operation::unary("arc_edge", "a", "Arc", "Edge"),
            Operation::unary("node_value", "n", "Node", "Value"),
        ],
        vec![],
    )
}

/// `ThMeta`: metadata extension for W-type instances.
///
/// Sorts: `Node`, `Discriminator`, `ExtraField`.
/// Ops: `discriminator`, `extra_field`.
///
/// Note: `extra_field` outputs `Value`, which is also declared in `ThWType`.
/// These sorts are identified when `ThMeta` is composed with `ThWType` via
/// colimit over shared `{Node, Value}`.
#[must_use]
pub fn th_meta() -> Theory {
    Theory::new(
        "ThMeta",
        vec![
            Sort::simple("Node"),
            Sort::simple("Discriminator"),

th_graph is the most fundamental: a directed graph with Vertex and Edge sorts and src/tgt operations. Nearly every protocol uses this as its base. Vertices represent types or fields; edges represent relationships.

th_simple_graph adds an equation enforcing edge uniqueness by endpoints. Use this for protocols like Protobuf where each field name is unique within a message.

th_hypergraph introduces multi-arity connections via HyperEdge and Label sorts. A hyper-edge connects multiple vertices through labeled positions, modeling SQL table definitions (where a table relates multiple columns via labeled positions) and multi-argument function signatures.

12.2.2 Constraint theory

The constraint theory demonstrates the dependent sort pattern:

            Sort::simple("Value"),
        ],
        vec![
            Operation::unary("discriminator", "n", "Node", "Discriminator"),
            Operation::new(
                "extra_field",
                vec![
                    ("n".into(), "Node".into()),
                    ("key".into(), "ExtraField".into()),
                ],
                "Value",
            ),
        ],
        vec![],
    )
}

The Constraint sort is parameterized by a vertex via SortParam::new("v", "Vertex"). This GAT-level encoding means constraints are attached to specific schema elements. A maxLength constraint on one field is a different sort instance than a maxLength on another field.

This dependent sort has practical consequences:

The migration engine’s existence checker (Chapter 9) scopes constraint compatibility checks to individual vertices.
The breaking change detector (Chapter 11) attributes constraint tightening to the specific vertex affected.
The lens system (Chapter 10) tracks per-vertex constraint changes in the complement.

Note

The dependent sort Constraint(v: Vertex) ensures constraint-related operations are automatically scoped to the correct vertex, without runtime dispatch or downcasting.¹

Why Dependent Sorts Over Flat Constraints?

If constraints were a flat sort (not parameterized by vertex), the existence checker couldn’t distinguish “constraint C tightened on vertex A” from “constraint C tightened on vertex B.” The dependent sort makes scoping automatic.

12.2.3 Instance theories

Beyond schema theories, panproto defines instance theories:

th_wtype: W-type trees with Node, Arc, and Value sorts. Operations include anchor (nodes to schema vertices), arc_src/arc_tgt (tree edges), arc_edge (arc labels), and node_value (payload). Used by JSON-like, XML-like, and document protocols.
th_functor: set-valued functors with Table, Row, and ForeignKey sorts. Operations include table_vertex (tables to schema vertices) and fk_src/fk_tgt (foreign key endpoints). Used by relational and columnar protocols.
th_flat: flat key-value structures with Node, Field, and Value sorts. Operations include field_node (owning node), field_value (payload), and node_anchor (schema vertex). Used by IDL protocols like Protobuf where data has flat field structure without deep nesting.
th_meta: metadata extension adding Discriminator and ExtraField sorts. Composes with th_wtype via structured merge over the shared Node sort. Used by document protocols with mixed content (XML, HTML) where elements have both structured children and unstructured attributes.

12.3 The structured merge pattern

Each protocol is defined by composing building-block theories via structured merge, gluing theories together along shared sorts.² Shared sorts act as the meeting point where two theories agree on concepts.

For example, ATProto’s schema theory merges ThGraph, ThConstraint, and ThMulti, all sharing the Vertex sort:

graph TD
    ThGraph["ThGraph<br/>(Vertex, Edge, src, tgt)"]
    ThConstraint["ThConstraint<br/>(Vertex, Constraint)"]
    ThMulti["ThMulti<br/>(Vertex, Edge, EdgeLabel)"]
    SharedV["Shared: Vertex"]
    SharedVE["Shared: Vertex, Edge"]

    ThGraph -->|merge over| SharedV
    ThConstraint -->|merge over| SharedV
    SharedV --> GC["ThGraph + ThConstraint"]
    GC -->|merge over| SharedVE
    ThMulti -->|merge over| SharedVE
    SharedVE --> Schema["ATProto Schema Theory"]

    ThWType["ThWType<br/>(Node, Arc, Value)"] --> Instance["ATProto Instance Theory"]

The merge ensures that the Vertex sort in ThGraph and the Vertex sort in ThConstraint are identified: they refer to the same concept. The resulting schema theory has all sorts and operations from both components, with shared sorts unified.

12.3.1 Why this construction?

The structured merge precisely controls which sorts are identified and which remain distinct. Alternatives either identify too much or too little.³

In practice:

ThGraph and ThConstraint share Vertex but not Edge and Constraint. Edges and constraints are distinct concepts attached to the same vertex sort.
ThGraph and ThMulti share both Vertex and Edge. Multi-edge labels extend the same edge concept.
ThWType and ThMeta share Node. Metadata extends the node concept.

12.4 Theory groups

Protocols cluster into six theory groups based on their schema/instance theory pair. Each group has a registration helper:

Group	Schema Theory	Instance Theory	Helper
A	ThGraph + ThConstraint + ThMulti	ThWType	`register_constrained_multigraph_wtype`
B	ThHypergraph + ThConstraint	ThFunctor	`register_hypergraph_functor`
C	ThSimpleGraph + ThConstraint	ThFlat	`register_simple_graph_flat`
D	ThGraph + ThConstraint + ThMulti + ThInterface	ThWType	`register_typed_graph_wtype`
E	ThGraph + ThConstraint + ThMulti	ThWType + ThMeta	`register_multigraph_wtype_meta`

12.4.1 Group a: constrained multigraph + w-type

The largest group: protocols with tree-shaped data and potentially parallel edges. ATProto, JSON Schema, OpenAPI, MongoDB, YAML Schema, and many more belong here. The schema theory combines directed graphs (for type relationships), constraints (for validation rules), and multi-edge support (for fields sharing the same type). Instances are W-type trees.

12.4.2 Group b: hypergraph + Functor

Relational and columnar protocols: SQL, Cassandra, DynamoDB, Parquet, Arrow, DataFrame, CSV. The schema theory uses hypergraphs because tables are multi-arity relations (a table “connects” its columns). Instances are set-valued mappings.⁴

12.4.3 Group c: simple graph + flat

IDL and serialization protocols: Protobuf, Avro, Thrift, Cap’n Proto, FlatBuffers, ASN.1, Bond, HCL, Redis. These have flat field structures (no deep nesting within a single message/record) and enforce field uniqueness. Instances are flat key-value structures.

12.4.4 Group d: typed graph + w-type with interfaces

Type systems and API schemas with inheritance: GraphQL, TypeScript, Python, Rust, Java, Go, Swift, Kotlin, C#. Adds ThInterface to Group A, modeling inheritance constructs like GraphQL interfaces and Java abstract classes.

12.4.5 Group e: multigraph + w-type + metadata

Document formats with mixed content: XML/XSD, HTML, CSS, DOCX, ODF, Neo4j. Combines Group A’s schema theory with instance-level metadata support (discriminators, extra fields) for elements with both structured children and unstructured attributes.

Tip

When adding a new protocol, first determine which theory group it belongs to. If it fits an existing group, reuse the corresponding helper and only define edge rules, vertex kinds, and parser/emitter.

12.5 The protocol interface

Each protocol module exports a protocol() function returning a Protocol struct:

name: the protocol identifier (e.g., "atproto", "json_schema", "sql").
schema_theory and instance_theory: names referencing entries in the theory registry.
edge_rules: which edge kinds are structurally significant and what source/target vertex kinds they connect. Used by panproto-check to classify breaking changes.
obj_kinds: vertex kinds representing composite objects (as opposed to scalars like "string" or "integer"). Used during parsing and by the lens system.
constraint_sorts: which constraint sorts this protocol recognizes (e.g., "maxLength", "pattern", "maxGraphemes"). Used by the migration engine’s existence checker and the breaking change detector.

The theory names in the Protocol struct reference entries in the theory registry, populated by the helper functions. This indirection keeps protocol definitions declarative: a protocol names theories rather than constructing them.

12.6 The two-level presentation architecture

panproto’s two-parameter architecture (schema theory + instance theory) is mirrored by two crates that provide the presentation layer:⁵

12.6.1 Schema-level presentations (`panproto-protocols`)

Every protocol implements bidirectional schema-level conversion:

Parser: parse_*(input: &str) -> Result<Schema, ProtocolError> converts a native format specification (.proto file, JSON Schema, SQL DDL, GraphQL SDL) into panproto’s Schema representation.
Emitter: emit_*(schema: &Schema) -> Result<String, ProtocolError> converts a Schema back to native format.

Examples: parse_json_schema(json_str) parses a JSON Schema; emit_protobuf(schema) emits a .proto file; parse_sql(ddl_str) parses SQL DDL.

12.6.2 Instance-level presentations (`panproto-io`)

The panproto-io crate provides complementary instance-level conversions for all 76 protocols:

Parser: parse_wtype(protocol, schema, &bytes) converts raw data (HTML document, CoNLL-U file, Protobuf wire message) into a WInstance or FInstance.
Emitter: emit_wtype(protocol, schema, &instance) converts an instance back to raw format bytes.

The choice of WInstance vs FInstance is determined by the protocol’s instance_theory field, not preference.

12.6.3 The complete pipeline

Together, these levels complete the data migration pipeline:

raw spec ──protocols parse──→ Schema v1 ──mig compile──→ CompiledMigration
raw data ──io parse──→ Instance v1 ──inst restrict──→ Instance v2 ──io emit──→ raw data

The commutativity guarantee from Spivak 2012 ensures that parse, restrict, and emit compose correctly across all 76 protocols. Both levels support round-trip testing: parse(emit(parse(input))) should produce structurally equivalent results.

What Does “Commutativity” Buy Us?

It means we don’t need separate integration tests for every (source protocol, target protocol) pair. Correctness of the individual presentations plus the algebraic guarantee gives us correctness of the full pipeline.

12.6.4 Parsing strategy

Parsers typically follow a two-phase approach:

Syntactic parsing: use a format-specific parser (e.g., serde_json for JSON Schema, tree-sitter for Protobuf) to produce an AST.
Semantic translation: walk the AST, creating vertices for types and fields, edges for relationships, constraints for validation rules, and hyper-edges for multi-arity relations.

12.6.5 Emitting strategy

Emitters reverse the process:

Topology traversal: walk the schema graph, identifying root vertices, following edges in topological order.
Format construction: generate native syntax for each vertex (type declaration, field definition, constraint annotation) and edge (relationship, containment).

12.7 Adding a new Protocol

To add a new protocol, follow five steps:

12.7.1 Step 1: identify component theories

Determine which building-block theories your protocol needs. Most fit one of the five existing groups. Ask:

Does the protocol have directed edges between fields/types? Use ThGraph.
Does it support parallel edges (multiple fields with the same type)? Add ThMulti.
Does it have constraints (min/max length, required fields, patterns)? Add ThConstraint.
Does it have multi-arity relations (SQL foreign keys, table definitions)? Use ThHypergraph instead of ThGraph.
Does it support interfaces or inheritance? Add ThInterface.
Does it have flat field structures with no deep nesting? Use ThFlat as the instance theory.
Does it have tree-structured data? Use ThWType.
Does it have mixed content (structured + unstructured)? Add ThMeta to ThWType.

12.7.2 Step 2: compose via structured merge

If your protocol fits an existing group, call the corresponding helper:

theories::register_constrained_multigraph_wtype(
    registry,
    "ThMyProtocolSchema",
    "ThMyProtocolInstance",
);

If you need a novel combination, compose manually using panproto_gat::colimit(), specifying shared sorts at each step.

12.7.3 Step 3: define edge rules and vertex kinds

Create a protocol() function:

pub fn protocol() -> Protocol {
    Protocol {
        name: "my_protocol".into(),
        schema_theory: "ThMyProtocolSchema".into(),
        instance_theory: "ThMyProtocolInstance".into(),
        edge_rules: vec![/* ... */],
        obj_kinds: vec!["object".into(), "message".into()],
        constraint_sorts: vec!["maxLength".into()],
    }
}

12.7.4 Step 4: implement schema presentation (`panproto-protocols`)

If your protocol has a schema definition language, write parse_my_protocol to convert native format specifications into a Schema, and emit_my_protocol to convert back.

Some protocols (like CoNLL-U) have no separate schema language; the schema is implicit. These only need Step 4b.

12.7.5 Step 4b: implement instance presentation (`panproto-io`)

Write an instance codec by implementing the InstanceParser and InstanceEmitter traits. For most protocols:

JSON data: use JsonCodec::new("my_protocol"), delegating to the SIMD simd-json pathway.
XML data: use XmlCodec::new("my_protocol"), delegating to quick-xml.
Tab/delimited data: use TabularCodec::tsv("my_protocol", "table_name"), delegating to the memchr SIMD pathway.
Custom format: implement both traits directly using a dedicated parsing library.

Register the codec in the appropriate category module’s register_all function and add it to the default_registry test.

12.7.6 Step 5: add integration tests

Add tests at both levels:

Schema round-trip (panproto-protocols): parse a native specification, emit it back, verify it matches.
Instance round-trip (panproto-io): parse native data, emit it back, re-parse, verify structural equality. Use real-world fixture data.
Migration integration: verify that parse → restrict → emit produces valid output.
Breaking-change tests: verify classification logic for your protocol’s edge rules.

Warning

When composing theories via structured merge, the shared theory must contain exactly the sorts that should be identified. Sharing too few creates duplicate sorts; sharing too many over-constrains the composition. Review existing helper functions for correct sharing patterns.

12.7.7 Worked example: taskml Protocol

Suppose we’re defining a protocol for a task management markup language:

Tasks have fields (name, description, priority).
Tasks can have subtasks (nesting).
Fields have validation constraints (maxLength on name, enum for priority).
Field names are unique within a task.

Step 1: This maps to Group C (simple graph + flat). But subtasks imply nesting, requiring W-type trees. So it’s actually Group A (constrained multigraph + W-type), without multi-edge support since field names are unique. We use Group A as-is (multi-edge support is unused but harmless).

Step 2: Using Group A:

theories::register_constrained_multigraph_wtype(
    registry,
    "ThTaskMLSchema",
    "ThTaskMLInstance",
);

Step 3: Define the protocol:

pub fn protocol() -> Protocol {
    Protocol {
        name: "taskml".into(),
        schema_theory: "ThTaskMLSchema".into(),
        instance_theory: "ThTaskMLInstance".into(),
        edge_rules: vec![EdgeRule {
            edge_kind: "field".into(),
            src_kinds: vec!["task".into()],
            tgt_kinds: vec!["string".into(), "enum".into(), "integer".into()],
        }],
        obj_kinds: vec!["task".into()],
        constraint_sorts: vec!["maxLength".into(), "enumValues".into()],
    }
}

Steps 4 and 5: Implement parse_taskml and emit_taskml, then write round-trip and migration tests.

ThHypergraph vs ThGraph?

Use ThGraph for binary relations (one source, one target). Use ThHypergraph for $n$-ary relations (SQL foreign keys connecting multiple columns). The edge structure of your format’s native syntax usually makes this obvious.

12.8 Theoretical foundations

The structured merge pattern is backed by a formal guarantee: the merge preserves the equations and well-formedness conditions of both component theories. If ThGraph guarantees every edge has source and target vertices, and ThConstraint guarantees every constraint is attached to a vertex, then the merged theory inherits both guarantees.⁶

The panproto-protocols and panproto-io crates together implement two levels of the data pipeline: schema presentations map format specifications to abstract schema models, and instance presentations map format data to abstract instance models. This two-level factorization mirrors panproto’s two-parameter architecture and ensures correctness by construction.

For a hands-on guide to building your own protocol, see the tutorial.

Cartmell, John. 1986. “Generalised Algebraic Theories and Contextual Categories.” Annals of Pure and Applied Logic 32: 209–43. https://doi.org/10.1016/0168-0072(86)90053-9.

Lynch, Owen, Kris Brown, James Fairbanks, and Evan Patterson. 2024. “GATlab: Modeling and Programming with Generalized Algebraic Theories.” arXiv Preprint arXiv:2404.04837, ahead of print. https://doi.org/10.48550/arXiv.2404.04837.

In categorical literature, this is a fibered sort (Cartmell 1986)—a sort whose elements depend on another sort’s elements.↩︎
The merge glues theories along shared sorts, producing a combined theory where shared elements are unified. This is a colimit in the category of GATs.↩︎
The merge precisely controls which concepts are shared. In category theory, this is a colimit (specifically a pushout), the universal way to combine theories.↩︎
Each table corresponds to a schema vertex; each row is an element of that table. These table-shaped mappings are set-valued functors from the schema category to Set.↩︎
Each presentation layer converts between a native format and panproto’s internal representation. This conversion is a functor mapping syntactic objects to their algebraic models.↩︎
The structured merge identifies shared sorts along a shared sub-theory. This is a colimit (pushout) in the category of GATs (Cartmell 1986). The architecture follows Lynch et al. (2024), where protocols are presentations of algebraic theories and interoperability is mediated by structure-preserving maps (theory morphisms).↩︎

# panproto-protocols: Protocol Definitions {#sec-protocols} JSON Schema, Protobuf, SQL DDL, GraphQL SDL—these look nothing alike on the surface. But each describes a shape (a schema) and a way to fill that shape with data (an instance), and each can be modeled as a pair of GATs. The `panproto-protocols` crate is where all of those pairs live. It defines 60+ protocols across 9 categories, each assembled from the same small set of reusable building-block theories via colimit (the "structured merge" from @sec-gat-engine). This chapter covers the protocol catalog, the building blocks, and how to wire up a new protocol from scratch. ## Protocol categories | Category | Examples | Count | |---|---|---| | Serialization | Protobuf, Avro, Thrift, Cap'n Proto, FlatBuffers, ASN.1, Bond, MsgPack | 8 | | Data Schema | JSON Schema, XML/XSD, CSV/Table Schema, YAML, TOML, CDDL, INI, BSON | 8 | | API | GraphQL, OpenAPI, AsyncAPI, RAML, JSON:API | 5 | | Database | SQL, MongoDB, Cassandra, DynamoDB, Neo4j, Redis | 6 | | Type System | TypeScript, Python, Rust, Java, Go, Swift, Kotlin, C# | 8 | | Web/Document | ATProto, HTML, CSS, DOCX, ODF, Markdown, JSX, Vue, Svelte | 9 | | Data Science | Parquet, Arrow, DataFrame | 3 | | Domain | GeoJSON, FHIR, RSS/Atom, vCard/iCal, EDI X12, SWIFT MT | 6 | | Config | HCL, K8s CRD, Docker Compose, CloudFormation, Ansible | 5 | Despite this diversity, every protocol follows the same pattern: compose building-block theories via structured merge, define edge rules and vertex kinds, and implement a parser/emitter pair. This factorization avoids combinatorial explosion. ## Building-block theories The `theories` module defines reusable GATs that protocols compose. Each function returns a standalone `Theory` value that can be registered in a theory registry and referenced by name. ### Graph theories The three graph theories form the foundation for schema representation: ```{.rust include="../../crates/panproto-protocols/src/theories.rs" start-line=55 end-line=127} ``` **`th_graph`** is the most fundamental: a directed graph with `Vertex` and `Edge` sorts and `src`/`tgt` operations. Nearly every protocol uses this as its base. Vertices represent types or fields; edges represent relationships. **`th_simple_graph`** adds an equation enforcing edge uniqueness by endpoints. Use this for protocols like Protobuf where each field name is unique within a message. **`th_hypergraph`** introduces multi-arity connections via `HyperEdge` and `Label` sorts. A hyper-edge connects multiple vertices through labeled positions, modeling SQL table definitions (where a table relates multiple columns via labeled positions) and multi-argument function signatures. ### Constraint theory The constraint theory demonstrates the dependent sort pattern: ```{.rust include="../../crates/panproto-protocols/src/theories.rs" start-line=129 end-line=144} ``` The `Constraint` sort is parameterized by a vertex via `SortParam::new("v", "Vertex")`. This GAT-level encoding means constraints are attached to specific schema elements. A `maxLength` constraint on one field is a different sort instance than a `maxLength` on another field. This dependent sort has practical consequences: - The migration engine's existence checker (Chapter 9) scopes constraint compatibility checks to individual vertices. - The breaking change detector (Chapter 11) attributes constraint tightening to the specific vertex affected. - The lens system (Chapter 10) tracks per-vertex constraint changes in the complement. ::: {.callout-note} The dependent sort `Constraint(v: Vertex)` ensures constraint-related operations are automatically scoped to the correct vertex, without runtime dispatch or downcasting.[^fibered] [^fibered]: In categorical literature, this is a *fibered sort* [@cartmell1986]—a sort whose elements depend on another sort's elements. ::: ::: {.callout-tip} ## Why Dependent Sorts Over Flat Constraints? If constraints were a flat sort (not parameterized by vertex), the existence checker couldn't distinguish "constraint C tightened on vertex A" from "constraint C tightened on vertex B." The dependent sort makes scoping automatic. ::: ### Instance theories Beyond schema theories, panproto defines instance theories: - **`th_wtype`**: W-type trees with `Node`, `Arc`, and `Value` sorts. Operations include `anchor` (nodes to schema vertices), `arc_src`/`arc_tgt` (tree edges), `arc_edge` (arc labels), and `node_value` (payload). Used by JSON-like, XML-like, and document protocols. - **`th_functor`**: set-valued functors with `Table`, `Row`, and `ForeignKey` sorts. Operations include `table_vertex` (tables to schema vertices) and `fk_src`/`fk_tgt` (foreign key endpoints). Used by relational and columnar protocols. - **`th_flat`**: flat key-value structures with `Node`, `Field`, and `Value` sorts. Operations include `field_node` (owning node), `field_value` (payload), and `node_anchor` (schema vertex). Used by IDL protocols like Protobuf where data has flat field structure without deep nesting. - **`th_meta`**: metadata extension adding `Discriminator` and `ExtraField` sorts. Composes with `th_wtype` via structured merge over the shared `Node` sort. Used by document protocols with mixed content (XML, HTML) where elements have both structured children and unstructured attributes. ## The structured merge pattern Each protocol is defined by composing building-block theories via structured merge, gluing theories together along shared sorts.[^colimit-pattern] Shared sorts act as the meeting point where two theories agree on concepts. [^colimit-pattern]: The merge glues theories along shared sorts, producing a combined theory where shared elements are unified. This is a *colimit* in the category of GATs. For example, ATProto's schema theory merges `ThGraph`, `ThConstraint`, and `ThMulti`, all sharing the `Vertex` sort: ```{mermaid} graph TD ThGraph["ThGraph<br/>(Vertex, Edge, src, tgt)"] ThConstraint["ThConstraint<br/>(Vertex, Constraint)"] ThMulti["ThMulti<br/>(Vertex, Edge, EdgeLabel)"] SharedV["Shared: Vertex"] SharedVE["Shared: Vertex, Edge"] ThGraph -->|merge over| SharedV ThConstraint -->|merge over| SharedV SharedV --> GC["ThGraph + ThConstraint"] GC -->|merge over| SharedVE ThMulti -->|merge over| SharedVE SharedVE --> Schema["ATProto Schema Theory"] ThWType["ThWType<br/>(Node, Arc, Value)"] --> Instance["ATProto Instance Theory"] ``` The merge ensures that the `Vertex` sort in `ThGraph` and the `Vertex` sort in `ThConstraint` are identified: they refer to the same concept. The resulting schema theory has all sorts and operations from both components, with shared sorts unified. ### Why this construction? The structured merge precisely controls which sorts are identified and which remain distinct. Alternatives either identify too much or too little.[^colimit-why] [^colimit-why]: The merge precisely controls which concepts are shared. In category theory, this is a *colimit* (specifically a *pushout*), the universal way to combine theories. In practice: - `ThGraph` and `ThConstraint` share `Vertex` but not `Edge` and `Constraint`. Edges and constraints are distinct concepts attached to the same vertex sort. - `ThGraph` and `ThMulti` share both `Vertex` and `Edge`. Multi-edge labels extend the same edge concept. - `ThWType` and `ThMeta` share `Node`. Metadata extends the node concept. ## Theory groups Protocols cluster into six theory groups based on their schema/instance theory pair. Each group has a registration helper: | Group | Schema Theory | Instance Theory | Helper | |---|---|---|---| | A | ThGraph + ThConstraint + ThMulti | ThWType | `register_constrained_multigraph_wtype` | | B | ThHypergraph + ThConstraint | ThFunctor | `register_hypergraph_functor` | | C | ThSimpleGraph + ThConstraint | ThFlat | `register_simple_graph_flat` | | D | ThGraph + ThConstraint + ThMulti + ThInterface | ThWType | `register_typed_graph_wtype` | | E | ThGraph + ThConstraint + ThMulti | ThWType + ThMeta | `register_multigraph_wtype_meta` | ### Group a: constrained multigraph + w-type The largest group: protocols with tree-shaped data and potentially parallel edges. ATProto, JSON Schema, OpenAPI, MongoDB, YAML Schema, and many more belong here. The schema theory combines directed graphs (for type relationships), constraints (for validation rules), and multi-edge support (for fields sharing the same type). Instances are W-type trees. ### Group b: hypergraph + Functor Relational and columnar protocols: SQL, Cassandra, DynamoDB, Parquet, Arrow, DataFrame, CSV. The schema theory uses hypergraphs because tables are multi-arity relations (a table "connects" its columns). Instances are set-valued mappings.[^functor-group-b] [^functor-group-b]: Each table corresponds to a schema vertex; each row is an element of that table. These table-shaped mappings are *set-valued functors* from the schema category to **Set**. ### Group c: simple graph + flat IDL and serialization protocols: Protobuf, Avro, Thrift, Cap'n Proto, FlatBuffers, ASN.1, Bond, HCL, Redis. These have flat field structures (no deep nesting within a single message/record) and enforce field uniqueness. Instances are flat key-value structures. ### Group d: typed graph + w-type with interfaces Type systems and API schemas with inheritance: GraphQL, TypeScript, Python, Rust, Java, Go, Swift, Kotlin, C#. Adds `ThInterface` to Group A, modeling inheritance constructs like GraphQL interfaces and Java abstract classes. ### Group e: multigraph + w-type + metadata Document formats with mixed content: XML/XSD, HTML, CSS, DOCX, ODF, Neo4j. Combines Group A's schema theory with instance-level metadata support (discriminators, extra fields) for elements with both structured children and unstructured attributes. ::: {.callout-tip} When adding a new protocol, first determine which theory group it belongs to. If it fits an existing group, reuse the corresponding helper and only define edge rules, vertex kinds, and parser/emitter. ::: ## The protocol interface Each protocol module exports a `protocol()` function returning a `Protocol` struct: - **`name`**: the protocol identifier (e.g., `"atproto"`, `"json_schema"`, `"sql"`). - **`schema_theory`** and **`instance_theory`**: names referencing entries in the theory registry. - **`edge_rules`**: which edge kinds are structurally significant and what source/target vertex kinds they connect. Used by `panproto-check` to classify breaking changes. - **`obj_kinds`**: vertex kinds representing composite objects (as opposed to scalars like `"string"` or `"integer"`). Used during parsing and by the lens system. - **`constraint_sorts`**: which constraint sorts this protocol recognizes (e.g., `"maxLength"`, `"pattern"`, `"maxGraphemes"`). Used by the migration engine's existence checker and the breaking change detector. The theory names in the `Protocol` struct reference entries in the theory registry, populated by the helper functions. This indirection keeps protocol definitions declarative: a protocol names theories rather than constructing them. ## The two-level presentation architecture panproto's two-parameter architecture (schema theory + instance theory) is mirrored by two crates that provide the presentation layer:[^presentation-functor] [^presentation-functor]: Each presentation layer converts between a native format and panproto's internal representation. This conversion is a *functor* mapping syntactic objects to their algebraic models. ### Schema-level presentations (`panproto-protocols`) Every protocol implements bidirectional schema-level conversion: - **Parser**: `parse_*(input: &str) -> Result<Schema, ProtocolError>` converts a native format specification (`.proto` file, JSON Schema, SQL DDL, GraphQL SDL) into panproto's `Schema` representation. - **Emitter**: `emit_*(schema: &Schema) -> Result<String, ProtocolError>` converts a `Schema` back to native format. Examples: `parse_json_schema(json_str)` parses a JSON Schema; `emit_protobuf(schema)` emits a `.proto` file; `parse_sql(ddl_str)` parses SQL DDL. ### Instance-level presentations (`panproto-io`) The `panproto-io` crate provides complementary instance-level conversions for all 76 protocols: - **Parser**: `parse_wtype(protocol, schema, &bytes)` converts raw data (HTML document, CoNLL-U file, Protobuf wire message) into a `WInstance` or `FInstance`. - **Emitter**: `emit_wtype(protocol, schema, &instance)` converts an instance back to raw format bytes. The choice of `WInstance` vs `FInstance` is determined by the protocol's `instance_theory` field, not preference. ### The complete pipeline Together, these levels complete the data migration pipeline: ``` raw spec ──protocols parse──→ Schema v1 ──mig compile──→ CompiledMigration raw data ──io parse──→ Instance v1 ──inst restrict──→ Instance v2 ──io emit──→ raw data ``` The commutativity guarantee from Spivak 2012 ensures that parse, restrict, and emit compose correctly across all 76 protocols. Both levels support round-trip testing: `parse(emit(parse(input)))` should produce structurally equivalent results. ::: {.callout-note} ## What Does "Commutativity" Buy Us? It means we don't need separate integration tests for every (source protocol, target protocol) pair. Correctness of the individual presentations plus the algebraic guarantee gives us correctness of the full pipeline. ::: ### Parsing strategy Parsers typically follow a two-phase approach: 1. **Syntactic parsing**: use a format-specific parser (e.g., `serde_json` for JSON Schema, `tree-sitter` for Protobuf) to produce an AST. 2. **Semantic translation**: walk the AST, creating vertices for types and fields, edges for relationships, constraints for validation rules, and hyper-edges for multi-arity relations. ### Emitting strategy Emitters reverse the process: 1. **Topology traversal**: walk the schema graph, identifying root vertices, following edges in topological order. 2. **Format construction**: generate native syntax for each vertex (type declaration, field definition, constraint annotation) and edge (relationship, containment). ## Adding a new Protocol To add a new protocol, follow five steps: ### Step 1: identify component theories Determine which building-block theories your protocol needs. Most fit one of the five existing groups. Ask: - Does the protocol have directed edges between fields/types? Use `ThGraph`. - Does it support parallel edges (multiple fields with the same type)? Add `ThMulti`. - Does it have constraints (min/max length, required fields, patterns)? Add `ThConstraint`. - Does it have multi-arity relations (SQL foreign keys, table definitions)? Use `ThHypergraph` instead of `ThGraph`. - Does it support interfaces or inheritance? Add `ThInterface`. - Does it have flat field structures with no deep nesting? Use `ThFlat` as the instance theory. - Does it have tree-structured data? Use `ThWType`. - Does it have mixed content (structured + unstructured)? Add `ThMeta` to `ThWType`. ### Step 2: compose via structured merge If your protocol fits an existing group, call the corresponding helper: ```rust theories::register_constrained_multigraph_wtype( registry, "ThMyProtocolSchema", "ThMyProtocolInstance", ); ``` If you need a novel combination, compose manually using `panproto_gat::colimit()`, specifying shared sorts at each step. ### Step 3: define edge rules and vertex kinds Create a `protocol()` function: ```rust pub fn protocol() -> Protocol { Protocol { name: "my_protocol".into(), schema_theory: "ThMyProtocolSchema".into(), instance_theory: "ThMyProtocolInstance".into(), edge_rules: vec![/* ... */], obj_kinds: vec!["object".into(), "message".into()], constraint_sorts: vec!["maxLength".into()], } } ``` ### Step 4: implement schema presentation (`panproto-protocols`) If your protocol has a schema definition language, write `parse_my_protocol` to convert native format specifications into a `Schema`, and `emit_my_protocol` to convert back. Some protocols (like CoNLL-U) have no separate schema language; the schema is implicit. These only need Step 4b. ### Step 4b: implement instance presentation (`panproto-io`) Write an instance codec by implementing the `InstanceParser` and `InstanceEmitter` traits. For most protocols: - **JSON data**: use `JsonCodec::new("my_protocol")`, delegating to the SIMD `simd-json` pathway. - **XML data**: use `XmlCodec::new("my_protocol")`, delegating to `quick-xml`. - **Tab/delimited data**: use `TabularCodec::tsv("my_protocol", "table_name")`, delegating to the `memchr` SIMD pathway. - **Custom format**: implement both traits directly using a dedicated parsing library. Register the codec in the appropriate category module's `register_all` function and add it to the `default_registry` test. ### Step 5: add integration tests Add tests at both levels: - **Schema round-trip** (`panproto-protocols`): parse a native specification, emit it back, verify it matches. - **Instance round-trip** (`panproto-io`): parse native data, emit it back, re-parse, verify structural equality. Use real-world fixture data. - **Migration integration**: verify that `parse → restrict → emit` produces valid output. - **Breaking-change tests**: verify classification logic for your protocol's edge rules. ::: {.callout-warning} When composing theories via structured merge, the shared theory must contain exactly the sorts that should be identified. Sharing too few creates duplicate sorts; sharing too many over-constrains the composition. Review existing helper functions for correct sharing patterns. ::: ### Worked example: taskml Protocol Suppose we're defining a protocol for a task management markup language: - Tasks have fields (name, description, priority). - Tasks can have subtasks (nesting). - Fields have validation constraints (maxLength on name, enum for priority). - Field names are unique within a task. **Step 1**: This maps to Group C (simple graph + flat). But subtasks imply nesting, requiring W-type trees. So it's actually Group A (constrained multigraph + W-type), without multi-edge support since field names are unique. We use Group A as-is (multi-edge support is unused but harmless). **Step 2**: Using Group A: ```rust theories::register_constrained_multigraph_wtype( registry, "ThTaskMLSchema", "ThTaskMLInstance", ); ``` **Step 3**: Define the protocol: ```rust pub fn protocol() -> Protocol { Protocol { name: "taskml".into(), schema_theory: "ThTaskMLSchema".into(), instance_theory: "ThTaskMLInstance".into(), edge_rules: vec![EdgeRule { edge_kind: "field".into(), src_kinds: vec!["task".into()], tgt_kinds: vec!["string".into(), "enum".into(), "integer".into()], }], obj_kinds: vec!["task".into()], constraint_sorts: vec!["maxLength".into(), "enumValues".into()], } } ``` **Steps 4 and 5**: Implement `parse_taskml` and `emit_taskml`, then write round-trip and migration tests. ::: {.callout-tip} ## ThHypergraph vs ThGraph? Use `ThGraph` for binary relations (one source, one target). Use `ThHypergraph` for $n$-ary relations (SQL foreign keys connecting multiple columns). The edge structure of your format's native syntax usually makes this obvious. ::: ## Theoretical foundations The structured merge pattern is backed by a formal guarantee: the merge preserves the equations and well-formedness conditions of both component theories. If `ThGraph` guarantees every edge has source and target vertices, and `ThConstraint` guarantees every constraint is attached to a vertex, then the merged theory inherits both guarantees.[^theory-morphism] [^theory-morphism]: The structured merge identifies shared sorts along a shared sub-theory. This is a colimit (pushout) in the category of GATs [@cartmell1986]. The architecture follows @lynch2024, where protocols are presentations of algebraic theories and interoperability is mediated by structure-preserving maps (theory morphisms). The `panproto-protocols` and `panproto-io` crates together implement two levels of the data pipeline: schema presentations map format specifications to abstract schema models, and instance presentations map format data to abstract instance models. This two-level factorization mirrors panproto's two-parameter architecture and ensures correctness by construction. For a hands-on guide to building your own protocol, see the [tutorial](https://panproto.dev/tutorial/chapters/10-building-your-own-protocol.html).

12.1 Protocol categories

12.2 Building-block theories

12.2.1 Graph theories

12.2.2 Constraint theory

12.2.3 Instance theories

12.3 The structured merge pattern

12.3.1 Why this construction?

12.4 Theory groups

12.4.1 Group a: constrained multigraph + w-type

12.4.2 Group b: hypergraph + Functor

12.4.3 Group c: simple graph + flat

12.4.4 Group d: typed graph + w-type with interfaces

12.4.5 Group e: multigraph + w-type + metadata

12.5 The protocol interface

12.6 The two-level presentation architecture

12.6.1 Schema-level presentations (panproto-protocols)

12.6.2 Instance-level presentations (panproto-io)

12.6.3 The complete pipeline

12.6.4 Parsing strategy

12.6.5 Emitting strategy

12.7 Adding a new Protocol

12.7.1 Step 1: identify component theories

12.7.2 Step 2: compose via structured merge

12.7.3 Step 3: define edge rules and vertex kinds

12.7.4 Step 4: implement schema presentation (panproto-protocols)

12.7.5 Step 4b: implement instance presentation (panproto-io)

12.7.6 Step 5: add integration tests

12.7.7 Worked example: taskml Protocol

12.8 Theoretical foundations

12.6.1 Schema-level presentations (`panproto-protocols`)

12.6.2 Instance-level presentations (`panproto-io`)

12.7.4 Step 4: implement schema presentation (`panproto-protocols`)

12.7.5 Step 4b: implement instance presentation (`panproto-io`)