12  panproto-protocols: Protocol Definitions

JSON Schema, Protobuf, SQL DDL, GraphQL SDL—these look nothing alike on the surface. But each describes a shape (a schema) and a way to fill that shape with data (an instance), and each can be modeled as a pair of GATs. The panproto-protocols crate is where all of those pairs live. It defines 60+ protocols across 9 categories, each assembled from the same small set of reusable building-block theories via colimit (the “structured merge” from Chapter 6).

This chapter covers the protocol catalog, the building blocks, and how to wire up a new protocol from scratch.

12.1 Protocol categories

Category Examples Count
Serialization Protobuf, Avro, Thrift, Cap’n Proto, FlatBuffers, ASN.1, Bond, MsgPack 8
Data Schema JSON Schema, XML/XSD, CSV/Table Schema, YAML, TOML, CDDL, INI, BSON 8
API GraphQL, OpenAPI, AsyncAPI, RAML, JSON:API 5
Database SQL, MongoDB, Cassandra, DynamoDB, Neo4j, Redis 6
Type System TypeScript, Python, Rust, Java, Go, Swift, Kotlin, C# 8
Web/Document ATProto, HTML, CSS, DOCX, ODF, Markdown, JSX, Vue, Svelte 9
Data Science Parquet, Arrow, DataFrame 3
Domain GeoJSON, FHIR, RSS/Atom, vCard/iCal, EDI X12, SWIFT MT 6
Config HCL, K8s CRD, Docker Compose, CloudFormation, Ansible 5

Despite this diversity, every protocol follows the same pattern: compose building-block theories via structured merge, define edge rules and vertex kinds, and implement a parser/emitter pair. This factorization avoids combinatorial explosion.

12.2 Building-block theories

The theories module defines reusable GATs that protocols compose. Each function returns a standalone Theory value that can be registered in a theory registry and referenced by name.

12.2.1 Graph theories

The three graph theories form the foundation for schema representation:

pub fn th_constraint() -> Theory {
    Theory::new(
        "ThConstraint",
        vec![
            Sort::simple("Vertex"),
            Sort::dependent("Constraint", vec![SortParam::new("v", "Vertex")]),
        ],
        vec![Operation::unary("target", "c", "Constraint", "Vertex")],
        vec![],
    )
}

/// `ThMulti`: multigraph (parallel edges).
///
/// Sorts: `Vertex`, `Edge`, `EdgeLabel`.
/// Ops: `edge_label : Edge → EdgeLabel`.
#[must_use]
pub fn th_multi() -> Theory {
    Theory::new(
        "ThMulti",
        vec![
            Sort::simple("Vertex"),
            Sort::simple("Edge"),
            Sort::simple("EdgeLabel"),
        ],
        vec![Operation::unary("edge_label", "e", "Edge", "EdgeLabel")],
        vec![],
    )
}

/// `ThWType`: W-type instance theory (tree-shaped data).
///
/// Sorts: `Node`, `Arc`, `Value`.
/// Ops: `anchor`, `arc_src`, `arc_tgt`, `arc_edge`, `node_value`.
///
/// Note: `anchor : Node → Vertex` and `arc_edge : Arc → Edge` reference
/// schema sorts. These are identified via colimit when the instance
/// theory is composed with the schema theory.
#[must_use]
pub fn th_wtype() -> Theory {
    Theory::new(
        "ThWType",
        vec![
            Sort::simple("Node"),
            Sort::simple("Arc"),
            Sort::simple("Value"),
        ],
        vec![
            Operation::unary("anchor", "n", "Node", "Vertex"),
            Operation::unary("arc_src", "a", "Arc", "Node"),
            Operation::unary("arc_tgt", "a", "Arc", "Node"),
            Operation::unary("arc_edge", "a", "Arc", "Edge"),
            Operation::unary("node_value", "n", "Node", "Value"),
        ],
        vec![],
    )
}

/// `ThMeta`: metadata extension for W-type instances.
///
/// Sorts: `Node`, `Discriminator`, `ExtraField`.
/// Ops: `discriminator`, `extra_field`.
///
/// Note: `extra_field` outputs `Value`, which is also declared in `ThWType`.
/// These sorts are identified when `ThMeta` is composed with `ThWType` via
/// colimit over shared `{Node, Value}`.
#[must_use]
pub fn th_meta() -> Theory {
    Theory::new(
        "ThMeta",
        vec![
            Sort::simple("Node"),
            Sort::simple("Discriminator"),

th_graph is the most fundamental: a directed graph with Vertex and Edge sorts and src/tgt operations. Nearly every protocol uses this as its base. Vertices represent types or fields; edges represent relationships.

th_simple_graph adds an equation enforcing edge uniqueness by endpoints. Use this for protocols like Protobuf where each field name is unique within a message.

th_hypergraph introduces multi-arity connections via HyperEdge and Label sorts. A hyper-edge connects multiple vertices through labeled positions, modeling SQL table definitions (where a table relates multiple columns via labeled positions) and multi-argument function signatures.

12.2.2 Constraint theory

The constraint theory demonstrates the dependent sort pattern:

            Sort::simple("Value"),
        ],
        vec![
            Operation::unary("discriminator", "n", "Node", "Discriminator"),
            Operation::new(
                "extra_field",
                vec![
                    ("n".into(), "Node".into()),
                    ("key".into(), "ExtraField".into()),
                ],
                "Value",
            ),
        ],
        vec![],
    )
}

The Constraint sort is parameterized by a vertex via SortParam::new("v", "Vertex"). This GAT-level encoding means constraints are attached to specific schema elements. A maxLength constraint on one field is a different sort instance than a maxLength on another field.

This dependent sort has practical consequences:

  • The migration engine’s existence checker (Chapter 9) scopes constraint compatibility checks to individual vertices.
  • The breaking change detector (Chapter 11) attributes constraint tightening to the specific vertex affected.
  • The lens system (Chapter 10) tracks per-vertex constraint changes in the complement.
Note

The dependent sort Constraint(v: Vertex) ensures constraint-related operations are automatically scoped to the correct vertex, without runtime dispatch or downcasting.1

TipWhy Dependent Sorts Over Flat Constraints?

If constraints were a flat sort (not parameterized by vertex), the existence checker couldn’t distinguish “constraint C tightened on vertex A” from “constraint C tightened on vertex B.” The dependent sort makes scoping automatic.

12.2.3 Instance theories

Beyond schema theories, panproto defines instance theories:

  • th_wtype: W-type trees with Node, Arc, and Value sorts. Operations include anchor (nodes to schema vertices), arc_src/arc_tgt (tree edges), arc_edge (arc labels), and node_value (payload). Used by JSON-like, XML-like, and document protocols.

  • th_functor: set-valued functors with Table, Row, and ForeignKey sorts. Operations include table_vertex (tables to schema vertices) and fk_src/fk_tgt (foreign key endpoints). Used by relational and columnar protocols.

  • th_flat: flat key-value structures with Node, Field, and Value sorts. Operations include field_node (owning node), field_value (payload), and node_anchor (schema vertex). Used by IDL protocols like Protobuf where data has flat field structure without deep nesting.

  • th_meta: metadata extension adding Discriminator and ExtraField sorts. Composes with th_wtype via structured merge over the shared Node sort. Used by document protocols with mixed content (XML, HTML) where elements have both structured children and unstructured attributes.

12.3 The structured merge pattern

Each protocol is defined by composing building-block theories via structured merge, gluing theories together along shared sorts.2 Shared sorts act as the meeting point where two theories agree on concepts.

For example, ATProto’s schema theory merges ThGraph, ThConstraint, and ThMulti, all sharing the Vertex sort:

graph TD
    ThGraph["ThGraph<br/>(Vertex, Edge, src, tgt)"]
    ThConstraint["ThConstraint<br/>(Vertex, Constraint)"]
    ThMulti["ThMulti<br/>(Vertex, Edge, EdgeLabel)"]
    SharedV["Shared: Vertex"]
    SharedVE["Shared: Vertex, Edge"]

    ThGraph -->|merge over| SharedV
    ThConstraint -->|merge over| SharedV
    SharedV --> GC["ThGraph + ThConstraint"]
    GC -->|merge over| SharedVE
    ThMulti -->|merge over| SharedVE
    SharedVE --> Schema["ATProto Schema Theory"]

    ThWType["ThWType<br/>(Node, Arc, Value)"] --> Instance["ATProto Instance Theory"]

The merge ensures that the Vertex sort in ThGraph and the Vertex sort in ThConstraint are identified: they refer to the same concept. The resulting schema theory has all sorts and operations from both components, with shared sorts unified.

12.3.1 Why this construction?

The structured merge precisely controls which sorts are identified and which remain distinct. Alternatives either identify too much or too little.3

In practice:

  • ThGraph and ThConstraint share Vertex but not Edge and Constraint. Edges and constraints are distinct concepts attached to the same vertex sort.
  • ThGraph and ThMulti share both Vertex and Edge. Multi-edge labels extend the same edge concept.
  • ThWType and ThMeta share Node. Metadata extends the node concept.

12.4 Theory groups

Protocols cluster into six theory groups based on their schema/instance theory pair. Each group has a registration helper:

Group Schema Theory Instance Theory Helper
A ThGraph + ThConstraint + ThMulti ThWType register_constrained_multigraph_wtype
B ThHypergraph + ThConstraint ThFunctor register_hypergraph_functor
C ThSimpleGraph + ThConstraint ThFlat register_simple_graph_flat
D ThGraph + ThConstraint + ThMulti + ThInterface ThWType register_typed_graph_wtype
E ThGraph + ThConstraint + ThMulti ThWType + ThMeta register_multigraph_wtype_meta

12.4.1 Group a: constrained multigraph + w-type

The largest group: protocols with tree-shaped data and potentially parallel edges. ATProto, JSON Schema, OpenAPI, MongoDB, YAML Schema, and many more belong here. The schema theory combines directed graphs (for type relationships), constraints (for validation rules), and multi-edge support (for fields sharing the same type). Instances are W-type trees.

12.4.2 Group b: hypergraph + Functor

Relational and columnar protocols: SQL, Cassandra, DynamoDB, Parquet, Arrow, DataFrame, CSV. The schema theory uses hypergraphs because tables are multi-arity relations (a table “connects” its columns). Instances are set-valued mappings.4

12.4.3 Group c: simple graph + flat

IDL and serialization protocols: Protobuf, Avro, Thrift, Cap’n Proto, FlatBuffers, ASN.1, Bond, HCL, Redis. These have flat field structures (no deep nesting within a single message/record) and enforce field uniqueness. Instances are flat key-value structures.

12.4.4 Group d: typed graph + w-type with interfaces

Type systems and API schemas with inheritance: GraphQL, TypeScript, Python, Rust, Java, Go, Swift, Kotlin, C#. Adds ThInterface to Group A, modeling inheritance constructs like GraphQL interfaces and Java abstract classes.

12.4.5 Group e: multigraph + w-type + metadata

Document formats with mixed content: XML/XSD, HTML, CSS, DOCX, ODF, Neo4j. Combines Group A’s schema theory with instance-level metadata support (discriminators, extra fields) for elements with both structured children and unstructured attributes.

Tip

When adding a new protocol, first determine which theory group it belongs to. If it fits an existing group, reuse the corresponding helper and only define edge rules, vertex kinds, and parser/emitter.

12.5 The protocol interface

Each protocol module exports a protocol() function returning a Protocol struct:

  • name: the protocol identifier (e.g., "atproto", "json_schema", "sql").
  • schema_theory and instance_theory: names referencing entries in the theory registry.
  • edge_rules: which edge kinds are structurally significant and what source/target vertex kinds they connect. Used by panproto-check to classify breaking changes.
  • obj_kinds: vertex kinds representing composite objects (as opposed to scalars like "string" or "integer"). Used during parsing and by the lens system.
  • constraint_sorts: which constraint sorts this protocol recognizes (e.g., "maxLength", "pattern", "maxGraphemes"). Used by the migration engine’s existence checker and the breaking change detector.

The theory names in the Protocol struct reference entries in the theory registry, populated by the helper functions. This indirection keeps protocol definitions declarative: a protocol names theories rather than constructing them.

12.6 The two-level presentation architecture

panproto’s two-parameter architecture (schema theory + instance theory) is mirrored by two crates that provide the presentation layer:5

12.6.1 Schema-level presentations (panproto-protocols)

Every protocol implements bidirectional schema-level conversion:

  • Parser: parse_*(input: &str) -> Result<Schema, ProtocolError> converts a native format specification (.proto file, JSON Schema, SQL DDL, GraphQL SDL) into panproto’s Schema representation.
  • Emitter: emit_*(schema: &Schema) -> Result<String, ProtocolError> converts a Schema back to native format.

Examples: parse_json_schema(json_str) parses a JSON Schema; emit_protobuf(schema) emits a .proto file; parse_sql(ddl_str) parses SQL DDL.

12.6.2 Instance-level presentations (panproto-io)

The panproto-io crate provides complementary instance-level conversions for all 76 protocols:

  • Parser: parse_wtype(protocol, schema, &bytes) converts raw data (HTML document, CoNLL-U file, Protobuf wire message) into a WInstance or FInstance.
  • Emitter: emit_wtype(protocol, schema, &instance) converts an instance back to raw format bytes.

The choice of WInstance vs FInstance is determined by the protocol’s instance_theory field, not preference.

12.6.3 The complete pipeline

Together, these levels complete the data migration pipeline:

raw spec ──protocols parse──→ Schema v1 ──mig compile──→ CompiledMigration
raw data ──io parse──→ Instance v1 ──inst restrict──→ Instance v2 ──io emit──→ raw data

The commutativity guarantee from Spivak 2012 ensures that parse, restrict, and emit compose correctly across all 76 protocols. Both levels support round-trip testing: parse(emit(parse(input))) should produce structurally equivalent results.

NoteWhat Does “Commutativity” Buy Us?

It means we don’t need separate integration tests for every (source protocol, target protocol) pair. Correctness of the individual presentations plus the algebraic guarantee gives us correctness of the full pipeline.

12.6.4 Parsing strategy

Parsers typically follow a two-phase approach:

  1. Syntactic parsing: use a format-specific parser (e.g., serde_json for JSON Schema, tree-sitter for Protobuf) to produce an AST.
  2. Semantic translation: walk the AST, creating vertices for types and fields, edges for relationships, constraints for validation rules, and hyper-edges for multi-arity relations.

12.6.5 Emitting strategy

Emitters reverse the process:

  1. Topology traversal: walk the schema graph, identifying root vertices, following edges in topological order.
  2. Format construction: generate native syntax for each vertex (type declaration, field definition, constraint annotation) and edge (relationship, containment).

12.7 Adding a new Protocol

To add a new protocol, follow five steps:

12.7.1 Step 1: identify component theories

Determine which building-block theories your protocol needs. Most fit one of the five existing groups. Ask:

  • Does the protocol have directed edges between fields/types? Use ThGraph.
  • Does it support parallel edges (multiple fields with the same type)? Add ThMulti.
  • Does it have constraints (min/max length, required fields, patterns)? Add ThConstraint.
  • Does it have multi-arity relations (SQL foreign keys, table definitions)? Use ThHypergraph instead of ThGraph.
  • Does it support interfaces or inheritance? Add ThInterface.
  • Does it have flat field structures with no deep nesting? Use ThFlat as the instance theory.
  • Does it have tree-structured data? Use ThWType.
  • Does it have mixed content (structured + unstructured)? Add ThMeta to ThWType.

12.7.2 Step 2: compose via structured merge

If your protocol fits an existing group, call the corresponding helper:

theories::register_constrained_multigraph_wtype(
    registry,
    "ThMyProtocolSchema",
    "ThMyProtocolInstance",
);

If you need a novel combination, compose manually using panproto_gat::colimit(), specifying shared sorts at each step.

12.7.3 Step 3: define edge rules and vertex kinds

Create a protocol() function:

pub fn protocol() -> Protocol {
    Protocol {
        name: "my_protocol".into(),
        schema_theory: "ThMyProtocolSchema".into(),
        instance_theory: "ThMyProtocolInstance".into(),
        edge_rules: vec![/* ... */],
        obj_kinds: vec!["object".into(), "message".into()],
        constraint_sorts: vec!["maxLength".into()],
    }
}

12.7.4 Step 4: implement schema presentation (panproto-protocols)

If your protocol has a schema definition language, write parse_my_protocol to convert native format specifications into a Schema, and emit_my_protocol to convert back.

Some protocols (like CoNLL-U) have no separate schema language; the schema is implicit. These only need Step 4b.

12.7.5 Step 4b: implement instance presentation (panproto-io)

Write an instance codec by implementing the InstanceParser and InstanceEmitter traits. For most protocols:

  • JSON data: use JsonCodec::new("my_protocol"), delegating to the SIMD simd-json pathway.
  • XML data: use XmlCodec::new("my_protocol"), delegating to quick-xml.
  • Tab/delimited data: use TabularCodec::tsv("my_protocol", "table_name"), delegating to the memchr SIMD pathway.
  • Custom format: implement both traits directly using a dedicated parsing library.

Register the codec in the appropriate category module’s register_all function and add it to the default_registry test.

12.7.6 Step 5: add integration tests

Add tests at both levels:

  • Schema round-trip (panproto-protocols): parse a native specification, emit it back, verify it matches.
  • Instance round-trip (panproto-io): parse native data, emit it back, re-parse, verify structural equality. Use real-world fixture data.
  • Migration integration: verify that parse → restrict → emit produces valid output.
  • Breaking-change tests: verify classification logic for your protocol’s edge rules.
Warning

When composing theories via structured merge, the shared theory must contain exactly the sorts that should be identified. Sharing too few creates duplicate sorts; sharing too many over-constrains the composition. Review existing helper functions for correct sharing patterns.

12.7.7 Worked example: taskml Protocol

Suppose we’re defining a protocol for a task management markup language:

  • Tasks have fields (name, description, priority).
  • Tasks can have subtasks (nesting).
  • Fields have validation constraints (maxLength on name, enum for priority).
  • Field names are unique within a task.

Step 1: This maps to Group C (simple graph + flat). But subtasks imply nesting, requiring W-type trees. So it’s actually Group A (constrained multigraph + W-type), without multi-edge support since field names are unique. We use Group A as-is (multi-edge support is unused but harmless).

Step 2: Using Group A:

theories::register_constrained_multigraph_wtype(
    registry,
    "ThTaskMLSchema",
    "ThTaskMLInstance",
);

Step 3: Define the protocol:

pub fn protocol() -> Protocol {
    Protocol {
        name: "taskml".into(),
        schema_theory: "ThTaskMLSchema".into(),
        instance_theory: "ThTaskMLInstance".into(),
        edge_rules: vec![EdgeRule {
            edge_kind: "field".into(),
            src_kinds: vec!["task".into()],
            tgt_kinds: vec!["string".into(), "enum".into(), "integer".into()],
        }],
        obj_kinds: vec!["task".into()],
        constraint_sorts: vec!["maxLength".into(), "enumValues".into()],
    }
}

Steps 4 and 5: Implement parse_taskml and emit_taskml, then write round-trip and migration tests.

TipThHypergraph vs ThGraph?

Use ThGraph for binary relations (one source, one target). Use ThHypergraph for \(n\)-ary relations (SQL foreign keys connecting multiple columns). The edge structure of your format’s native syntax usually makes this obvious.

12.8 Theoretical foundations

The structured merge pattern is backed by a formal guarantee: the merge preserves the equations and well-formedness conditions of both component theories. If ThGraph guarantees every edge has source and target vertices, and ThConstraint guarantees every constraint is attached to a vertex, then the merged theory inherits both guarantees.6

The panproto-protocols and panproto-io crates together implement two levels of the data pipeline: schema presentations map format specifications to abstract schema models, and instance presentations map format data to abstract instance models. This two-level factorization mirrors panproto’s two-parameter architecture and ensures correctness by construction.

For a hands-on guide to building your own protocol, see the tutorial.

Cartmell, John. 1986. “Generalised Algebraic Theories and Contextual Categories.” Annals of Pure and Applied Logic 32: 209–43. https://doi.org/10.1016/0168-0072(86)90053-9.
Lynch, Owen, Kris Brown, James Fairbanks, and Evan Patterson. 2024. GATlab: Modeling and Programming with Generalized Algebraic Theories.” arXiv Preprint arXiv:2404.04837, ahead of print. https://doi.org/10.48550/arXiv.2404.04837.

  1. In categorical literature, this is a fibered sort (Cartmell 1986)—a sort whose elements depend on another sort’s elements.↩︎

  2. The merge glues theories along shared sorts, producing a combined theory where shared elements are unified. This is a colimit in the category of GATs.↩︎

  3. The merge precisely controls which concepts are shared. In category theory, this is a colimit (specifically a pushout), the universal way to combine theories.↩︎

  4. Each table corresponds to a schema vertex; each row is an element of that table. These table-shaped mappings are set-valued functors from the schema category to Set.↩︎

  5. Each presentation layer converts between a native format and panproto’s internal representation. This conversion is a functor mapping syntactic objects to their algebraic models.↩︎

  6. The structured merge identifies shared sorts along a shared sub-theory. This is a colimit (pushout) in the category of GATs (Cartmell 1986). The architecture follows Lynch et al. (2024), where protocols are presentations of algebraic theories and interoperability is mediated by structure-preserving maps (theory morphisms).↩︎