graph TD
ThGraph["ThGraph<br/>(Vertex, Edge, src, tgt)"]
ThConstraint["ThConstraint<br/>(Vertex, Constraint)"]
ThMulti["ThMulti<br/>(Vertex, Edge, EdgeLabel)"]
SharedV["Shared: Vertex"]
SharedVE["Shared: Vertex, Edge"]
ThGraph -->|merge over| SharedV
ThConstraint -->|merge over| SharedV
SharedV --> GC["ThGraph + ThConstraint"]
GC -->|merge over| SharedVE
ThMulti -->|merge over| SharedVE
SharedVE --> Schema["ATProto Schema Theory"]
ThWType["ThWType<br/>(Node, Arc, Value)"] --> Instance["ATProto Instance Theory"]
12 panproto-protocols: Protocol Definitions
JSON Schema, Protobuf, SQL DDL, GraphQL SDL—these look nothing alike on the surface. But each describes a shape (a schema) and a way to fill that shape with data (an instance), and each can be modeled as a pair of GATs. The panproto-protocols crate is where all of those pairs live. It defines 60+ protocols across 9 categories, each assembled from the same small set of reusable building-block theories via colimit (the “structured merge” from Chapter 6).
This chapter covers the protocol catalog, the building blocks, and how to wire up a new protocol from scratch.
12.1 Protocol categories
| Category | Examples | Count |
|---|---|---|
| Serialization | Protobuf, Avro, Thrift, Cap’n Proto, FlatBuffers, ASN.1, Bond, MsgPack | 8 |
| Data Schema | JSON Schema, XML/XSD, CSV/Table Schema, YAML, TOML, CDDL, INI, BSON | 8 |
| API | GraphQL, OpenAPI, AsyncAPI, RAML, JSON:API | 5 |
| Database | SQL, MongoDB, Cassandra, DynamoDB, Neo4j, Redis | 6 |
| Type System | TypeScript, Python, Rust, Java, Go, Swift, Kotlin, C# | 8 |
| Web/Document | ATProto, HTML, CSS, DOCX, ODF, Markdown, JSX, Vue, Svelte | 9 |
| Data Science | Parquet, Arrow, DataFrame | 3 |
| Domain | GeoJSON, FHIR, RSS/Atom, vCard/iCal, EDI X12, SWIFT MT | 6 |
| Config | HCL, K8s CRD, Docker Compose, CloudFormation, Ansible | 5 |
Despite this diversity, every protocol follows the same pattern: compose building-block theories via structured merge, define edge rules and vertex kinds, and implement a parser/emitter pair. This factorization avoids combinatorial explosion.
12.2 Building-block theories
The theories module defines reusable GATs that protocols compose. Each function returns a standalone Theory value that can be registered in a theory registry and referenced by name.
12.2.1 Graph theories
The three graph theories form the foundation for schema representation:
pub fn th_constraint() -> Theory {
Theory::new(
"ThConstraint",
vec![
Sort::simple("Vertex"),
Sort::dependent("Constraint", vec![SortParam::new("v", "Vertex")]),
],
vec![Operation::unary("target", "c", "Constraint", "Vertex")],
vec![],
)
}
/// `ThMulti`: multigraph (parallel edges).
///
/// Sorts: `Vertex`, `Edge`, `EdgeLabel`.
/// Ops: `edge_label : Edge → EdgeLabel`.
#[must_use]
pub fn th_multi() -> Theory {
Theory::new(
"ThMulti",
vec![
Sort::simple("Vertex"),
Sort::simple("Edge"),
Sort::simple("EdgeLabel"),
],
vec![Operation::unary("edge_label", "e", "Edge", "EdgeLabel")],
vec![],
)
}
/// `ThWType`: W-type instance theory (tree-shaped data).
///
/// Sorts: `Node`, `Arc`, `Value`.
/// Ops: `anchor`, `arc_src`, `arc_tgt`, `arc_edge`, `node_value`.
///
/// Note: `anchor : Node → Vertex` and `arc_edge : Arc → Edge` reference
/// schema sorts. These are identified via colimit when the instance
/// theory is composed with the schema theory.
#[must_use]
pub fn th_wtype() -> Theory {
Theory::new(
"ThWType",
vec![
Sort::simple("Node"),
Sort::simple("Arc"),
Sort::simple("Value"),
],
vec![
Operation::unary("anchor", "n", "Node", "Vertex"),
Operation::unary("arc_src", "a", "Arc", "Node"),
Operation::unary("arc_tgt", "a", "Arc", "Node"),
Operation::unary("arc_edge", "a", "Arc", "Edge"),
Operation::unary("node_value", "n", "Node", "Value"),
],
vec![],
)
}
/// `ThMeta`: metadata extension for W-type instances.
///
/// Sorts: `Node`, `Discriminator`, `ExtraField`.
/// Ops: `discriminator`, `extra_field`.
///
/// Note: `extra_field` outputs `Value`, which is also declared in `ThWType`.
/// These sorts are identified when `ThMeta` is composed with `ThWType` via
/// colimit over shared `{Node, Value}`.
#[must_use]
pub fn th_meta() -> Theory {
Theory::new(
"ThMeta",
vec![
Sort::simple("Node"),
Sort::simple("Discriminator"),th_graph is the most fundamental: a directed graph with Vertex and Edge sorts and src/tgt operations. Nearly every protocol uses this as its base. Vertices represent types or fields; edges represent relationships.
th_simple_graph adds an equation enforcing edge uniqueness by endpoints. Use this for protocols like Protobuf where each field name is unique within a message.
th_hypergraph introduces multi-arity connections via HyperEdge and Label sorts. A hyper-edge connects multiple vertices through labeled positions, modeling SQL table definitions (where a table relates multiple columns via labeled positions) and multi-argument function signatures.
12.2.2 Constraint theory
The constraint theory demonstrates the dependent sort pattern:
Sort::simple("Value"),
],
vec![
Operation::unary("discriminator", "n", "Node", "Discriminator"),
Operation::new(
"extra_field",
vec![
("n".into(), "Node".into()),
("key".into(), "ExtraField".into()),
],
"Value",
),
],
vec![],
)
}The Constraint sort is parameterized by a vertex via SortParam::new("v", "Vertex"). This GAT-level encoding means constraints are attached to specific schema elements. A maxLength constraint on one field is a different sort instance than a maxLength on another field.
This dependent sort has practical consequences:
- The migration engine’s existence checker (Chapter 9) scopes constraint compatibility checks to individual vertices.
- The breaking change detector (Chapter 11) attributes constraint tightening to the specific vertex affected.
- The lens system (Chapter 10) tracks per-vertex constraint changes in the complement.
The dependent sort Constraint(v: Vertex) ensures constraint-related operations are automatically scoped to the correct vertex, without runtime dispatch or downcasting.1
If constraints were a flat sort (not parameterized by vertex), the existence checker couldn’t distinguish “constraint C tightened on vertex A” from “constraint C tightened on vertex B.” The dependent sort makes scoping automatic.
12.2.3 Instance theories
Beyond schema theories, panproto defines instance theories:
th_wtype: W-type trees withNode,Arc, andValuesorts. Operations includeanchor(nodes to schema vertices),arc_src/arc_tgt(tree edges),arc_edge(arc labels), andnode_value(payload). Used by JSON-like, XML-like, and document protocols.th_functor: set-valued functors withTable,Row, andForeignKeysorts. Operations includetable_vertex(tables to schema vertices) andfk_src/fk_tgt(foreign key endpoints). Used by relational and columnar protocols.th_flat: flat key-value structures withNode,Field, andValuesorts. Operations includefield_node(owning node),field_value(payload), andnode_anchor(schema vertex). Used by IDL protocols like Protobuf where data has flat field structure without deep nesting.th_meta: metadata extension addingDiscriminatorandExtraFieldsorts. Composes withth_wtypevia structured merge over the sharedNodesort. Used by document protocols with mixed content (XML, HTML) where elements have both structured children and unstructured attributes.
12.3 The structured merge pattern
Each protocol is defined by composing building-block theories via structured merge, gluing theories together along shared sorts.2 Shared sorts act as the meeting point where two theories agree on concepts.
For example, ATProto’s schema theory merges ThGraph, ThConstraint, and ThMulti, all sharing the Vertex sort:
The merge ensures that the Vertex sort in ThGraph and the Vertex sort in ThConstraint are identified: they refer to the same concept. The resulting schema theory has all sorts and operations from both components, with shared sorts unified.
12.3.1 Why this construction?
The structured merge precisely controls which sorts are identified and which remain distinct. Alternatives either identify too much or too little.3
In practice:
ThGraphandThConstraintshareVertexbut notEdgeandConstraint. Edges and constraints are distinct concepts attached to the same vertex sort.ThGraphandThMultishare bothVertexandEdge. Multi-edge labels extend the same edge concept.ThWTypeandThMetashareNode. Metadata extends the node concept.
12.4 Theory groups
Protocols cluster into six theory groups based on their schema/instance theory pair. Each group has a registration helper:
| Group | Schema Theory | Instance Theory | Helper |
|---|---|---|---|
| A | ThGraph + ThConstraint + ThMulti | ThWType | register_constrained_multigraph_wtype |
| B | ThHypergraph + ThConstraint | ThFunctor | register_hypergraph_functor |
| C | ThSimpleGraph + ThConstraint | ThFlat | register_simple_graph_flat |
| D | ThGraph + ThConstraint + ThMulti + ThInterface | ThWType | register_typed_graph_wtype |
| E | ThGraph + ThConstraint + ThMulti | ThWType + ThMeta | register_multigraph_wtype_meta |
12.4.1 Group a: constrained multigraph + w-type
The largest group: protocols with tree-shaped data and potentially parallel edges. ATProto, JSON Schema, OpenAPI, MongoDB, YAML Schema, and many more belong here. The schema theory combines directed graphs (for type relationships), constraints (for validation rules), and multi-edge support (for fields sharing the same type). Instances are W-type trees.
12.4.2 Group b: hypergraph + Functor
Relational and columnar protocols: SQL, Cassandra, DynamoDB, Parquet, Arrow, DataFrame, CSV. The schema theory uses hypergraphs because tables are multi-arity relations (a table “connects” its columns). Instances are set-valued mappings.4
12.4.3 Group c: simple graph + flat
IDL and serialization protocols: Protobuf, Avro, Thrift, Cap’n Proto, FlatBuffers, ASN.1, Bond, HCL, Redis. These have flat field structures (no deep nesting within a single message/record) and enforce field uniqueness. Instances are flat key-value structures.
12.4.4 Group d: typed graph + w-type with interfaces
Type systems and API schemas with inheritance: GraphQL, TypeScript, Python, Rust, Java, Go, Swift, Kotlin, C#. Adds ThInterface to Group A, modeling inheritance constructs like GraphQL interfaces and Java abstract classes.
12.4.5 Group e: multigraph + w-type + metadata
Document formats with mixed content: XML/XSD, HTML, CSS, DOCX, ODF, Neo4j. Combines Group A’s schema theory with instance-level metadata support (discriminators, extra fields) for elements with both structured children and unstructured attributes.
When adding a new protocol, first determine which theory group it belongs to. If it fits an existing group, reuse the corresponding helper and only define edge rules, vertex kinds, and parser/emitter.
12.5 The protocol interface
Each protocol module exports a protocol() function returning a Protocol struct:
name: the protocol identifier (e.g.,"atproto","json_schema","sql").schema_theoryandinstance_theory: names referencing entries in the theory registry.edge_rules: which edge kinds are structurally significant and what source/target vertex kinds they connect. Used bypanproto-checkto classify breaking changes.obj_kinds: vertex kinds representing composite objects (as opposed to scalars like"string"or"integer"). Used during parsing and by the lens system.constraint_sorts: which constraint sorts this protocol recognizes (e.g.,"maxLength","pattern","maxGraphemes"). Used by the migration engine’s existence checker and the breaking change detector.
The theory names in the Protocol struct reference entries in the theory registry, populated by the helper functions. This indirection keeps protocol definitions declarative: a protocol names theories rather than constructing them.
12.6 The two-level presentation architecture
panproto’s two-parameter architecture (schema theory + instance theory) is mirrored by two crates that provide the presentation layer:5
12.6.1 Schema-level presentations (panproto-protocols)
Every protocol implements bidirectional schema-level conversion:
- Parser:
parse_*(input: &str) -> Result<Schema, ProtocolError>converts a native format specification (.protofile, JSON Schema, SQL DDL, GraphQL SDL) into panproto’sSchemarepresentation. - Emitter:
emit_*(schema: &Schema) -> Result<String, ProtocolError>converts aSchemaback to native format.
Examples: parse_json_schema(json_str) parses a JSON Schema; emit_protobuf(schema) emits a .proto file; parse_sql(ddl_str) parses SQL DDL.
12.6.2 Instance-level presentations (panproto-io)
The panproto-io crate provides complementary instance-level conversions for all 76 protocols:
- Parser:
parse_wtype(protocol, schema, &bytes)converts raw data (HTML document, CoNLL-U file, Protobuf wire message) into aWInstanceorFInstance. - Emitter:
emit_wtype(protocol, schema, &instance)converts an instance back to raw format bytes.
The choice of WInstance vs FInstance is determined by the protocol’s instance_theory field, not preference.
12.6.3 The complete pipeline
Together, these levels complete the data migration pipeline:
raw spec ──protocols parse──→ Schema v1 ──mig compile──→ CompiledMigration
raw data ──io parse──→ Instance v1 ──inst restrict──→ Instance v2 ──io emit──→ raw data
The commutativity guarantee from Spivak 2012 ensures that parse, restrict, and emit compose correctly across all 76 protocols. Both levels support round-trip testing: parse(emit(parse(input))) should produce structurally equivalent results.
It means we don’t need separate integration tests for every (source protocol, target protocol) pair. Correctness of the individual presentations plus the algebraic guarantee gives us correctness of the full pipeline.
12.6.4 Parsing strategy
Parsers typically follow a two-phase approach:
- Syntactic parsing: use a format-specific parser (e.g.,
serde_jsonfor JSON Schema,tree-sitterfor Protobuf) to produce an AST. - Semantic translation: walk the AST, creating vertices for types and fields, edges for relationships, constraints for validation rules, and hyper-edges for multi-arity relations.
12.6.5 Emitting strategy
Emitters reverse the process:
- Topology traversal: walk the schema graph, identifying root vertices, following edges in topological order.
- Format construction: generate native syntax for each vertex (type declaration, field definition, constraint annotation) and edge (relationship, containment).
12.7 Adding a new Protocol
To add a new protocol, follow five steps:
12.7.1 Step 1: identify component theories
Determine which building-block theories your protocol needs. Most fit one of the five existing groups. Ask:
- Does the protocol have directed edges between fields/types? Use
ThGraph. - Does it support parallel edges (multiple fields with the same type)? Add
ThMulti. - Does it have constraints (min/max length, required fields, patterns)? Add
ThConstraint. - Does it have multi-arity relations (SQL foreign keys, table definitions)? Use
ThHypergraphinstead ofThGraph. - Does it support interfaces or inheritance? Add
ThInterface. - Does it have flat field structures with no deep nesting? Use
ThFlatas the instance theory. - Does it have tree-structured data? Use
ThWType. - Does it have mixed content (structured + unstructured)? Add
ThMetatoThWType.
12.7.2 Step 2: compose via structured merge
If your protocol fits an existing group, call the corresponding helper:
theories::register_constrained_multigraph_wtype(
registry,
"ThMyProtocolSchema",
"ThMyProtocolInstance",
);If you need a novel combination, compose manually using panproto_gat::colimit(), specifying shared sorts at each step.
12.7.3 Step 3: define edge rules and vertex kinds
Create a protocol() function:
pub fn protocol() -> Protocol {
Protocol {
name: "my_protocol".into(),
schema_theory: "ThMyProtocolSchema".into(),
instance_theory: "ThMyProtocolInstance".into(),
edge_rules: vec![/* ... */],
obj_kinds: vec!["object".into(), "message".into()],
constraint_sorts: vec!["maxLength".into()],
}
}12.7.4 Step 4: implement schema presentation (panproto-protocols)
If your protocol has a schema definition language, write parse_my_protocol to convert native format specifications into a Schema, and emit_my_protocol to convert back.
Some protocols (like CoNLL-U) have no separate schema language; the schema is implicit. These only need Step 4b.
12.7.5 Step 4b: implement instance presentation (panproto-io)
Write an instance codec by implementing the InstanceParser and InstanceEmitter traits. For most protocols:
- JSON data: use
JsonCodec::new("my_protocol"), delegating to the SIMDsimd-jsonpathway. - XML data: use
XmlCodec::new("my_protocol"), delegating toquick-xml. - Tab/delimited data: use
TabularCodec::tsv("my_protocol", "table_name"), delegating to thememchrSIMD pathway. - Custom format: implement both traits directly using a dedicated parsing library.
Register the codec in the appropriate category module’s register_all function and add it to the default_registry test.
12.7.6 Step 5: add integration tests
Add tests at both levels:
- Schema round-trip (
panproto-protocols): parse a native specification, emit it back, verify it matches. - Instance round-trip (
panproto-io): parse native data, emit it back, re-parse, verify structural equality. Use real-world fixture data. - Migration integration: verify that
parse → restrict → emitproduces valid output. - Breaking-change tests: verify classification logic for your protocol’s edge rules.
When composing theories via structured merge, the shared theory must contain exactly the sorts that should be identified. Sharing too few creates duplicate sorts; sharing too many over-constrains the composition. Review existing helper functions for correct sharing patterns.
12.7.7 Worked example: taskml Protocol
Suppose we’re defining a protocol for a task management markup language:
- Tasks have fields (name, description, priority).
- Tasks can have subtasks (nesting).
- Fields have validation constraints (maxLength on name, enum for priority).
- Field names are unique within a task.
Step 1: This maps to Group C (simple graph + flat). But subtasks imply nesting, requiring W-type trees. So it’s actually Group A (constrained multigraph + W-type), without multi-edge support since field names are unique. We use Group A as-is (multi-edge support is unused but harmless).
Step 2: Using Group A:
theories::register_constrained_multigraph_wtype(
registry,
"ThTaskMLSchema",
"ThTaskMLInstance",
);Step 3: Define the protocol:
pub fn protocol() -> Protocol {
Protocol {
name: "taskml".into(),
schema_theory: "ThTaskMLSchema".into(),
instance_theory: "ThTaskMLInstance".into(),
edge_rules: vec![EdgeRule {
edge_kind: "field".into(),
src_kinds: vec!["task".into()],
tgt_kinds: vec!["string".into(), "enum".into(), "integer".into()],
}],
obj_kinds: vec!["task".into()],
constraint_sorts: vec!["maxLength".into(), "enumValues".into()],
}
}Steps 4 and 5: Implement parse_taskml and emit_taskml, then write round-trip and migration tests.
Use ThGraph for binary relations (one source, one target). Use ThHypergraph for \(n\)-ary relations (SQL foreign keys connecting multiple columns). The edge structure of your format’s native syntax usually makes this obvious.
12.8 Theoretical foundations
The structured merge pattern is backed by a formal guarantee: the merge preserves the equations and well-formedness conditions of both component theories. If ThGraph guarantees every edge has source and target vertices, and ThConstraint guarantees every constraint is attached to a vertex, then the merged theory inherits both guarantees.6
The panproto-protocols and panproto-io crates together implement two levels of the data pipeline: schema presentations map format specifications to abstract schema models, and instance presentations map format data to abstract instance models. This two-level factorization mirrors panproto’s two-parameter architecture and ensures correctness by construction.
For a hands-on guide to building your own protocol, see the tutorial.
In categorical literature, this is a fibered sort (Cartmell 1986)—a sort whose elements depend on another sort’s elements.↩︎
The merge glues theories along shared sorts, producing a combined theory where shared elements are unified. This is a colimit in the category of GATs.↩︎
The merge precisely controls which concepts are shared. In category theory, this is a colimit (specifically a pushout), the universal way to combine theories.↩︎
Each table corresponds to a schema vertex; each row is an element of that table. These table-shaped mappings are set-valued functors from the schema category to Set.↩︎
Each presentation layer converts between a native format and panproto’s internal representation. This conversion is a functor mapping syntactic objects to their algebraic models.↩︎
The structured merge identifies shared sorts along a shared sub-theory. This is a colimit (pushout) in the category of GATs (Cartmell 1986). The architecture follows Lynch et al. (2024), where protocols are presentations of algebraic theories and interoperability is mediated by structure-preserving maps (theory morphisms).↩︎