7 panproto-schema: Schemas and Protocols

A schema is a model of a schema theory in \(\mathbf{Set}\): it sends each sort to a finite set of elements and each operation to a function between those sets. For the graph schema theory with sorts \(\mathrm{Vertex}\) and \(\mathrm{Edge}\) and operations \(\mathrm{src}, \mathrm{tgt}: \mathrm{Edge} \to \mathrm{Vertex}\), a schema assigns concrete vertex and edge sets and concrete source/target functions. The Rust representation is a labeled directed hypergraph.

panproto-schema is a Level 1 crate. For the user-facing explanation of schemas, see the tutorial.

7.1 Schema elements

7.1.1 Math

The four element types correspond to sorts in the schema theory GAT:

Table 7.1: How GAT sorts map to Schema fields.

GAT Sort	Schema field	Interpretation
\(\mathrm{Vertex}\)	`vertices: HashMap<Name, Vertex>`	Nodes of the hypergraph
\(\mathrm{Edge}\)	`edges: HashMap<Edge, Name>`	Directed binary edges
\(\mathrm{HyperEdge}\)	`hyper_edges: HashMap<Name, HyperEdge>`	Multi-arity connections
\(\mathrm{Constraint}\)	`constraints: HashMap<Name, Vec<Constraint>>`	Value restrictions

The operations of the GAT (e.g., \(\mathrm{src}: \mathrm{Edge} \to \mathrm{Vertex}\), \(\mathrm{tgt}: \mathrm{Edge} \to \mathrm{Vertex}\)) are implicit in the Edge struct’s src and tgt fields. Rather than storing operations as separate function objects, the schema stores each operation’s result as a field of the operation’s input struct.
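The following is a minimal sketch of that encoding, not the crate’s code. It uses plain String IDs in place of the crate’s Name type, and the free functions src, tgt, is_model, and example are hypothetical helpers that exist only to show that the GAT operations can be read back off the struct fields.

```rust
use std::collections::HashSet;

/// Illustrative edge: the results of `src` and `tgt` are stored as fields.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Edge {
    src: String,
    tgt: String,
}

/// The GAT operation src: Edge -> Vertex, read off the field.
fn src(e: &Edge) -> &str {
    &e.src
}

/// The GAT operation tgt: Edge -> Vertex, read off the field.
fn tgt(e: &Edge) -> &str {
    &e.tgt
}

/// The assignment is a model only if both operations land in the vertex set.
fn is_model(vertices: &HashSet<String>, edges: &[Edge]) -> bool {
    edges
        .iter()
        .all(|e| vertices.contains(src(e)) && vertices.contains(tgt(e)))
}

/// Hypothetical example: two vertices and one edge between them.
fn example() -> (HashSet<String>, Vec<Edge>) {
    let vertices: HashSet<String> =
        ["post", "post:body"].iter().map(|s| s.to_string()).collect();
    let edges = vec![Edge {
        src: "post".to_string(),
        tgt: "post:body".to_string(),
    }];
    (vertices, edges)
}
```

An edge whose tgt names a vertex outside the set makes is_model return false, which is the situation the builder’s endpoint check is meant to rule out.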

7.1.2 Rust


/// A schema vertex.
///
/// Each vertex has a unique `id`, a `kind` drawn from the protocol's
/// recognized vertex kinds, and an optional NSID (namespace identifier).
#[derive(Clone, Debug, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct Vertex {
    /// Unique vertex identifier within the schema.
    pub id: Name,
    /// The vertex kind (e.g., `"record"`, `"object"`, `"string"`).
    pub kind: Name,
    /// Optional namespace identifier (e.g., `"app.bsky.feed.post"`).
    pub nsid: Option<Name>,
}

/// A binary edge between two vertices.
///
/// Edges are directed: they go from `src` to `tgt`. The `kind` determines
/// the structural role (e.g., `"prop"`, `"record-schema"`), and `name`
/// provides an optional label (e.g., the property name).
#[derive(Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord, Serialize, Deserialize)]
pub struct Edge {
    /// Source vertex ID.
    pub src: Name,
    /// Target vertex ID.
    pub tgt: Name,
    /// Edge kind (e.g., `"prop"`, `"record-schema"`).
    pub kind: Name,
    /// Optional edge label (e.g., a property name like `"text"`).
    pub name: Option<Name>,
}

/// A hyper-edge (present only when the schema theory includes `ThHypergraph`).
///
/// Hyper-edges connect multiple vertices via a labeled signature.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize)]
pub struct HyperEdge {
    /// Unique hyper-edge identifier.
    pub id: Name,
    /// Hyper-edge kind.
    pub kind: Name,
    /// Maps label names to vertex IDs.
    pub signature: HashMap<Name, Name>,
    /// The label that identifies the parent vertex.
    pub parent_label: Name,
}

/// A constraint on a vertex.
///
/// Constraints restrict the values a vertex can hold (e.g., maximum
/// string length, format pattern).
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub struct Constraint {
    /// The constraint sort (e.g., `"maxLength"`, `"format"`).
    pub sort: Name,
    /// The constraint value (e.g., `"3000"`, `"at-uri"`).
    pub value: String,
}

7.1.3 Element details

Vertex. Every vertex has a unique id, a kind (drawn from the protocol’s recognized set), and an optional NSID (namespace identifier). The kind field is the discriminator that edge rules use to determine which connections are legal. In the AT Protocol, a vertex of kind "record" can have a "record-schema" edge to a vertex of kind "object", but not to a vertex of kind "string".

Edge. Binary edges are the workhorses of schema structure. Each edge is directed (src to tgt), has a kind that determines its structural role, and may carry an optional name label. Property edges use the name to encode the field name (e.g., "text", "createdAt").

HyperEdge. Hyper-edges generalize binary edges to connect multiple vertices via a labeled signature. They appear only in schema theories that include ThHypergraph. Each hyper-edge has a parent_label that identifies which vertex in its signature serves as the parent in the instance tree.

Constraint. Constraints restrict the values a vertex can hold. They’re typed by a sort (e.g., "maxLength", "format") and carry a string value. Constraints aren’t validated during schema construction; the validate module checks them against the protocol’s constraint_sorts after the schema is built.
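To make the role of parent_label in a hyper-edge concrete, here is a small sketch. It assumes String in place of Name; parent_vertex, example, and the foreign-key-style labels "from" and "to" are hypothetical and do not come from the crate.

```rust
use std::collections::HashMap;

/// Illustrative hyper-edge with `String` in place of the crate's `Name`.
struct HyperEdge {
    id: String,
    kind: String,
    /// Maps label names to vertex IDs.
    signature: HashMap<String, String>,
    /// The label whose vertex acts as the parent.
    parent_label: String,
}

/// Hypothetical helper: resolve the parent vertex ID through the signature.
/// Returns `None` when `parent_label` is not a key of the signature.
fn parent_vertex(he: &HyperEdge) -> Option<&str> {
    he.signature.get(&he.parent_label).map(String::as_str)
}

/// Hypothetical example: a hyper-edge with two labeled participants.
fn example() -> HyperEdge {
    let mut signature = HashMap::new();
    signature.insert("from".to_string(), "orders".to_string());
    signature.insert("to".to_string(), "users".to_string());
    HyperEdge {
        id: "orders_user_fk".to_string(),
        kind: "fk".to_string(),
        signature,
        parent_label: "from".to_string(),
    }
}
```

The parent is found by one map lookup, so a hyper-edge whose parent_label is missing from its signature is detectable without inspecting the rest of the schema.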

Why are constraints validated post-build rather than in the builder?

Constraint validation requires the complete schema (e.g., checking that a constraint’s target vertex exists). The builder validates local invariants (per-vertex, per-edge). The validator checks global invariants (connectivity, completeness, constraint sorts). Splitting these ensures the builder stays simple and the validator has full context.

7.2 The schema struct


/// Specification of a coercion between two value kinds.
///
/// Contains the forward coercion expression, an optional inverse for
/// round-tripping, and the coercion class classifying the round-trip behavior.
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub struct CoercionSpec {
    /// Forward coercion expression (source to target).
    pub forward: panproto_expr::Expr,
    /// Inverse coercion expression (target to source) for the `put` direction.
    pub inverse: Option<panproto_expr::Expr>,
    /// Round-trip classification.
    pub class: panproto_gat::CoercionClass,
}

/// A schema: a model of the protocol's schema theory.
///
/// Contains both the raw data (vertices, edges, constraints, etc.) and
/// precomputed adjacency indices for efficient graph traversal.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Schema {
    /// The protocol this schema belongs to.
    pub protocol: String,
    /// Vertices keyed by their ID.
    pub vertices: HashMap<Name, Vertex>,
    /// Edges keyed by the edge itself, value is the edge kind.
    #[serde(with = "crate::serde_helpers::map_as_vec")]
    pub edges: HashMap<Edge, Name>,
    /// Hyper-edges keyed by their ID.
    pub hyper_edges: HashMap<Name, HyperEdge>,
    /// Constraints per vertex ID.
    pub constraints: HashMap<Name, Vec<Constraint>>,
    /// Required edges per vertex ID.
    pub required: HashMap<Name, Vec<Edge>>,
    /// NSID mapping: vertex ID to NSID string.
    pub nsids: HashMap<Name, Name>,

    /// Coproduct variants per union vertex ID.
    #[serde(default)]
    pub variants: HashMap<Name, Vec<Variant>>,
    /// Edge ordering positions (edge → position index).
    #[serde(default, with = "crate::serde_helpers::map_as_vec_default")]
    pub orderings: HashMap<Edge, u32>,
    /// Recursion points (fixpoint markers).
    #[serde(default)]
    pub recursion_points: HashMap<Name, RecursionPoint>,
    /// Spans connecting pairs of vertices.
    #[serde(default)]
    pub spans: HashMap<Name, Span>,

The adjacency indices (outgoing, incoming, between) are HashMaps from vertex IDs (or vertex-pair tuples) to SmallVec<Edge, N>. Using SmallVec avoids heap allocation for vertices with few neighbors, which is the common case. Most schema vertices have between one and four outgoing edges.

The indices are computed once at build time and never modified afterwards. A neighbor lookup is therefore a single hash-map access, \(O(1)\) on average, which supports the fast graph traversals that the migration and validation pipelines require.

7.3 Protocols

7.3.1 Math

A protocol \(P = (T_{\mathrm{schema}}, T_{\mathrm{inst}}, R)\) pairs a schema theory \(T_{\mathrm{schema}}\) with an instance theory \(T_{\mathrm{inst}}\) and a set of edge rules \(R\). The two theories are the two parameters of the two-parameter architecture described in Chapter 5; the edge rules add structural well-formedness conditions on top of them.

7.3.2 Rust


/// A well-formedness rule for edges of a given kind.
///
/// When `src_kinds` is non-empty, only vertices whose kind appears in the
/// list may serve as the source of an edge of this kind. An empty list
/// means any vertex kind is allowed. The same applies to `tgt_kinds`.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize)]
pub struct EdgeRule {
    /// The edge kind this rule governs (e.g., `"prop"`, `"record-schema"`).
    pub edge_kind: String,
    /// Permitted source vertex kinds (empty = any).
    pub src_kinds: Vec<String>,
    /// Permitted target vertex kinds (empty = any).
    pub tgt_kinds: Vec<String>,
}

/// Identifies the schema and instance theories for a data-format protocol,
/// together with structural well-formedness rules.
///
/// Protocols are the Level-1 configuration objects that drive schema
/// construction and validation. Each protocol names a schema theory GAT
/// and an instance theory GAT (both defined in `panproto-protocols`),
/// and supplies edge rules, recognized vertex kinds, and constraint sorts.
#[derive(Clone, Debug, Default, PartialEq, Eq, Serialize, Deserialize)]
#[allow(clippy::struct_excessive_bools)]
pub struct Protocol {
    /// Human-readable protocol name (e.g., `"atproto"`, `"sql"`).
    pub name: String,
    /// Name of the schema theory GAT in the theory registry.
    pub schema_theory: String,
    /// Name of the instance theory GAT in the theory registry.
    pub instance_theory: String,
    /// Composition recipe that produced the schema theory.
    pub schema_composition: Option<CompositionSpec>,
    /// Composition recipe that produced the instance theory.
    pub instance_composition: Option<CompositionSpec>,
    /// Well-formedness rules for each edge kind.
    pub edge_rules: Vec<EdgeRule>,
    /// Vertex kinds that are considered "object-like" (containers).
    pub obj_kinds: Vec<String>,
    /// Recognized constraint sorts (e.g., `"maxLength"`, `"format"`).
    pub constraint_sorts: Vec<String>,

    // -- structural feature flags (all default to false) --
    /// Whether this protocol uses ordered collections (`ThOrder`).
    #[serde(default)]
    pub has_order: bool,
    /// Whether this protocol has coproduct/union types (`ThCoproduct`).
    #[serde(default)]
    pub has_coproducts: bool,
    /// Whether this protocol supports recursive types (`ThRecursion`).
    #[serde(default)]
    pub has_recursion: bool,
    /// Whether this protocol has causal/temporal ordering (`ThCausal`).
    #[serde(default)]

7.3.3 EdgeRule

The EdgeRule struct encodes well-formedness constraints on schema edges. Each rule governs a specific edge kind and specifies which vertex kinds are permitted as sources and targets. When a list is empty, any vertex kind is allowed in that position.

This design keeps protocol definitions declarative. Rather than writing procedural validation code, protocol authors declare a table of rules and the builder enforces them uniformly.
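The rule semantics can be sketched in a few lines. The struct below mirrors the EdgeRule fields shown above; allows, permits, and record_schema_rule are hypothetical helpers, and the rule contents are an assumption based on the AT Protocol example in Section 7.1.3 rather than the real protocol definition.

```rust
/// Mirrors the fields of `EdgeRule` shown above.
struct EdgeRule {
    edge_kind: String,
    src_kinds: Vec<String>,
    tgt_kinds: Vec<String>,
}

/// An empty list places no restriction on that position.
fn allows(list: &[String], kind: &str) -> bool {
    list.is_empty() || list.iter().any(|k| k == kind)
}

/// Hypothetical helper: does `rule` permit an edge of `edge_kind` from a
/// vertex of kind `src_kind` to a vertex of kind `tgt_kind`?
fn permits(rule: &EdgeRule, edge_kind: &str, src_kind: &str, tgt_kind: &str) -> bool {
    rule.edge_kind == edge_kind
        && allows(&rule.src_kinds, src_kind)
        && allows(&rule.tgt_kinds, tgt_kind)
}

/// Assumed rule for the example in the text:
/// record --record-schema--> object.
fn record_schema_rule() -> EdgeRule {
    EdgeRule {
        edge_kind: "record-schema".to_string(),
        src_kinds: vec!["record".to_string()],
        tgt_kinds: vec!["object".to_string()],
    }
}
```

Under this rule, permits accepts a "record-schema" edge from a "record" vertex to an "object" vertex and rejects one that targets a "string" vertex. A rule with both lists empty accepts any pair of kinds for its edge kind.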

7.3.4 Protocol fields

schema_theory and instance_theory name the GATs in the theory registry.
edge_rules constrain which edges can connect which vertices.
obj_kinds lists vertex kinds that are considered “object-like” (containers that can have child properties). The instance parser uses this list to decide whether a JSON value should be walked recursively.
constraint_sorts lists the recognized constraint types for post-build validation.

7.3.5 Schema structure diagram

classDiagram
    class Protocol {
        +String name
        +String schema_theory
        +String instance_theory
        +Vec~EdgeRule~ edge_rules
        +Vec~String~ obj_kinds
        +Vec~String~ constraint_sorts
        +find_edge_rule(kind) Option~EdgeRule~
        +is_known_vertex_kind(kind) bool
    }

    class EdgeRule {
        +String edge_kind
        +Vec~String~ src_kinds
        +Vec~String~ tgt_kinds
    }

    class Schema {
        +String protocol
        +HashMap vertices
        +HashMap edges
        +HashMap hyper_edges
        +HashMap constraints
        +HashMap outgoing
        +HashMap incoming
        +HashMap between
        +vertex(id) Option~Vertex~
        +outgoing_edges(id) Slice~Edge~
        +incoming_edges(id) Slice~Edge~
    }

    class Vertex {
        +String id
        +String kind
        +Option~String~ nsid
    }

    class Edge {
        +String src
        +String tgt
        +String kind
        +Option~String~ name
    }

    Protocol *-- EdgeRule : edge_rules
    Schema *-- Vertex : vertices
    Schema *-- Edge : edges
    Schema ..> Protocol : built against

Figure 7.1: Class diagram showing the relationships between Protocol, Schema, and element types.

7.4 SchemaBuilder

SchemaBuilder is the only supported way to construct a Schema. It exposes a fluent, consume-self API in which each method returns Result<Self, SchemaError>. A validation failure is reported by the call that caused it, so an invalid element never enters the builder, and a Schema is produced only by a successful build().

use panproto_expr::Expr;

/// A builder for incrementally constructing a validated [`Schema`].
///
/// # Example
///
/// ```ignore
/// let schema = SchemaBuilder::new(&protocol)
///     .vertex("post", "record", Some("app.bsky.feed.post"))?
///     .vertex("post:body", "object", None)?
///     .edge("post", "post:body", "record-schema", None)?
///     .build()?;
/// ```
pub struct SchemaBuilder {
    protocol: Protocol,
    vertices: HashMap<Name, Vertex>,
    edges: Vec<Edge>,
    hyper_edges: HashMap<Name, HyperEdge>,
    constraints: HashMap<Name, Vec<Constraint>>,
    required: HashMap<Name, Vec<Edge>>,
    nsids: HashMap<Name, Name>,
    edge_set: FxHashSet<(Name, Name, Name, Option<Name>)>,
    coercions: HashMap<(Name, Name), CoercionSpec>,
    mergers: HashMap<Name, Expr>,
    defaults: HashMap<Name, Expr>,
    policies: HashMap<Name, Expr>,
}

impl SchemaBuilder {
    /// Create a new builder for the given protocol.
    #[must_use]
    pub fn new(protocol: &Protocol) -> Self {
        Self {
            protocol: protocol.clone(),
            vertices: HashMap::new(),
            edges: Vec::new(),
            hyper_edges: HashMap::new(),

7.4.1 Validation pipeline

The builder validates elements incrementally as they’re added, then performs a final global check at build() time:

flowchart LR
    A["vertex()"] --> B{"Vertex kind<br>in protocol?"}
    B -->|yes| C{"Duplicate<br>ID?"}
    B -->|no| X1["SchemaError::<br>UnknownVertexKind"]
    C -->|no| D["Accept vertex"]
    C -->|yes| X2["SchemaError::<br>DuplicateVertex"]

    E["edge()"] --> F{"src & tgt<br>exist?"}
    F -->|yes| G{"Edge kind<br>has rule?"}
    F -->|no| X3["SchemaError::<br>VertexNotFound"]
    G -->|yes| H{"src/tgt kinds<br>match rule?"}
    G -->|no| X4["SchemaError::<br>UnknownEdgeKind"]
    H -->|yes| I["Accept edge"]
    H -->|no| X5["SchemaError::<br>InvalidEdge*"]

    J["build()"] --> K{"Any<br>vertices?"}
    K -->|yes| L["Compute<br>adjacency indices"]
    K -->|no| X6["SchemaError::<br>EmptySchema"]
    L --> M["Return Schema"]

    style X1 fill:#fdd,stroke:#c33
    style X2 fill:#fdd,stroke:#c33
    style X3 fill:#fdd,stroke:#c33
    style X4 fill:#fdd,stroke:#c33
    style X5 fill:#fdd,stroke:#c33
    style X6 fill:#fdd,stroke:#c33
    style M fill:#dfd,stroke:#3c3

Figure 7.2: SchemaBuilder validation pipeline. Each step checks one class of invariant.

7.4.2 Vertex validation

When vertex() is called, the builder checks:

No duplicate ID: the vertex ID must not already exist in the builder’s vertex map.
Known kind: if the protocol defines any edge rules or obj_kinds, the vertex kind must appear somewhere in those declarations. If the protocol is fully open (no rules at all), any kind is accepted.
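The two vertex checks, together with the consume-self style of the builder, can be sketched as follows. MiniBuilder and BuildError are stand-ins invented for this example, not the crate’s types. The sketch assumes that the set of known kinds has already been collected from the protocol’s edge rules and obj_kinds, and that an empty set means an open protocol (Section 7.5). The order of the two checks follows Figure 7.2.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for the two `SchemaError` variants named in Figure 7.2.
#[derive(Debug, PartialEq)]
enum BuildError {
    DuplicateVertex(String),
    UnknownVertexKind(String),
}

/// Stand-in builder. `known_kinds` is assumed to be gathered from the
/// protocol's declarations; an empty set models an open protocol.
struct MiniBuilder {
    known_kinds: HashSet<String>,
    vertices: HashMap<String, String>, // vertex id -> kind
}

impl MiniBuilder {
    fn new(known_kinds: &[&str]) -> Self {
        Self {
            known_kinds: known_kinds.iter().map(|k| k.to_string()).collect(),
            vertices: HashMap::new(),
        }
    }

    /// Consumes the builder and hands it back only if both checks pass.
    fn vertex(mut self, id: &str, kind: &str) -> Result<Self, BuildError> {
        // Guard clause: an open protocol has nothing to validate against.
        if !self.known_kinds.is_empty() && !self.known_kinds.contains(kind) {
            return Err(BuildError::UnknownVertexKind(kind.to_string()));
        }
        if self.vertices.contains_key(id) {
            return Err(BuildError::DuplicateVertex(id.to_string()));
        }
        self.vertices.insert(id.to_string(), kind.to_string());
        Ok(self)
    }
}

/// Chains two calls with `?`, in the style of the doc-comment example.
fn two_vertices(kinds: &[&str]) -> Result<MiniBuilder, BuildError> {
    let builder = MiniBuilder::new(kinds)
        .vertex("post", "record")?
        .vertex("post:body", "object")?;
    Ok(builder)
}
```

Because vertex takes self by value, a failed call drops the builder, so the caller cannot keep adding elements to a builder that has already rejected one.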

7.4.3 Edge validation

When edge() is called, the builder checks:

Endpoints exist: both src and tgt must be vertex IDs already added to the builder.
Known edge kind: if the protocol defines edge rules, there must be a rule for this edge kind.
Source kind permitted: if the matching EdgeRule has a non-empty src_kinds list, the source vertex’s kind must appear in it.
Target kind permitted: same check for tgt_kinds.
No duplicate edge: the (src, tgt, kind, name) tuple must be unique.
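The five edge checks above can be sketched as a single function. This is an illustration under stated assumptions, not the builder’s implementation: String replaces Name, check_edge, fixture, and key are hypothetical helpers, and the variant names InvalidSource, InvalidTarget, and DuplicateEdge are invented here because the figure only shows them as InvalidEdge*.

```rust
use std::collections::{HashMap, HashSet};

/// Error stand-in; the last three variant names are assumptions.
#[derive(Debug, PartialEq)]
enum EdgeError {
    VertexNotFound(String),
    UnknownEdgeKind(String),
    InvalidSource(String),
    InvalidTarget(String),
    DuplicateEdge,
}

struct EdgeRule {
    edge_kind: String,
    src_kinds: Vec<String>,
    tgt_kinds: Vec<String>,
}

/// (src, tgt, kind, name): the tuple that must be unique.
type EdgeKey = (String, String, String, Option<String>);

fn check_edge(
    vertices: &HashMap<String, String>, // vertex id -> kind
    rules: &[EdgeRule],
    seen: &mut HashSet<EdgeKey>,
    edge: EdgeKey,
) -> Result<(), EdgeError> {
    let (src, tgt, kind, _name) = &edge;
    // 1. Both endpoints must already exist.
    let src_kind = vertices
        .get(src)
        .ok_or_else(|| EdgeError::VertexNotFound(src.clone()))?;
    let tgt_kind = vertices
        .get(tgt)
        .ok_or_else(|| EdgeError::VertexNotFound(tgt.clone()))?;
    // 2-4. Rule checks apply only when the protocol declares edge rules.
    if !rules.is_empty() {
        let rule = rules
            .iter()
            .find(|r| &r.edge_kind == kind)
            .ok_or_else(|| EdgeError::UnknownEdgeKind(kind.clone()))?;
        if !rule.src_kinds.is_empty() && !rule.src_kinds.contains(src_kind) {
            return Err(EdgeError::InvalidSource(src_kind.clone()));
        }
        if !rule.tgt_kinds.is_empty() && !rule.tgt_kinds.contains(tgt_kind) {
            return Err(EdgeError::InvalidTarget(tgt_kind.clone()));
        }
    }
    // 5. The (src, tgt, kind, name) tuple must not have been seen before.
    if !seen.insert(edge) {
        return Err(EdgeError::DuplicateEdge);
    }
    Ok(())
}

/// Hypothetical fixture based on the AT Protocol example in the text.
fn fixture() -> (HashMap<String, String>, Vec<EdgeRule>) {
    let vertices: HashMap<String, String> =
        [("post", "record"), ("post:body", "object"), ("post:body.text", "string")]
            .iter()
            .map(|(id, kind)| (id.to_string(), kind.to_string()))
            .collect();
    let rules = vec![EdgeRule {
        edge_kind: "record-schema".to_string(),
        src_kinds: vec!["record".to_string()],
        tgt_kinds: vec!["object".to_string()],
    }];
    (vertices, rules)
}

/// Builds an unnamed edge key for the examples.
fn key(src: &str, tgt: &str, kind: &str) -> EdgeKey {
    (src.to_string(), tgt.to_string(), kind.to_string(), None)
}
```

The duplicate check runs last so that a rejected edge is never recorded in the seen set.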

7.4.4 Build-time index computation

The build() method iterates over all accepted edges exactly once, populating three indices:

outgoing[src]: all edges leaving a vertex.
incoming[tgt]: all edges arriving at a vertex.
between[(src, tgt)]: all edges between a specific pair.

These indices use SmallVec<Edge, 4> (or SmallVec<Edge, 2> for between) to avoid heap allocation in the common case. The choice of inline capacity (4 and 2 respectively) is based on profiling of real-world AT Protocol schemas, where the median vertex outdegree is 3.
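The single pass over the edges can be sketched with the standard library alone. This version substitutes Vec for SmallVec and String for Name, both simplifications made so the example has no external dependencies; Indices, build_indices, and the edge helper are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Edge {
    src: String,
    tgt: String,
    kind: String,
    name: Option<String>,
}

/// The three adjacency indices, with `Vec` standing in for `SmallVec`.
#[derive(Default)]
struct Indices {
    outgoing: HashMap<String, Vec<Edge>>,
    incoming: HashMap<String, Vec<Edge>>,
    between: HashMap<(String, String), Vec<Edge>>,
}

/// Visits each edge once and clones it into all three indices.
fn build_indices(edges: &[Edge]) -> Indices {
    let mut idx = Indices::default();
    for e in edges {
        idx.outgoing.entry(e.src.clone()).or_default().push(e.clone());
        idx.incoming.entry(e.tgt.clone()).or_default().push(e.clone());
        idx.between
            .entry((e.src.clone(), e.tgt.clone()))
            .or_default()
            .push(e.clone());
    }
    idx
}

impl Indices {
    /// A vertex with no outgoing edges yields an empty slice.
    fn outgoing_edges(&self, id: &str) -> &[Edge] {
        self.outgoing.get(id).map(Vec::as_slice).unwrap_or(&[])
    }
}

/// Convenience constructor for the examples.
fn edge(src: &str, tgt: &str, kind: &str, name: Option<&str>) -> Edge {
    Edge {
        src: src.to_string(),
        tgt: tgt.to_string(),
        kind: kind.to_string(),
        name: name.map(str::to_string),
    }
}
```

Each edge is pushed three times, so the pass costs \(O(|E|)\) time and stores three copies of every edge.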

Tip

If you’re adding a new protocol to panproto-protocols, start by defining the Protocol struct with its edge rules and vertex kinds. The SchemaBuilder will enforce those rules automatically. You don’t need to write any custom validation code.

7.5 Open vs. closed protocols

The builder supports both open and closed protocols:

Closed protocol: defines obj_kinds and/or edge_rules. Only vertex kinds and edge kinds explicitly mentioned are accepted. This is the default for production protocols like AT Protocol and SQL.
Open protocol: has empty obj_kinds and edge_rules. Any vertex kind and any edge kind are accepted without validation. This is useful for testing and for protocols where the set of kinds isn’t known ahead of time.

The distinction is handled by a guard clause in the builder: validation is skipped when the protocol has no declarations to validate against.

Can an open protocol cause problems downstream?

Yes. An open protocol allows any vertex kind, so the migration engine can’t check kind consistency. The breaking-change detector can’t determine which edge kinds are structurally significant. Use open protocols only for testing or exploratory work. Production protocols should always declare their edge rules.

7.6 Schema normalization

The normalize module (in schema/src/normalize.rs) provides canonical ordering of schema elements for deterministic serialization. Normalization sorts vertices by ID, edges by (src, tgt, kind, name), and constraints by (sort, value). This ensures that structurally identical schemas produce byte-identical MessagePack output, which is critical for content-addressed storage and migration hash verification.
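The ordering described here can be illustrated with derived Ord, which compares struct fields in declaration order. The sketch below is not the normalize module: it uses String for Name, covers only the sorting step and not the MessagePack encoding, and its function names are hypothetical.

```rust
use std::collections::HashMap;

/// Field order matters: derived `Ord` compares src, tgt, kind, name in turn.
#[derive(Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct Edge {
    src: String,
    tgt: String,
    kind: String,
    name: Option<String>,
}

/// Derived `Ord` compares sort first, then value.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Constraint {
    sort: String,
    value: String,
}

/// Vertex IDs in ascending order, independent of HashMap iteration order.
fn sorted_vertex_ids(vertices: &HashMap<String, String>) -> Vec<&str> {
    let mut ids: Vec<&str> = vertices.keys().map(String::as_str).collect();
    ids.sort_unstable();
    ids
}

/// Edges in (src, tgt, kind, name) order.
fn sorted_edges(edges: &HashMap<Edge, String>) -> Vec<&Edge> {
    let mut out: Vec<&Edge> = edges.keys().collect();
    out.sort();
    out
}

/// Constraints in (sort, value) order.
fn sorted_constraints(mut constraints: Vec<Constraint>) -> Vec<Constraint> {
    constraints.sort();
    constraints
}

/// Convenience constructor for the examples.
fn edge(src: &str, tgt: &str, kind: &str, name: Option<&str>) -> Edge {
    Edge {
        src: src.to_string(),
        tgt: tgt.to_string(),
        kind: kind.to_string(),
        name: name.map(str::to_string),
    }
}
```

Two maps that hold the same edges but were filled in a different order yield the same sorted sequence, which is the property that deterministic serialization depends on.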

7.7 Schema validation

Beyond the per-element checks in the builder, the validate module provides post-build validation:

Constraint sort validation: checks that every constraint’s sort field appears in the protocol’s constraint_sorts list.
Connectivity checks: verifies that the schema graph is connected (every vertex is reachable from some root vertex via edges).
Required-edge checks: verifies that vertices which declare required edges actually have those edges in the schema.

These checks are separated from the builder because they require the complete schema to evaluate.
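Two of these checks are sketched below to show why they need the whole schema. The code is illustrative only: both function names are hypothetical, constraints are reduced to (sort, value) pairs, edges to (src, tgt) pairs, and a root is assumed to be a vertex with no incoming edge, which the text does not define precisely.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Constraint sorts used in the schema but missing from `constraint_sorts`.
fn unknown_constraint_sorts<'a>(
    constraints: &'a HashMap<String, Vec<(String, String)>>, // vertex -> [(sort, value)]
    constraint_sorts: &[String],
) -> Vec<&'a str> {
    let mut bad: Vec<&str> = constraints
        .values()
        .flatten()
        .map(|(sort, _)| sort.as_str())
        .filter(|s| !constraint_sorts.iter().any(|known| known.as_str() == *s))
        .collect();
    bad.sort();
    bad.dedup();
    bad
}

/// Vertices not reachable from any root, in sorted order.
/// Assumption: a root is a vertex with no incoming edge.
fn unreachable_vertices(vertices: &[&str], edges: &[(&str, &str)]) -> Vec<String> {
    let has_incoming: HashSet<&str> = edges.iter().map(|(_, t)| *t).collect();
    let mut queue: VecDeque<&str> = vertices
        .iter()
        .copied()
        .filter(|v| !has_incoming.contains(v))
        .collect();
    let mut seen: HashSet<&str> = queue.iter().copied().collect();
    // Breadth-first search along directed edges. A linear edge scan keeps
    // the sketch short; precomputed adjacency indices would avoid it.
    while let Some(v) = queue.pop_front() {
        for (s, t) in edges {
            if *s == v && seen.insert(*t) {
                queue.push_back(*t);
            }
        }
    }
    let mut missing: Vec<String> = vertices
        .iter()
        .filter(|v| !seen.contains(*v))
        .map(|v| v.to_string())
        .collect();
    missing.sort();
    missing
}
```

Under the root assumption used here, a cycle with no entry point from a root is reported as unreachable, since none of its vertices qualifies as a root.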

7.8 Design rationale

7.8.1 Why adjacency indices at build time

Computing adjacency indices eagerly at build time (rather than lazily on first access) is a deliberate trade-off. Schema construction happens once; schema traversal happens many times during migration compilation, instance parsing, and validation. Paying the \(O(|E|)\) index-building cost once eliminates repeated \(O(|E|)\) scans during traversal.

7.8.2 Why string-keyed maps

Vertices, hyper-edges, and per-vertex constraint lists are keyed by string identifiers (the Name type) rather than integer IDs; edges are keyed by the Edge value itself, whose fields are likewise names. String IDs carry semantic meaning (e.g., "post:body.text") that makes schemas human-readable and debuggable. The performance cost is acceptable because schema sizes are modest (hundreds to low thousands of elements) and all hot-path lookups go through the precomputed HashMap indices.

7.8.3 Why full Edge values in the adjacency indices

The adjacency indices store full Edge clones rather than references or indices into a flat edge array. This avoids lifetime complications and keeps the adjacency API simple (outgoing_edges returns &[Edge]). Each edge is duplicated across the three indices, but at the schema sizes described above the extra memory is small, and SmallVec’s inline storage keeps the common case free of heap allocation.