The previous chapter showed four descriptions of the same post and observed that they share a common anatomy: containers holding named, typed members, with constraints on some of those members. This chapter makes that observation precise. We will start with graphs (which are easy to draw), move to theories (which are the real tool), and end with the Rust data structures that panproto uses internally.
If you have seen category theory before, you will recognize the trajectory: concrete examples first, then the abstraction, then the implementation. If you have not, nothing here requires it. We will build the concepts from the ground up.
Take the simplest possible JSON Schema:
```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string" }
  }
}
```

This describes one container (object) with one member (name) of type string. Draw it as a picture: two dots for object and string, one arrow labeled name pointing from the first to the second.
That picture is a directed graph. Formally, a directed graph is a quadruple \((V, E, s, t)\): a set of vertices \(V\), a set of edges \(E\), and two functions \(s, t\colon E \to V\) that assign each edge a source and a target. For our JSON Schema, \(V = \{\texttt{object}, \texttt{string}\}\), \(E = \{\texttt{name}\}\), \(s(\texttt{name}) = \texttt{object}\), \(t(\texttt{name}) = \texttt{string}\).
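To make the quadruple concrete, here is a minimal Rust sketch (illustrative only, not panproto code) that writes out \((V, E, s, t)\) for this one-property schema, with the source and target functions as maps:

```rust
use std::collections::{HashMap, HashSet};

/// The quadruple (V, E, s, t) for the one-property schema, written out
/// directly: vertices and edges are names, s and t are lookup tables.
fn name_schema_graph() -> (
    HashSet<&'static str>,
    HashSet<&'static str>,
    HashMap<&'static str, &'static str>,
    HashMap<&'static str, &'static str>,
) {
    let v: HashSet<_> = ["object", "string"].into_iter().collect();
    let e: HashSet<_> = ["name"].into_iter().collect();
    let s: HashMap<_, _> = [("name", "object")].into_iter().collect();
    let t: HashMap<_, _> = [("name", "string")].into_iter().collect();
    (v, e, s, t)
}

fn main() {
    let (v, e, s, t) = name_schema_graph();
    assert_eq!(v.len(), 2);
    assert_eq!(e.len(), 1);
    assert_eq!(s["name"], "object"); // s(name) = object
    assert_eq!(t["name"], "string"); // t(name) = string
}
```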
Now consider something slightly more realistic. An ATProto Lexicon record with two string fields:
```json
{
  "type": "object",
  "properties": {
    "displayName": { "type": "string", "maxLength": 640 },
    "description": { "type": "string", "maxLength": 2560 }
  }
}
```

Both displayName and description are arrows from object to string. A plain directed graph cannot represent this: it collapses any two edges that share the same source and target into one. We need a multigraph, a directed graph that allows parallel edges between the same pair of vertices.
```mermaid
graph LR
    Object((object))
    Str((string))
    Object -- "displayName<br/>maxLength: 640" --> Str
    Object -- "description<br/>maxLength: 2560" --> Str
```

Two parallel edges from object to string, distinguished by their labels.
The moment you have two fields of the same type—which is nearly always—you need at least a multigraph.
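A small Rust sketch (illustrative only) makes the collapse visible: storing edges as (source, target) pairs loses one of the two fields, while giving edges their own identity, with s and t as separate maps, keeps both:

```rust
use std::collections::{HashMap, HashSet};

/// Count how many of the Lexicon record's two fields survive under each
/// representation: edges-as-pairs versus edges with their own identity.
fn surviving_edges() -> (usize, usize) {
    // Plain graph: an edge IS its (source, target) pair, so the two
    // fields collapse into one set element.
    let mut as_relation: HashSet<(&str, &str)> = HashSet::new();
    as_relation.insert(("object", "string")); // displayName
    as_relation.insert(("object", "string")); // description (lost)

    // Multigraph: edges are named; s and t recover the endpoints, and
    // parallel edges survive.
    let mut src: HashMap<&str, &str> = HashMap::new();
    let mut tgt: HashMap<&str, &str> = HashMap::new();
    for field in ["displayName", "description"] {
        src.insert(field, "object"); // s(field) = object
        tgt.insert(field, "string"); // t(field) = string
    }
    (as_relation.len(), src.len())
}

fn main() {
    let (relation, multigraph) = surviving_edges();
    assert_eq!(relation, 1);   // parallel edges collapsed
    assert_eq!(multigraph, 2); // both fields kept
}
```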
Some languages need even more. A SQL foreign key connects four things at once: the source table, the source column, the referenced table, and the referenced column.
```sql
ALTER TABLE posts
  ADD CONSTRAINT fk_posts_author
  FOREIGN KEY (author_id) REFERENCES authors (id);
```

You cannot represent that as a binary edge without losing information. A hypergraph handles it: hyperedges connect an ordered sequence of vertices, not just two.
```mermaid
graph TB
    subgraph Posts ["posts (table)"]
        P_id["id : BIGINT PK"]
        P_author["author_id : BIGINT"]
    end
    subgraph Authors ["authors (table)"]
        A_id["id : BIGINT PK"]
    end
    FK{{"fk_posts_author"}}
    FK -. "source table" .-> Posts
    FK -. "source column" .-> P_author
    FK -. "referenced table" .-> Authors
    FK -. "referenced column" .-> A_id
    style FK fill:#f9f,stroke:#333,stroke-dasharray: 5 5
```
Composite unique constraints, Protobuf oneof declarations, and GraphQL union types are all multi-ary relationships that benefit from hypergraph representation.
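A minimal Rust sketch of a hyperedge (illustrative names, not panproto's representation): the foreign key becomes a single edge over an ordered sequence of four vertices, and the order carries the meaning a binary edge would lose:

```rust
/// A hyperedge connecting an ordered sequence of vertices. For the
/// foreign-key example the sequence has exactly four entries.
struct HyperEdge {
    name: &'static str,
    // Ordered: [source table, source column, referenced table, referenced column]
    vertices: Vec<&'static str>,
}

fn fk_posts_author() -> HyperEdge {
    HyperEdge {
        name: "fk_posts_author",
        vertices: vec!["posts", "posts.author_id", "authors", "authors.id"],
    }
}

fn main() {
    let fk = fk_posts_author();
    // A binary edge stores two endpoints; this hyperedge stores four.
    assert_eq!(fk.vertices.len(), 4);
    assert_eq!(fk.name, "fk_posts_author");
}
```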
Consider a fragment of GitHub’s GraphQL schema:
```graphql
type Repository {
  name: String!
  description: String
  issues(first: Int): IssueConnection!
  pullRequests(first: Int): PullRequestConnection!
  owner: RepositoryOwner!
}

type Issue {
  title: String!
  body: String!
  author: Actor
}
```

As a multigraph, the vertices are Repository, Issue, String, IssueConnection, PullRequestConnection, RepositoryOwner, Int, and Actor. Each field is an edge. Repository alone has five outgoing edges, two of which—name and description—both target String. Those parallel edges are exactly why we need multigraph structure. The full GitHub GraphQL schema is a multigraph with hundreds of vertices and thousands of edges, all following this same pattern.
Schemas carry more than connectivity. They have constraints: maxLength, NOT NULL, DEFAULT, pattern, minimum, maxItems. But constraints do not change the graph. A maxLength on a string property attaches metadata to an existing edge or vertex; it does not add new vertices or edges.
Different schema languages support different constraints:
| Constraint | JSON Schema | Protobuf | SQL DDL | GraphQL | ATProto Lexicon |
|---|---|---|---|---|---|
| maxLength | yes | no | VARCHAR(n) | no | yes |
| pattern (regex) | yes | no | CHECK | no | no |
| minimum / maximum | yes | no | CHECK | no | yes |
| NOT NULL | required[] | (default set) | NOT NULL | ! | required[] |
| DEFAULT | default | default value | DEFAULT | no | default |
| UNIQUE | no | no | UNIQUE | no | no |
| PRIMARY KEY | no | no | PRIMARY KEY | no | no |
The constraint vocabulary varies wildly. The mechanism—annotations on a graph—is uniform.
An ATProto Lexicon has three string fields with different maxLength constraints. Draw the multigraph. Is maxLength a vertex, an edge, or something else?
The multigraph has two vertices (object and string) and three parallel edges from object to string, one for each field. maxLength is neither a vertex nor an edge. It is a constraint annotation, and because the three fields carry different limits, each annotation must decorate its own edge; a single annotation on the string vertex could not record three different limits. The annotations add metadata without changing the graph's shape.
We have been describing these graph structures informally, and informal descriptions are fine for building intuition. But panproto needs to compute with them: it needs to diff two schemas, check whether a migration is valid, and derive data transformations automatically. For that, we need a language for describing the structure of structure itself.
Suppose you wanted to write a TypeScript interface that any directed graph must satisfy. You might write something like:
```typescript
interface DirectedGraph<Vertex, Edge> {
  vertices: Set<Vertex>;
  edges: Set<Edge>;
  src: (e: Edge) => Vertex;
  tgt: (e: Edge) => Vertex;
}
```

That interface says: “give me a set of vertices, a set of edges, and two functions that assign each edge a source vertex and a target vertex.” Any concrete graph—the post schema, the GitHub schema, any graph at all—is a value that satisfies this interface.
A theory is the mathematical version of that idea. A theory \(\mathbb{T} = (\mathcal{S}, \mathcal{O}, \mathcal{E})\) packages three things:
Sorts \(\mathcal{S}\) are the types of entities. In the directed graph theory, \(\mathcal{S} = \{\text{Vertex}, \text{Edge}\}\). In a SQL schema theory, \(\mathcal{S}\) includes Table, Column, ForeignKey, and more. (panproto uses the word “sort” rather than “type” to avoid confusion with schema-level types like string or integer.)
Operations \(\mathcal{O}\) are functions connecting sorts. \(\text{src}\colon \text{Edge} \to \text{Vertex}\) says “every edge has a source vertex.”
Equations \(\mathcal{E}\) are invariants that operations must satisfy. In SQL, the equation \(\text{table}(\text{srcColumn}(fk)) = \text{srcTable}(fk)\) says: take a foreign key, get its source column, ask which table that column belongs to; you get the same answer as asking the foreign key directly for its source table.
A concrete graph that provides actual sets and functions satisfying a theory is called a model. The relationship is the same as between a TypeScript interface and a class implementing it: the theory describes what must exist and what laws hold; the model fills in the specifics.
The simplest useful theory:
\[\mathcal{S} = \{\text{Vertex}, \text{Edge}\}, \quad \mathcal{O} = \{\text{src}\colon \text{Edge} \to \text{Vertex},\; \text{tgt}\colon \text{Edge} \to \text{Vertex}\}, \quad \mathcal{E} = \emptyset.\]
No equations—any assignment of sets and functions is a valid model. That is because directed graphs have no structural invariants beyond connectivity. They are the simplest kind of schema.
A model of this theory is a concrete graph. The ATProto post schema, viewed as a multigraph, is one: the schema is the model.
To see what equations buy you, consider a different theory entirely. A monoid is a set with an associative binary operation and an identity element. String concatenation is a monoid. Array.prototype.reduce with an initial value uses a monoid.
\[\mathcal{S} = \{\text{Carrier}\}\]
\[\mathcal{O} = \{\text{mul}\colon \text{Carrier} \times \text{Carrier} \to \text{Carrier},\; \text{unit}\colon () \to \text{Carrier}\}\]
\[\mathcal{E} = \begin{cases} \text{mul}(\text{mul}(a, b), c) = \text{mul}(a, \text{mul}(b, c)) & \text{(associativity)} \\ \text{mul}(\text{unit}(), a) = a & \text{(left identity)} \\ \text{mul}(a, \text{unit}()) = a & \text{(right identity)} \end{cases}\]
Three models: integers under addition (with \(\text{unit} = 0\)), strings under concatenation (with \(\text{unit} = \texttt{""}\)), booleans under AND (with \(\text{unit} = \text{true}\)). Different concrete sets and functions, same theory. The equations constrain what counts as a valid model: you cannot claim that addition is a monoid and then hand back a mul function that is not associative.
The directed graph theory had no equations, which means anything goes. The monoid theory has three equations, which means only structures satisfying associativity and identity count. The more equations a theory has, the more opinionated it is about what a valid model looks like.
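The three models can be written as one Rust trait with three implementations, a sketch (not panproto code) in which the carrier is the implementing type and the equations are laws the implementations must uphold:

```rust
/// The monoid theory as a trait: one carrier (Self), two operations.
/// The equations (associativity, left/right identity) are laws the
/// implementations promise to satisfy; Rust cannot check them itself.
trait Monoid: Sized {
    fn unit() -> Self;
    fn mul(self, other: Self) -> Self;
}

/// Model 1: integers under addition, unit = 0.
impl Monoid for i64 {
    fn unit() -> Self { 0 }
    fn mul(self, other: Self) -> Self { self + other }
}

/// Model 2: strings under concatenation, unit = "".
impl Monoid for String {
    fn unit() -> Self { String::new() }
    fn mul(self, other: Self) -> Self { self + &other }
}

/// Model 3: booleans under AND, unit = true.
impl Monoid for bool {
    fn unit() -> Self { true }
    fn mul(self, other: Self) -> Self { self && other }
}

/// Fold any sequence through any model of the theory.
fn mconcat<M: Monoid>(items: Vec<M>) -> M {
    items.into_iter().fold(M::unit(), M::mul)
}

fn main() {
    assert_eq!(mconcat(vec![1i64, 2, 3]), 6);
    assert_eq!(mconcat(vec!["ab".to_string(), "c".to_string()]), "abc");
    assert_eq!(mconcat(vec![true, true, false]), false);
}
```

One generic `mconcat` works across all three models: that is exactly the theory/model split in miniature.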
Write out the theory for a reflexive graph: a directed graph where every vertex has a distinguished self-loop. How many sorts, operations, and equations?
Two sorts (Vertex, Edge), three operations (\(\text{src}\colon \text{Edge} \to \text{Vertex}\), \(\text{tgt}\colon \text{Edge} \to \text{Vertex}\), \(\text{id}\colon \text{Vertex} \to \text{Edge}\)), and two equations (\(\text{src}(\text{id}(v)) = v\) and \(\text{tgt}(\text{id}(v)) = v\)). The id operation assigns each vertex a distinguished self-loop, and the equations ensure that loop starts and ends at the same vertex.
The theories above treat all vertices uniformly: every vertex is the same kind of thing. But real constraints are not uniform. maxLength makes sense on strings, not on integers. CHECK makes sense on SQL columns, not on GraphQL fields. maxGraphemes makes sense on ATProto string properties and nothing else.
One approach is to model constraints as a single flat sort and check at runtime whether a given constraint is valid for a given vertex. That works, but it is unsatisfying in the same way that TypeScript’s any type is unsatisfying. You can write correct programs with it, but the type system is not helping you.
The better approach is to make the sort of constraints depend on the vertex. String vertices accept maxLength and pattern. Integer vertices accept minimum and maximum. The constraint vocabulary is not one set; it is a family of sets, indexed by vertex.
A Generalized Algebraic Theory (GAT) expresses this with dependent sorts: sorts parameterized by terms of other sorts.1
For a specific vertex, say stringVertex, the set \(\text{Constraint}(\text{stringVertex}) = \{\text{maxLength}, \text{pattern}\}\). For intVertex, \(\text{Constraint}(\text{intVertex}) = \{\text{minimum}, \text{maximum}\}\). These are genuinely different sets. A model that tries to put maxLength in \(\text{Constraint}(\text{intVertex})\) is not a valid model; the types do not match.
The flat-sort alternative is to dependent sorts what any is to precise types. Both let you write correct programs, but dependent sorts catch constraint errors structurally, at theory-definition time, rather than at runtime.
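One way to approximate a vertex-indexed family in Rust (an illustrative sketch, not panproto's encoding) is a separate constraint enum per vertex kind, so an ill-sorted constraint fails to compile rather than failing at runtime:

```rust
/// Constraints valid on string vertices: Constraint(stringVertex).
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum StringConstraint {
    MaxLength(u32),
    Pattern(String),
}

/// Constraints valid on integer vertices: a genuinely different set.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum IntConstraint {
    Minimum(i64),
    Maximum(i64),
}

/// Each vertex kind carries only its own constraint vocabulary.
enum Vertex {
    Str { constraints: Vec<StringConstraint> },
    Int { constraints: Vec<IntConstraint> },
}

fn count_constraints(v: &Vertex) -> usize {
    match v {
        Vertex::Str { constraints } => constraints.len(),
        Vertex::Int { constraints } => constraints.len(),
    }
}

fn main() {
    let display_name = Vertex::Str {
        constraints: vec![StringConstraint::MaxLength(640)],
    };
    // Vertex::Int { constraints: vec![StringConstraint::MaxLength(640)] }
    // would not compile: maxLength is not in Constraint(intVertex).
    assert_eq!(count_constraints(&display_name), 1);
}
```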
With sorts, operations, equations, and dependent sorts, we can write down complete theories for real schema languages. Here are three.
Sorts: SchemaNode, Property, \(\text{Constraint}(s\colon \text{SchemaNode})\)
Operations: at minimum \(\text{owner}\colon \text{Property} \to \text{SchemaNode}\), giving the node a property belongs to, and \(\text{schemaType}\), giving each SchemaNode its JSON type tag (object, string, array, and so on).
Equations: \(\text{schemaType}(\text{owner}(p)) = \texttt{object}\). Properties belong to objects, not to strings or arrays.
Sorts: Table, Column, ForeignKey, \(\text{Constraint}(c\colon \text{Column})\)
Operations: \(\text{table}\colon \text{Column} \to \text{Table}\); \(\text{srcTable}, \text{refTable}\colon \text{ForeignKey} \to \text{Table}\); \(\text{srcColumn}, \text{refColumn}\colon \text{ForeignKey} \to \text{Column}\)
Equations: \(\text{table}(\text{srcColumn}(fk)) = \text{srcTable}(fk)\) and \(\text{table}(\text{refColumn}(fk)) = \text{refTable}(fk)\)
Read the first equation aloud: take a foreign key, look up its source column, ask which table that column belongs to; you get the same answer as asking the foreign key directly for its source table. That is a structural invariant that every SQL database enforces. panproto can verify schema well-formedness before generating any SQL, because the equation is part of the theory.
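Checking such an equation against a concrete schema is a mechanical lookup. A toy Rust sketch (illustrative names, not panproto's checker) that evaluates both foreign-key equations against a small model:

```rust
use std::collections::HashMap;

/// A foreign key as four references, mirroring the theory's operations
/// srcTable, srcColumn, refTable, refColumn.
struct ForeignKey {
    src_table: &'static str,
    src_column: &'static str,
    ref_table: &'static str,
    ref_column: &'static str,
}

/// Check the two equations on a model:
///   table(srcColumn(fk)) = srcTable(fk)
///   table(refColumn(fk)) = refTable(fk)
/// `table_of` plays the role of the operation table: Column -> Table.
fn satisfies_equations(table_of: &HashMap<&str, &str>, fk: &ForeignKey) -> bool {
    table_of.get(fk.src_column) == Some(&fk.src_table)
        && table_of.get(fk.ref_column) == Some(&fk.ref_table)
}

fn main() {
    let mut table_of = HashMap::new();
    table_of.insert("posts.author_id", "posts");
    table_of.insert("authors.id", "authors");

    let fk = ForeignKey {
        src_table: "posts",
        src_column: "posts.author_id",
        ref_table: "authors",
        ref_column: "authors.id",
    };
    assert!(satisfies_equations(&table_of, &fk));

    // A candidate that violates the first equation is not a valid model,
    // hence not a well-formed schema.
    let bad = ForeignKey { src_table: "authors", ..fk };
    assert!(!satisfies_equations(&table_of, &bad));
}
```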
Sorts: Message, Field, EnumType, EnumValue
Operations: at minimum \(\text{message}\colon \text{Field} \to \text{Message}\) and \(\text{fieldNumber}\), assigning each Field its wire-format number.
Equations: For distinct fields \(f_1, f_2\) in the same message, \(\text{fieldNumber}(f_1) \neq \text{fieldNumber}(f_2)\).
This uniqueness constraint is what makes Protobuf’s wire format work. It is specific to Protobuf; JSON Schema and SQL have nothing like it.
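Checking the disequality is a one-pass scan over a message's fields. A small Rust sketch (illustrative, not panproto's implementation):

```rust
use std::collections::HashSet;

/// Check Protobuf's disequality equation: within one message, distinct
/// fields must carry distinct field numbers.
fn field_numbers_unique(fields: &[(&str, u32)]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false on a duplicate, so `all` fails fast.
    fields.iter().all(|(_, n)| seen.insert(*n))
}

fn main() {
    let ok = [("title", 1), ("body", 2), ("author", 3)];
    assert!(field_numbers_unique(&ok));

    // Reusing a field number would make the wire format ambiguous.
    let bad = [("title", 1), ("body", 1)];
    assert!(!field_numbers_unique(&bad));
}
```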
The SQL theory has two equations, both of the form “navigating through the column gives the same result as asking the foreign key directly.” Protobuf has a disequality constraint on field numbers. JSON Schema has one equation constraining owner. What does the presence of equations tell you about a schema language’s structural complexity? What would a theory with no equations mean?
Equations encode structural invariants. A theory with equations describes a language where certain relationships between elements are guaranteed: any model must satisfy them, or it is not a valid schema in that language. A theory with no equations imposes no invariants beyond the types of the operations. Any assignment of sets and functions of the right types is a valid model. More equations means the language is more opinionated about what a well-formed schema looks like.
These mathematical definitions are not just notation. They are the data structures panproto operates on.
panproto represents theories as runtime values: data you construct, inspect, and pass to algorithms.
A SortParam captures the name and sort of each parameter in a dependent sort. For \(\text{Constraint}(v\colon \text{Vertex})\), there is one SortParam with name "v" and a sort reference pointing to Vertex.
A Sort has a name and a list of parameters. Vertex and Edge have empty parameter lists. \(\text{Constraint}(v\colon \text{Vertex})\) has one entry.
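A plausible shape for these two types, sketched in the same style as the Operation struct shown next (the crate's exact field names and derives may differ):

```rust
use std::sync::Arc;

/// One parameter of a dependent sort, e.g. the `v: Vertex` in
/// `Constraint(v: Vertex)`. Illustrative sketch.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SortParam {
    /// The parameter name (e.g., "v").
    pub name: Arc<str>,
    /// The name of the parameter's sort (e.g., "Vertex").
    pub sort: Arc<str>,
}

/// A sort declaration. Simple sorts like `Vertex` and `Edge` have no
/// parameters; `Constraint(v: Vertex)` has one.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Sort {
    /// The sort name (e.g., "Vertex", "Constraint").
    pub name: Arc<str>,
    /// Parameters, empty for simple sorts.
    pub params: Vec<SortParam>,
}

fn main() {
    let constraint = Sort {
        name: Arc::from("Constraint"),
        params: vec![SortParam {
            name: Arc::from("v"),
            sort: Arc::from("Vertex"),
        }],
    };
    assert_eq!(constraint.params.len(), 1);
    assert_eq!(&*constraint.params[0].sort, "Vertex");
}
```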
```rust
#[derive(Debug, Clone, PartialEq, Eq, Hash, serde::Serialize, serde::Deserialize)]
pub struct Operation {
    /// The operation name (e.g., "src", "tgt", "compose").
    pub name: Arc<str>,
    /// Typed inputs as `(param_name, sort_name)` pairs.
    pub inputs: Vec<(Arc<str>, Arc<str>)>,
    /// The output sort name.
    pub output: Arc<str>,
}
```

An Operation has a name, input sorts, and an output sort. The operation \(\text{src}\colon \text{Edge} \to \text{Vertex}\) becomes an Operation with one input of sort Edge and output sort Vertex.
```rust
#[derive(Debug, Clone)]
pub struct Theory {
    /// The theory name (e.g., "Monoid", "Category").
    pub name: Arc<str>,
    /// Names of parent theories this theory extends.
    pub extends: Vec<Arc<str>>,
    /// Sort declarations.
    pub sorts: Vec<Sort>,
    /// Operation declarations.
    pub ops: Vec<Operation>,
    /// Equations (axioms).
    pub eqs: Vec<Equation>,
    /// Directed equations (rewrite rules).
    pub directed_eqs: Vec<DirectedEquation>,
    /// Conflict resolution policies for merge operations.
    pub policies: Vec<ConflictPolicy>,
    /// O(1) sort lookup by name.
    sort_idx: FxHashMap<Arc<str>, usize>,
    /// O(1) operation lookup by name.
    op_idx: FxHashMap<Arc<str>, usize>,
    /// O(1) equation lookup by name.
    eq_idx: FxHashMap<Arc<str>, usize>,
    /// O(1) directed equation lookup by name.
    directed_eq_idx: FxHashMap<Arc<str>, usize>,
    /// O(1) policy lookup by name.
    policy_idx: FxHashMap<Arc<str>, usize>,
}
```

The Theory struct holds sorts, operations, and equations. When panproto diffs two schemas, plans a migration, or generates code, it consults the theory to know what structural shape it is working with.
Here is a theory constructed as actual Rust code from the panproto-gat crate: the abstract theory of editable structures, whose models are concrete edit algebras.
```rust
/// The abstract theory of editable structures (Hofmann, Pierce, Wagner 2012).
///
/// An editable structure is a triple `(S, E, ·)` where `S` is a set of
/// states, `E` is a monoid of edits, and `·` is a partial monoid action
/// `E × S → S`.
///
/// This theory declares the sorts and operations without fixing any
/// concrete model. Models of this theory are concrete edit algebras
/// (e.g., `TreeEdit` for W-type instances, `TableEdit` for functor
/// instances).
#[must_use]
pub fn th_editable_structure() -> Theory {
    use crate::eq::{Equation, Term};
    use crate::op::Operation;
    use crate::sort::Sort;

    let sorts = vec![Sort::simple("State"), Sort::simple("Edit")];
    let ops = vec![
        // identity() → Edit
        Operation::new("identity", vec![], "Edit"),
        // compose(e1: Edit, e2: Edit) → Edit
        Operation::new(
            "compose",
            vec![("e1".into(), "Edit".into()), ("e2".into(), "Edit".into())],
            "Edit",
        ),
        // apply(e: Edit, s: State) → State (partial)
        Operation::new(
            "apply",
            vec![("e".into(), "Edit".into()), ("s".into(), "State".into())],
            "State",
        ),
    ];
    // … the monoid equations for `compose`/`identity` and the assembled
    // Theory value follow in the crate.
}
```

The code mirrors the mathematical specification: sorts first, then operations with their input and output sorts, then equations as pairs of terms that must be equal.
panproto type-checks every equation when you define a theory. Both sides must have the same sort, and every operation must be applied to arguments of the correct sort. An equation claiming a Vertex-sorted term equals an Edge-sorted term is rejected with a clear error message.
The CLI checks equations automatically during schema add and schema commit. You can also run the check explicitly:
```
schema verify my-theory.json
```

When you compose theories via colimit (a structured merge that glues two theories together while identifying their shared parts), the equations from each component are re-checked in the combined theory.
For the implementation details behind the type-checker and the colimit algorithm, see the GAT engine chapter in the developer guide.2
We now have a precise vocabulary for describing the structure of a single schema language. But panproto supports 76 schema languages, and it treats them all uniformly. The next chapter explains how: by packaging each language as a pair of theories, and parameterizing the engine over that pair.
The foundational treatment of GATs is due to Cartmell (1986).↩︎
The foundational framework: Spivak (2012) showed database schemas can be treated as categories with migration computed by composing maps between them. panproto generalizes this from relational databases to arbitrary schema languages. The GAT framework (Cartmell 1986) adds dependent sorts, implemented via an equivalence with Essentially Algebraic Theories (Adámek and Rosický 1994). The computational architecture adapts GATlab (Lynch et al. 2024) to Rust. Appendix A has the full mathematical treatment.↩︎