9  panproto-mig: The Migration Engine

A Migration encodes a graph morphism between two schemas: vertex maps, edge maps, hyper-edge maps, and contraction resolvers. For a schema with vertices \(\{A, B, C\}\) and a migration that drops \(B\), the vertex map sends \(A \mapsto A'\) and \(C \mapsto C'\), and the resolver specifies which edge in the target connects \(A'\) to \(C'\) (since the intermediate \(B\) is gone).


/// A migration specification: maps between two schemas.
///
/// The vertex and edge maps define the core graph morphism. The resolver
/// and hyper-resolver handle contraction ambiguities that arise when
/// intermediate vertices are dropped.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Migration {
    /// Maps source vertex IDs to target vertex IDs.
    pub vertex_map: HashMap<Name, Name>,
    /// Maps source edges to target edges.
    #[serde(with = "panproto_schema::serde_helpers::map_as_vec")]
    pub edge_map: HashMap<Edge, Edge>,
    /// Maps source hyper-edge IDs to target hyper-edge IDs.
    pub hyper_edge_map: HashMap<Name, Name>,
    /// Maps (hyper-edge ID, label) pairs to new labels.
    #[serde(with = "panproto_schema::serde_helpers::map_as_vec")]
    pub label_map: HashMap<(Name, Name), Name>,
    /// Binary contraction resolver: `(src_vertex, tgt_vertex)` -> resolved edge.
    #[serde(with = "panproto_schema::serde_helpers::map_as_vec")]
    pub resolver: HashMap<(Name, Name), Edge>,
    /// Hyper-edge contraction resolver: maps `(hyper_edge_id, labels)` to
    /// a resolved target hyper-edge ID together with a label remap.
    #[allow(clippy::type_complexity)]
    #[serde(with = "panproto_schema::serde_helpers::map_as_vec")]
    pub hyper_resolver: HashMap<(Name, Vec<Name>), (Name, HashMap<Name, Name>)>,
    /// Expression-based resolvers for enriched migrations.
    #[serde(default, with = "panproto_schema::serde_helpers::map_as_vec_default")]
    pub expr_resolvers: HashMap<(Name, Name), panproto_expr::Expr>,
}
The fields fall into three logical groups; expr_resolvers supplements the third with expression-based resolution for enriched migrations.

Core morphism. The vertex_map and edge_map fields define the fundamental graph morphism from source schema to target schema. Every source vertex that should survive in the target gets an entry in vertex_map; vertices absent from the map are dropped. Similarly, edge_map records how source edges correspond to target edges.

Hyper-edge morphism. The hyper_edge_map and label_map fields handle protocols whose schema theories include the HyperEdge sort (Group B protocols such as SQL, Parquet, and Cassandra). A hyper-edge connects multiple vertices via labeled positions, so the mapping must track both the hyper-edge identity and the label assignments.

Contraction resolvers. The resolver and hyper_resolver fields handle the case where a migration contracts intermediate vertices. For example, if a schema has the path \(A \to B \to C\) and the migration drops \(B\), the resulting schema connects \(A\) directly to \(C\). But there may be multiple edges between \(A\) and \(C\) in the target. The resolver records which one to use. The hyper_resolver does the same for hyper-edge contraction, additionally providing label remaps.
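The resolver lookup can be sketched in miniature. This is an illustration, not the crate's implementation: String vertex names stand in for `Name`, and a `(src, tgt, name)` triple stands in for the richer `Edge` type.

```rust
use std::collections::HashMap;

// Stand-in types for the sketch; the real Name and Edge live in panproto-schema.
type Vertex = String;
type Edge = (String, String, String); // (source vertex, target vertex, edge name)

/// After dropping an intermediate vertex, pick the target edge that should
/// connect the two surviving endpoints by consulting the resolver.
fn resolve_contraction(
    resolver: &HashMap<(Vertex, Vertex), Edge>,
    src: &str,
    tgt: &str,
) -> Option<Edge> {
    resolver.get(&(src.to_string(), tgt.to_string())).cloned()
}

fn main() {
    // Source schema has the path A -> B -> C; the migration drops B.
    // The resolver records which target edge connects A directly to C.
    let mut resolver = HashMap::new();
    resolver.insert(
        ("A".to_string(), "C".to_string()),
        ("A".to_string(), "C".to_string(), "a_to_c".to_string()),
    );
    let edge = resolve_contraction(&resolver, "A", "C").unwrap();
    assert_eq!(edge.2, "a_to_c");
}
```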

9.0.1 Identity and empty migrations

Two constructors cover the common base cases: an identity migration that maps everything to itself, and an empty migration with no mappings at all.


impl Migration {
    /// Create an identity migration for the given schema vertex and edge sets.
    ///
    /// Every vertex maps to itself and every edge maps to itself.
    #[must_use]
    pub fn identity(vertices: &[Name], edges: &[Edge]) -> Self {
        let vertex_map: HashMap<Name, Name> =
            vertices.iter().map(|v| (v.clone(), v.clone())).collect();
        let edge_map: HashMap<Edge, Edge> = edges.iter().map(|e| (e.clone(), e.clone())).collect();
        Self {
            vertex_map,
            edge_map,
            hyper_edge_map: HashMap::new(),
            label_map: HashMap::new(),
            resolver: HashMap::new(),
            hyper_resolver: HashMap::new(),
            expr_resolvers: HashMap::new(),
        }
    }

    /// Create an empty migration (no mappings).
    #[must_use]
    pub fn empty() -> Self {
        Self {
            vertex_map: HashMap::new(),
            edge_map: HashMap::new(),
            hyper_edge_map: HashMap::new(),
            label_map: HashMap::new(),
            resolver: HashMap::new(),
            hyper_resolver: HashMap::new(),
            expr_resolvers: HashMap::new(),
        }
    }
}

Migration::identity is the starting point for incremental migration construction. Begin with identity and then override specific entries to express the desired transformation. It also serves as the unit element for composition: composing any migration with the identity yields the original migration.

Migration::empty serves as the zero element when building migrations programmatically. It’s primarily useful in test scaffolding or when constructing migrations from combinators where the initial state has no mappings.
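The identity-then-override pattern can be sketched with plain HashMaps standing in for the crate's `Name` type: build an identity vertex map, then replace only the entries you want to change.

```rust
use std::collections::HashMap;

// Sketch of incremental construction: mirrors the logic of
// Migration::identity for the vertex map alone (String stands in for Name).
fn identity_vertex_map(vertices: &[&str]) -> HashMap<String, String> {
    vertices
        .iter()
        .map(|v| (v.to_string(), v.to_string()))
        .collect()
}

fn main() {
    let mut vertex_map = identity_vertex_map(&["body", "body.text"]);
    // Override one entry to express a rename; everything else stays identity.
    vertex_map.insert("body.text".to_string(), "body.content".to_string());
    assert_eq!(vertex_map["body"], "body");
    assert_eq!(vertex_map["body.text"], "body.content");
}
```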

9.1 The migration pipeline

The full pipeline follows five stages:

sequenceDiagram
    participant User
    participant Mig as Migration
    participant Exist as check_existence
    participant Comp as compile()
    participant CM as CompiledMigration
    participant Lift as lift_wtype / lift_functor

    User->>Mig: define vertex_map, edge_map, ...
    Mig->>Exist: check_existence(protocol, src, tgt, mig, registry)
    Exist-->>Mig: ExistenceReport { valid, errors }
    Mig->>Comp: compile(src, tgt, migration)
    Comp-->>CM: CompiledMigration
    CM->>Lift: lift_wtype(instance, compiled) or lift_functor(...)
    Lift-->>User: transformed instance

  1. Existence checking (check_existence): validates that the migration is well-formed with respect to the protocol’s theories.
  2. Compilation (compile): pre-computes surviving sets, remapping tables, and resolvers for fast per-record application.
  3. Lifting (lift_wtype, lift_functor): applies the compiled migration to concrete tree or table instances.
  4. Composition (compose): chains two sequential migrations into a single equivalent migration.
  5. Inversion (invert): constructs the inverse migration when the original is bijective.

Stages 1 through 3 form the critical path for applying a single migration. Stages 4 and 5 are used when managing sequences of migrations or implementing rollback capabilities.

9.2 Theory-derived existence checking

The existence checker inspects the protocol’s schema and instance theory sorts at runtime to determine which checks apply. The validation rules are derived from the theory’s structure, so adding a new protocol with novel theory composition requires no code changes to panproto-mig.1

9.2.1 ExistenceReport and check_existence

/// Result of existence checking.
#[derive(Debug, serde::Serialize, serde::Deserialize)]
pub struct ExistenceReport {
    /// Whether all conditions are satisfied.
    pub valid: bool,
    /// Individual errors (empty when `valid` is true).
    pub errors: Vec<ExistenceError>,
}

/// Check existence conditions for a migration.
///
/// The conditions checked are DERIVED from the schema and instance
/// theory structure, not a hardcoded list. The function inspects
/// the theory's sorts to decide which checks to apply.
///
/// Always checks: vertex map validity, edge map validity, kind consistency.
/// Conditionally checks (based on theory sorts):
/// - `Constraint` sort present -> constraint compatibility
/// - `HyperEdge` sort present -> signature coherence + simultaneity
/// - Instance theory has `Node` sort (W-type) -> reachability risks
#[must_use]
pub fn check_existence(
    protocol: &Protocol,
    src: &Schema,
    tgt: &Schema,
    migration: &Migration,
    theory_registry: &HashMap<String, Theory>,
) -> ExistenceReport {
    let mut errors = Vec::new();

    // Look up the schema theory to determine which checks apply.
    let schema_theory = theory_registry.get(&protocol.schema_theory);

    if let Some(theory) = schema_theory {
        // Theory-derived conditional checks.
        if theory.find_sort("Constraint").is_some() {
            errors.extend(check_constraint_compatibility(src, tgt, migration));
        }
        if theory.find_sort("HyperEdge").is_some() {
            errors.extend(check_signature_coherence(src, tgt, migration));
            errors.extend(check_simultaneity(src, tgt, migration));
        }
    }

    // Look up the instance theory for W-type checks.
    let inst_theory = theory_registry.get(&protocol.instance_theory);
    if let Some(theory) = inst_theory {
        if theory.find_sort("Node").is_some() {
            errors.extend(check_reachability(src, tgt, migration));
        }
    }

    // New theory-derived checks from building blocks.
    if let Some(theory) = schema_theory {
        if theory.find_sort("Variant").is_some() {
            errors.extend(check_variant_preservation(src, tgt, migration));
        }
        if theory.find_sort("Position").is_some() {
            errors.extend(check_order_compatibility(src, tgt, migration));
        }
        if theory.find_sort("Mu").is_some() {
            errors.extend(check_recursion_compatibility(src, tgt, migration));
        }
        if theory.find_sort("Usage").is_some() {
            errors.extend(check_linearity(src, tgt, migration));
        }
    }

    // Always check basic morphism validity.
    errors.extend(check_vertex_map(src, tgt, migration));
    errors.extend(check_edge_map(src, tgt, migration));
    errors.extend(check_kind_consistency(src, tgt, migration));

    ExistenceReport {
        valid: errors.is_empty(),
        errors,
    }
}

9.2.2 How theory inspection works

The function looks up the protocol’s schema theory and instance theory from a registry, then queries for specific sorts to decide which validation checks to run:

  • Always checked (regardless of theory structure): vertex map validity, edge map validity, kind consistency. These are fundamental properties of any graph morphism.
  • Constraint sort present in schema theory: enables constraint compatibility checking. The target schema must not impose strictly tighter constraints than the source. For example, reducing maxLength from 3000 to 300 means existing data with 3000-character strings would violate the new constraint.
  • HyperEdge sort present in schema theory: enables two additional checks. Signature coherence verifies that mapped hyper-edges have compatible label-to-vertex assignments. Simultaneity verifies that all labels in a target hyper-edge reference surviving vertices (you can’t have a SQL foreign key that references a dropped table).
  • Node sort present in instance theory: enables reachability risk detection for tree instances. When intermediate vertices are dropped, subtrees connected through those vertices may become disconnected from the root, producing orphaned data.

flowchart TD
    A[check_existence] --> B{Schema theory lookup}
    B --> C[Always: vertex map validity]
    B --> D[Always: edge map validity]
    B --> E[Always: kind consistency]
    B --> F{Constraint sort?}
    F -->|Yes| G[Constraint compatibility]
    F -->|No| H[Skip]
    B --> I{HyperEdge sort?}
    I -->|Yes| J[Signature coherence]
    I -->|Yes| K[Simultaneity]
    I -->|No| L[Skip]
    A --> M{Instance theory lookup}
    M --> N{Node sort?}
    N -->|Yes| O[Reachability risks]
    N -->|No| P[Skip]

A protocol whose schema theory has both Constraint and HyperEdge sorts (like SQL) will get all conditional checks. A protocol with neither (like a minimal custom protocol) will only get the universal checks.
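The sort-driven dispatch can be sketched as a standalone function. The sort and check names follow the chapter; the lookup itself is illustrative, not the crate's code.

```rust
use std::collections::HashSet;

/// Which existence checks apply, given the sorts present in the schema theory.
fn applicable_checks(sorts: &HashSet<&str>) -> Vec<&'static str> {
    // Universal checks run regardless of theory structure.
    let mut checks = vec!["vertex_map", "edge_map", "kind_consistency"];
    if sorts.contains("Constraint") {
        checks.push("constraint_compatibility");
    }
    if sorts.contains("HyperEdge") {
        checks.push("signature_coherence");
        checks.push("simultaneity");
    }
    checks
}

fn main() {
    // A SQL-like theory with both sorts gets all conditional checks.
    let sql_like: HashSet<&str> = ["Constraint", "HyperEdge"].into_iter().collect();
    assert_eq!(applicable_checks(&sql_like).len(), 6);
    // A minimal custom theory gets only the universal checks.
    let minimal: HashSet<&str> = HashSet::new();
    assert_eq!(applicable_checks(&minimal).len(), 3);
}
```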

Caution: What if the theory registry doesn’t have my protocol’s theories?

check_existence returns ExistenceError::TheoryNotFound. Register the protocol’s theories before calling check_existence. The panproto-protocols crate auto-registers all built-in protocols; custom protocols must call registry.register() explicitly.

9.2.3 Constraint compatibility in detail

The constraint compatibility check verifies two properties:

  1. No tightening: for each mapped vertex, every constraint in the target must not be strictly more restrictive than the corresponding constraint in the source. For upper-bound constraints (maxLength, maxSize, maximum), a smaller target value is tighter. For lower-bound constraints (minLength, minimum), a larger target value is tighter.

  2. No missing required fields: if the target schema marks an edge as required for a vertex, the source must have a corresponding edge (possibly via the edge map) that maps to it. Adding a required field with no default is a migration obstruction.
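The no-tightening rule for bound constraints can be sketched as follows, using the constraint names from the examples above; this is an illustration of the rule, not the crate's implementation.

```rust
/// Is the target constraint strictly tighter than the source constraint?
/// For upper bounds a smaller target value is tighter; for lower bounds a
/// larger target value is tighter. Unknown kinds are treated as compatible.
fn is_tightened(kind: &str, src: i64, tgt: i64) -> bool {
    match kind {
        "maxLength" | "maxSize" | "maximum" => tgt < src, // upper bound shrank
        "minLength" | "minimum" => tgt > src,             // lower bound grew
        _ => false,
    }
}

fn main() {
    // The chapter's example: reducing maxLength from 3000 to 300 is tightening.
    assert!(is_tightened("maxLength", 3000, 300));
    // Loosening is always allowed.
    assert!(!is_tightened("maxLength", 300, 3000));
    assert!(is_tightened("minLength", 1, 5));
}
```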

9.2.4 Reachability in detail

The reachability check applies only when the instance theory has a Node sort (tree instances). It approximates whether dropping vertices will disconnect subtrees from the root by checking, for each surviving non-root vertex, whether it has at least one incoming edge from another surviving vertex in the source schema.
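The parent-survival approximation can be sketched as a standalone function over `(source, target)` edge pairs; this is a hedged illustration of the idea, not the crate's code.

```rust
use std::collections::HashSet;

/// Surviving non-root vertices with no incoming edge from another surviving
/// vertex: candidates for orphaned subtrees after the migration.
fn orphaned<'a>(
    edges: &[(&'a str, &'a str)],
    surviving: &HashSet<&'a str>,
    root: &str,
) -> Vec<&'a str> {
    surviving
        .iter()
        .filter(|v| **v != root)
        .filter(|v| !edges.iter().any(|(s, t)| t == *v && surviving.contains(s)))
        .copied()
        .collect()
}

fn main() {
    // Path A -> B -> C with B dropped: C loses its only surviving parent.
    let edges = [("A", "B"), ("B", "C")];
    let surviving: HashSet<&str> = ["A", "C"].into_iter().collect();
    assert_eq!(orphaned(&edges, &surviving, "A"), vec!["C"]);
}
```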

Warning

The reachability check is an approximation. It verifies that each surviving vertex has a surviving parent, but it doesn’t compute full transitive reachability from the root. In practice this is sufficient for tree-structured instances, but pathological graphs with long chains of dropped vertices could produce false negatives.

9.3 Compilation

The compile function translates a Migration specification into a CompiledMigration, a form optimized for per-record application. Compilation pre-computes:

  1. Surviving vertex set: the image of vertex_map in the target schema. Only nodes anchored at these vertices will be kept during lifting.
  2. Surviving edge set: the image of edge_map. Only arcs corresponding to these edges will be preserved.
  3. Vertex and edge remap tables: maps from source IDs to target IDs. These allow the lifter to update anchors and edge labels in \(O(1)\) per element.
  4. Resolver copies: the contraction resolvers carried forward for use during lifting. When the lifter encounters two surviving nodes that weren’t directly connected in the source, it consults the resolver to determine which target edge to use.

The compiled form enables lift_wtype and lift_functor to process each record in \(O(n)\) time (where \(n\) is node count) without re-deriving the migration structure on every invocation. For high-throughput scenarios (e.g., migrating millions of ATProto records), this separation between compilation and execution is critical for performance.
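The pre-computation in items 1 and 3 can be sketched with plain maps. String names stand in for `Name`, and the real `CompiledMigration` also carries edge tables and resolver copies.

```rust
use std::collections::{HashMap, HashSet};

// Sketch: the surviving set is the image of the map, and the remap table is
// the map itself, kept for O(1) lookups during lifting.
struct CompiledSketch {
    surviving_verts: HashSet<String>,
    vertex_remap: HashMap<String, String>,
}

fn compile_sketch(vertex_map: &HashMap<String, String>) -> CompiledSketch {
    CompiledSketch {
        surviving_verts: vertex_map.values().cloned().collect(),
        vertex_remap: vertex_map.clone(),
    }
}

fn main() {
    let mut vertex_map = HashMap::new();
    vertex_map.insert("body.text".to_string(), "body.content".to_string());
    let compiled = compile_sketch(&vertex_map);
    assert!(compiled.surviving_verts.contains("body.content"));
    assert_eq!(compiled.vertex_remap["body.text"], "body.content");
}
```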

Tip

If your migration has already passed existence checking, compilation won’t fail unless the migration references vertices or edges absent from the target schema (which existence checking should have caught). Always run existence checking first; it produces much more informative error messages than compilation failures.

9.4 Lifting

Lifting applies a compiled migration to a concrete instance. panproto supports two lifting modes corresponding to the two families of instance theories.

9.4.1 Tree lifting

lift_wtype handles tree-shaped instances used by JSON-like, XML-like, and document protocols:

  1. Walks the source instance’s node tree from the root.
  2. For each node, checks whether its anchor vertex is in the surviving set.
  3. If surviving: remaps the anchor to its target vertex ID and copies the node to the result.
  4. If dropped: omits the node. Any children anchored at surviving vertices are re-parented using the resolver.
  5. Rebuilds the arc list and fan list based on surviving nodes and the edge remap table.

The resolver is critical during step 4. When a dropped intermediate vertex B sat between A and C, the lifter needs to create a direct arc from A to C. If multiple edges exist between A and C in the target schema, the resolver disambiguates.
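Step 4's re-parenting can be sketched on a toy tree of `(id, anchor, parent)` triples. The real lifter also rebuilds arcs and fans and consults the resolver for the connecting edge; this sketch only shows how children walk up past dropped ancestors.

```rust
use std::collections::{HashMap, HashSet};

/// For each surviving node, find its new parent: the nearest ancestor whose
/// anchor vertex survives the migration. Dropped nodes are omitted.
fn lift_parents(
    nodes: &[(u32, &str, Option<u32>)],
    surviving: &HashSet<&str>,
) -> HashMap<u32, Option<u32>> {
    let by_id: HashMap<u32, (&str, Option<u32>)> =
        nodes.iter().map(|(id, a, p)| (*id, (*a, *p))).collect();
    let mut out = HashMap::new();
    for (id, anchor, parent) in nodes {
        if !surviving.contains(anchor) {
            continue; // dropped node: omitted from the result
        }
        // Walk up past dropped ancestors to find a surviving parent.
        let mut p = *parent;
        while let Some(pid) = p {
            let (panchor, pparent) = by_id[&pid];
            if surviving.contains(panchor) {
                break;
            }
            p = pparent;
        }
        out.insert(*id, p);
    }
    out
}

fn main() {
    // Tree 1(A) -> 2(B) -> 3(C); dropping anchor B re-parents node 3 onto node 1.
    let nodes = [(1, "A", None), (2, "B", Some(1)), (3, "C", Some(2))];
    let surviving: HashSet<&str> = ["A", "C"].into_iter().collect();
    let lifted = lift_parents(&nodes, &surviving);
    assert_eq!(lifted[&3], Some(1));
    assert!(!lifted.contains_key(&2));
}
```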

9.4.2 Table lifting

lift_functor handles tabular instances used by relational and columnar protocols:

  1. Iterates over tables in the source instance.
  2. For each table, checks whether its anchor vertex is in the surviving set.
  3. If surviving: copies all rows, remapping foreign keys according to the vertex remap table and dropping columns for non-surviving edges.
  4. If dropped: omits the entire table.

Both lifting functions delegate to the corresponding restrict operations in panproto-inst.

9.5 Composition and inversion

9.5.1 Composition

Composition chains two migrations \(m_1: S_1 \to S_2\) and \(m_2: S_2 \to S_3\) into a single migration \(m_2 \circ m_1: S_1 \to S_3\). The composed vertex map is the functional composition of the two vertex maps: for each source vertex \(v\) in \(S_1\), the composed map sends \(v\) to \(m_2(m_1(v))\). The edge and hyper-edge maps compose similarly.

Resolvers from both migrations are merged, with \(m_2\)’s resolvers taking precedence when keys overlap. This is correct because the second migration’s resolvers encode decisions about the final target schema.

Composition is associative: \((m_3 \circ m_2) \circ m_1 = m_3 \circ (m_2 \circ m_1)\). Combined with the identity migration as the unit, this makes migrations composable in the expected way.2
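The vertex-map composition can be sketched directly: vertices dropped by either migration are dropped from the composite. This is an illustration of the rule, not the crate's compose.

```rust
use std::collections::HashMap;

/// Composed vertex map: sends v to m2(m1(v)); entries with no image in m2
/// (i.e., vertices the second migration drops) disappear from the composite.
fn compose_vertex_maps(
    m1: &HashMap<String, String>,
    m2: &HashMap<String, String>,
) -> HashMap<String, String> {
    m1.iter()
        .filter_map(|(v, mid)| m2.get(mid).map(|tgt| (v.clone(), tgt.clone())))
        .collect()
}

fn main() {
    let m1: HashMap<_, _> = [("A".to_string(), "A1".to_string())].into_iter().collect();
    let m2: HashMap<_, _> = [("A1".to_string(), "A2".to_string())].into_iter().collect();
    let composed = compose_vertex_maps(&m1, &m2);
    assert_eq!(composed["A"], "A2");
}
```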

9.5.2 Inversion

Inversion constructs \(m^{-1}: S_2 \to S_1\) from a migration \(m: S_1 \to S_2\), but only when \(m\) is bijective. Specifically, it requires:

  • Injectivity of vertex_map: no two source vertices map to the same target vertex. If user.name and user.displayName both map to user.label, the inverse can’t know which to restore.
  • Surjectivity of vertex_map: every target vertex has a preimage. If the target has a vertex that no source vertex maps to, the inverse has no source for it.
  • The same conditions on edge_map and hyper_edge_map.

If any condition fails, invert returns an InvertError identifying the specific obstruction.

Important

Inversion requires bijectivity. A migration that drops vertices or contracts edges is not invertible. Use the lens system (Chapter 10) when you need round-trip transformations for non-bijective mappings.

Caution: How do you test whether a migration is invertible before calling invert?

Check that all values in vertex_map are distinct (injectivity) and that vertex_map.len() == target_schema.vertices.len() (which, given injectivity and valid target vertices, implies surjectivity). Apply the same checks to edge_map. Or just call invert and handle the InvertError.
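That recipe can be sketched for the vertex map alone, with String names standing in for `Name`; the real invert also handles edge and hyper-edge maps and returns a structured error.

```rust
use std::collections::{HashMap, HashSet};

/// Invert a vertex map if it is bijective onto the target vertex set:
/// distinct values (injective) that cover every target vertex (surjective).
fn try_invert(
    map: &HashMap<String, String>,
    target_vertices: &HashSet<String>,
) -> Option<HashMap<String, String>> {
    let values: HashSet<&String> = map.values().collect();
    let injective = values.len() == map.len();
    let surjective = target_vertices.iter().all(|v| values.contains(v));
    if injective && surjective {
        // The inverse simply swaps keys and values.
        Some(map.iter().map(|(k, v)| (v.clone(), k.clone())).collect())
    } else {
        None
    }
}

fn main() {
    let map: HashMap<_, _> =
        [("text".to_string(), "content".to_string())].into_iter().collect();
    let targets: HashSet<_> = ["content".to_string()].into_iter().collect();
    let inv = try_invert(&map, &targets).unwrap();
    assert_eq!(inv["content"], "text");
}
```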

9.6 Type-checked migration derivation

When the VCS schema add command auto-derives a migration from a schema diff, the derived migration undergoes GAT-level validation before it is staged. This validation is performed by the gat_validate module in panproto-vcs.

9.6.1 Migration structure validation

The validate_migration function inspects the derived migration’s vertex and edge maps for structural coherence:

  • Non-empty vertex map: a migration that maps zero vertices is flagged with a warning (it may indicate the diff algorithm failed to match vertices across schema versions).
  • Edge endpoint coherence: for each mapped edge (old_edge -> new_edge), the function verifies that the new edge’s source and target vertices are in the vertex map’s image. If an edge maps to endpoints that aren’t themselves mapped, the migration may be inconsistent.

These checks produce warnings (stored in GatDiagnostics::migration_warnings) rather than hard errors, because some legitimate migrations intentionally drop vertices while retaining edges to surviving neighbors.

9.6.2 Theory equation validation

The validate_theory_equations function runs typecheck_theory on the protocol’s schema theory. This catches malformed equations that could arise from theory composition or inheritance. The validate_schema_equations function goes further: it builds a model from the schema and runs check_model_with_options to verify that the schema satisfies the protocol theory’s equations.

If either check finds errors, the staging pipeline marks the migration as ValidationStatus::Invalid and populates the gat_diagnostics field of the staged schema. The subsequent schema commit will be blocked unless --skip-verify is passed.

9.6.3 Integration with auto_mig

The auto_mig::derive_migration function produces a Migration struct from a schema diff. The VCS Repository::add method calls this function, then immediately runs GAT validation on the result:

schema diff -> auto_mig::derive_migration -> gat_validate::validate_migration -> stage

If the auto-derived migration maps very few vertices (less than half of the old schema’s vertices), the pipeline falls back to find_best_morphism from panproto-mig’s homomorphism search, which uses a more expensive but more thorough algorithm to find the best vertex correspondence.

9.7 Coherence checking in compose_path

When the VCS merge algorithm composes migration paths through different branch histories, it must verify that the composed path is coherent: the composed maps must commute with all operations.3

The check_natural_transformation function from panproto-gat is used at two points in the merge pipeline:

  1. Three-way merge verification: after computing the merged schema, the merge algorithm constructs compatible families of maps from the “ours” and “theirs” migration morphisms to the merge result. These must commute with all operations for the merge to be semantically correct.

  2. Rebase coherence: when rebasing a chain of commits onto a new base, each replayed commit’s migration is composed with the base migration. The vertical_compose function chains these maps, and check_natural_transformation validates each composed result.

The merge result struct includes a pullback_overlap field that reports the shared structure of the two branch theories over the merge base theory. Non-empty overlap indicates shared structure that both branches modified, which requires careful reconciliation.

9.8 Worked example: renaming a field

Consider renaming the text field to content in a schema with vertices \(\{\texttt{body}, \texttt{body.text}\}\) and an edge \(\texttt{body} \to \texttt{body.text}\) with name "text".

Step 1: Define the migration.

The vertex_map maps body to body and body.text to body.content. The edge_map maps the old edge to a new edge with name "content". No resolvers are needed because no vertices are dropped.

Step 2: Check existence.

check_existence looks up the protocol’s schema theory. If the theory has a Constraint sort, it verifies that any constraints on body.text are compatible with the constraints on body.content in the target. Since this is a rename (same kind, same constraints), all checks pass.

Step 3: Compile.

compile produces a CompiledMigration with:

  • surviving_verts: {body, body.content}
  • surviving_edges: {body -> body.content (content)}
  • vertex_remap: {body.text -> body.content}
  • edge_remap: old edge maps to new edge

Step 4: Lift.

lift_wtype walks the source instance. Nodes anchored at body keep their anchor. Nodes anchored at body.text have their anchor remapped to body.content. Arcs are updated to use the new edge. The result is a valid instance under the target schema.

This is a bijective migration, so invert would also succeed, producing a migration that renames content back to text.

9.9 Coverage analysis (partiality)

File: crates/panproto-mig/src/coverage.rs

Not every record can survive every migration. A coercion from string to int fails on non-numeric strings. A constraint tightening rejects values that violate the new constraint. The check_coverage function provides dry-run capability: it attempts the migration on each record individually and reports which succeed and which fail, without modifying any data.

pub struct CoverageReport {
    pub total_records: usize,
    pub successful: usize,
    pub failed: Vec<PartialFailure>,
    pub coverage_ratio: f64,
}

Each PartialFailure records the failing record’s ID and the reason: ConstraintViolation, MissingRequiredField, TypeMismatch, or ExprEvalFailed. This integrates with the CLI’s schema migrate --dry-run command, which prints a coverage report before committing to the migration.
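The ratio computation can be sketched with `Result` standing in for the per-record lift outcome; this assumes at least one record and is illustrative rather than the crate's check_coverage.

```rust
/// Fraction of records that would survive the migration, given the dry-run
/// outcome for each record (Err carries a failure reason string here).
fn coverage_ratio(outcomes: &[Result<(), String>]) -> f64 {
    let ok = outcomes.iter().filter(|o| o.is_ok()).count();
    ok as f64 / outcomes.len() as f64
}

fn main() {
    let outcomes = vec![Ok(()), Ok(()), Err("TypeMismatch".to_string()), Ok(())];
    // 3 of 4 records migrate cleanly.
    assert!((coverage_ratio(&outcomes) - 0.75).abs() < 1e-9);
}
```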

9.10 Performance considerations

The migration pipeline is designed for throughput:

  • Existence checking is \(O(|V| + |E| + |H|)\) for the universal checks, plus \(O(|C|)\) for the conditional constraint check and \(O(|H| \cdot |L|)\) for the conditional signature coherence check (where \(V\), \(E\), \(H\), \(C\), \(L\) are vertex map, edge map, hyper-edge map, constraints, and labels respectively). This is fast even for large schemas.
  • Compilation is \(O(|m|)\) where \(m\) is the migration size. It happens once per schema version pair.
  • Lifting is \(O(n)\) per record (where \(n\) is node count), with hash-map lookups for remap and resolver operations. For tree instances, this is a single tree traversal. For table instances, this is a single pass over rows.
  • Composition is \(O(|V| + |E|)\) and happens once when chaining migrations.

For batch migration scenarios (e.g., migrating a million ATProto records from schema v1 to v2), the cost is dominated by lifting, which is linear in the total data size.

9.11 Error handling

The panproto-mig error types are designed to provide actionable diagnostics:

  • ExistenceError: includes variants for WellFormedness (invalid vertex/edge references), ConstraintTightened, RequiredFieldMissing, KindInconsistency, SignatureCoherence, Simultaneity, ReachabilityRisk, and EdgeMissing. Each variant carries enough context to identify the exact source of the problem.
  • ComposeError: reports when composition fails due to incompatible intermediate schemas.
  • InvertError: identifies the specific bijectivity violation (non-injective vertex map, non-surjective edge map, etc.).
  • LiftError: reports runtime failures during instance transformation.

All error types implement Display for human-readable messages and Serialize/Deserialize for machine consumption.

For a hands-on walkthrough of building and applying migrations, see the tutorial.

9.12 Module map

The crate is organized into one module per pipeline stage:

  • migration (exports Migration): the migration specification type with vertex/edge/hyper-edge maps and resolvers.
  • existence (exports ExistenceReport, check_existence): theory-derived validation of migration well-formedness.
  • compile (exports compile): pre-computation of surviving sets and remap tables.
  • lift (exports lift_wtype, lift_functor): instance transformation for tree and table instances.
  • compose (exports compose): sequential composition of two migrations.
  • invert (exports invert): construction of inverse migrations for bijective mappings.
  • error (exports ExistenceError, ComposeError, InvertError, LiftError, MigError): error types for each pipeline stage.

All public types are re-exported from the crate root, so consumers can write use panproto_mig::Migration without navigating the module structure.

9.13 Testing strategy

The crate’s test suite covers each pipeline stage independently:

  • Existence checking tests verify that specific error variants are produced for known-bad migrations (constraint tightening, kind inconsistency, required field missing, etc.). Each test constructs a minimal schema pair and migration, registers the appropriate theories, and asserts that the expected error variant appears in the report.
  • Compilation tests verify that surviving sets and remap tables are computed correctly for various migration shapes.
  • Lifting tests use small hand-crafted instances to verify that nodes are correctly dropped, remapped, and re-parented.
  • Composition tests verify that composing two migrations produces the same result as applying them sequentially.
  • Inversion tests verify that invertible migrations round-trip correctly and that non-invertible migrations produce the expected InvertError.

The test suite is intentionally structured so that each test constructs its own minimal schema pair and migration from scratch, avoiding shared test fixtures that could mask bugs through accidental coupling. See Chapter 17 for the project-wide testing conventions.

Property-based tests (using proptest) supplement the hand-crafted cases, generating random schemas and migrations to verify invariants like “composing a migration with its inverse yields an identity migration” and “existence checking accepts the identity migration for any schema.”

Ehrig, Hartmut, Karsten Ehrig, Ulrike Prange, and Gabriele Taentzer. 2006. Fundamentals of Algebraic Graph Transformation. Monographs in Theoretical Computer Science. Springer. https://doi.org/10.1007/3-540-31188-2.

  1. This approach is inspired by the double-pushout rewriting framework (Ehrig et al. 2006), where validation conditions are derived from the structure of the rule rather than enumerated by hand.

  2. Formally, migrations form a category: the migration category over a fixed protocol.

  3. In categorical terms, this is the naturality condition for natural transformations between the composed morphisms.