13  Automatic Migration Discovery

You have two schema versions where text became body and createdAt became publishedAt. Writing out the migration by hand is tedious. Wouldn’t it be nice if the system could figure out which fields correspond? That’s what this chapter covers.

A migration \(F: S_1 \to S_2\) is a structure-preserving map (Chapter 4 for the definition). Finding one is a constraint satisfaction problem:

panproto solves this with backtracking search, optimized with two heuristics.

13.2 Worked example: renaming fields

The old schema has vertices {post, post.text, post.createdAt}. The new has {post, post.body, post.publishedAt}. A manual migration would require specifying which field maps to which.

With find_best_morphism, panproto discovers this automatically:

  • post maps to post (same kind, same structure)
  • post.text maps to post.body (both string, connected to post via prop edge)
  • post.createdAt maps to post.publishedAt (both datetime, same pattern)

The quality score penalizes name mismatches. If multiple migrations exist, the one preserving the most names wins.

CautionExercise: Breaking ties

If two candidate migrations have the same quality score, which one does panproto pick?

The search is deterministic: it iterates vertices in ID order, so ties are broken by vertex ordering. The first valid morphism found in iteration order wins. If you need a specific migration in ambiguous cases, provide manual hints via the hints parameter.

13.3 Three ways to migrate data along a discovered mapping

A migration mapping \(F\) supports three operations for moving data. Each is mathematically related and captures a different perspective on what the target instance should contain.

13.3.1 Restrict

Project data backward from the new schema onto the old schema, keeping only the parts that the old schema expects. This is what Chapter 9 described in detail. The complement captures what was dropped, enabling round-trip recovery via lenses (Chapter 7).

Restrict is the workhorse of panproto. With the complement, it is lossless and reversible. It runs as a fused single-pass BFS.

13.3.2 Extend

Push data forward along the mapping. Where restrict drops unmapped nodes, extend keeps everything and remaps it:

  • Nodes with remapped types get new types
  • Edges with remapped endpoints get new endpoints
  • When multiple source vertices map to the same target (many-to-one), their data is merged

Extend produces the “most general” target instance consistent with the source data. Think of it as a UNION: everything from the source appears in the target, possibly combined.

Warning

Unlike restrict, extend is not reversible. There is no complement mechanism. Use extend when you want to push data forward without needing to come back.

13.3.3 Complete

For each target vertex, compute the Cartesian product of all source data that maps to it. In SQL terms, this is a JOIN across all the source tables that feed into a given target table.

Complete produces the “most conservative” target instance: one that contains only data provably consistent across all sources. Think of it as asking, “what data is consistent with every possible interpretation of the mapping?”

Warning

Complete can produce exponentially large instances (Cartesian products). panproto gates this with a configurable size limit.

These three operations are mathematically related. Restrict and extend are inverses in a precise sense. Restrict and complete form another inverse pair. This means extend(restrict(data)) always contains the original data (you never lose information), and restrict(complete(data)) is always contained in the original data (complete never invents data that wasn’t there).1

13.4 Schema merge

Given two schemas that share some structure, panproto computes their merge: the minimal schema containing both, with shared elements unified.2

let (integrated, left_morph, right_morph) = schema_merge(
    &atproto_schema,
    &mastodon_schema,
    &overlap,
)?;

The overlap specifies which vertices and edges are identified between the two schemas. The merge produces the smallest schema that both original schemas map into.

13.4.1 Automatic overlap discovery

You don’t need to specify the overlap manually. discover_overlap uses the homomorphism search to find the largest shared sub-schema:

let overlap = discover_overlap(&atproto_schema, &mastodon_schema);
let (integrated, _, _) = schema_merge(&atproto_schema, &mastodon_schema, &overlap)?;

This finds the largest schema that both inputs share and uses it as the gluing point for the merge.

CautionExercise: Limitations of automatic discovery

discover_overlap finds the largest shared sub-schema by homomorphism search. If two schemas share structure under different names (e.g., author vs. creator), what might the search miss?

discover_overlap matches vertex kinds and edge patterns. If two schemas share structure under different names, the kind-based search may still find a match (both are object vertices with similar outgoing edges). But if the kinds also differ, the search misses it entirely. In that case, provide an explicit overlap mapping that identifies the corresponding vertices by hand.

13.5 Constraint enforcement

When extend pushes data forward, the target schema’s equational constraints may not be satisfied. panproto enforces constraints by iterating:3

  1. Find active triggers: pattern matches in the instance that violate a constraint
  2. Apply the consequence: add missing rows or merge values
  3. Repeat until no triggers remain (fixpoint) or an iteration limit is reached

The enforcement terminates for panproto’s equational theories (purely algebraic, with no recursive constraints). For schemas with recursive types, an iteration bound prevents non-termination.

13.6 GAT-validated auto-migration

Auto-derived migrations are subject to the same equation verification as hand-written ones. When schema add computes a migration automatically, it checks that the migration preserves all equations in the theory.

If the source theory has an equation like src(id(v)) = v (from ThReflexiveGraph) and the migration maps src to src' and id to id', then the target theory must satisfy src'(id'(v)) = v. panproto verifies this by applying the migration mapping to both sides of every equation and checking that the results are equal in the target theory.

This catches a class of bugs that would otherwise be invisible: migrations that look correct at the vertex-and-edge level but silently break structural invariants. A migration that maps a reflexive graph to a non-reflexive one would pass a basic vertex/edge check but fail the equation preservation check, because the target has no id operation.

The verification is fast (linear in the number of equations) and runs automatically during schema add. Failure looks like:

error: migration violates equation
  equation: src(id(v)) = v
  mapped to: src'(???) (operation 'id' has no image under the migration)
  hint: the target schema may be missing a required operation
CautionExercise: Soundness vs. completeness

Equation preservation checking is sound: if it passes, the equations hold. But it only checks equations syntactically by substitution. Can two equivalent but differently written equations fail to be recognized as equal?

In principle, yes. The checker compares equation sides by syntactic equality after substitution. Two expressions that are semantically equal but syntactically different (e.g., mul(a, b) vs. mul(b, a) in a commutative theory) might not be recognized as equal. In practice, panproto’s normal-form representation of equations (which canonicalizes variable ordering and operation application) makes this rare. If you suspect a false negative, inspect the mapped equations with schema show --equations.

13.7 The common case is simple

For the common case (evolving a schema within the same protocol), you don’t need any of this. schema add in the VCS (Chapter 10) automatically derives migrations for additions, removals, and renames. The homomorphism search, the three migration strategies, and schema merge are for the hard cases: cross-protocol translation, schema integration from independent sources, and algebraic data integration patterns.

13.8 From migrations to protolenses

The migration discovery machinery produces one migration between two specific schemas. But the same structural change often applies to many schemas. Renaming createdAt to created_at is the same operation whether the schema has 5 fields or 50. Writing a separate migration for each schema is tedious and error-prone.

Chapter 17 shows how to lift the results of this chapter into protolens chains: schema-independent objects that can be instantiated against any schema satisfying a precondition. The pipeline is: migration discovery (this chapter) produces a mapping; factorization decomposes it into elementary operations; each operation becomes a protolens constructor. The result is a reusable chain that encodes the pattern of the migration, not a single instance of it.

In practice, ProtolensChainHandle.autoGenerate() (TypeScript) or ProtolensChainHandle.auto_generate() (Python) runs the full pipeline under the hood (migration search, quality scoring, constraint propagation) and then lifts the result into a protolens chain. Understanding the machinery here helps you diagnose failures and ambiguities in auto-generation, and guides you when providing manual hints or complement specs.

Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217: 31–51. https://arxiv.org/abs/1009.1166.

  1. In the categorical literature (Spivak 2012), restrict is the pullback functor \(\Delta_F\), extend is the left Kan extension \(\Sigma_F\), and complete is the right Kan extension \(\Pi_F\). They satisfy the adjunction \(\Sigma_F \dashv \Delta_F \dashv \Pi_F\). See Appendix A.↩︎

  2. The merge operation is formally a pushout in the category of schemas. The overlap specifies a shared sub-schema, and the merge is the universal construction that glues the two schemas together. See Appendix A.↩︎

  3. This iterative enforcement is an instance of the chase algorithm from database theory, adapted to panproto’s equational theories. See Appendix A.↩︎