13 Automatic Migration Discovery

You have two schema versions where text became body and createdAt became publishedAt. Writing out the migration by hand is tedious. Wouldn’t it be nice if the system could figure out which fields correspond? That’s what this chapter covers.

A migration $F: S_1 \to S_2$ is a structure-preserving map (Chapter 4 for the definition). Finding one is a constraint satisfaction problem:

Variables: one per vertex in $S_1$
Domains: for each variable, vertices in $S_2$ with compatible kind
Constraints: for every edge $e: v_1 \to v_2$ in $S_1$, there must exist an edge $F(e): F(v_1) \to F(v_2)$ in $S_2$ with compatible kind

panproto solves this with backtracking search, optimized with two heuristics.

13.1 Backtracking search

The search uses MRV (Minimum Remaining Values) to assign the most constrained vertex first. A vertex with only one compatible target gets assigned immediately. A vertex with many options is deferred.

Forward checking propagates constraints after each assignment. If any neighbor’s domain becomes empty, backtrack immediately without exploring further.

let results = find_morphisms(&old_schema, &new_schema, &SearchOptions {
    monic: true,   // injective: no two source vertices share a target
    ..Default::default()
});

The search returns all valid migrations, sorted by quality score. Quality combines:

Name similarity: edit distance between mapped vertex IDs
Edge name preservation: fraction of edges that keep the same label

The highest-quality migration is the most natural one.

13.2 Worked example: renaming fields

The old schema has vertices {post, post.text, post.createdAt}. The new has {post, post.body, post.publishedAt}. A manual migration would require specifying which field maps to which.

With find_best_morphism, panproto discovers this automatically:

post maps to post (same kind, same structure)
post.text maps to post.body (both string, connected to post via prop edge)
post.createdAt maps to post.publishedAt (both datetime, same pattern)

The quality score penalizes name mismatches. If multiple migrations exist, the one preserving the most names wins.

Exercise: Breaking ties

If two candidate migrations have the same quality score, which one does panproto pick?

Answer

The search is deterministic: it iterates vertices in ID order, so ties are broken by vertex ordering. The first valid morphism found in iteration order wins. If you need a specific migration in ambiguous cases, provide manual hints via the hints parameter.

13.3 Three ways to migrate data along a discovered mapping

A migration mapping $F$ supports three operations for moving data. Each is mathematically related and captures a different perspective on what the target instance should contain.

13.3.1 Restrict

Project data backward from the new schema onto the old schema, keeping only the parts that the old schema expects. This is what Chapter 9 described in detail. The complement captures what was dropped, enabling round-trip recovery via lenses (Chapter 7).

Restrict is the workhorse of panproto. With the complement, it is lossless and reversible. It runs as a fused single-pass BFS.

13.3.2 Extend

Push data forward along the mapping. Where restrict drops unmapped nodes, extend keeps everything and remaps it:

Nodes with remapped types get new types
Edges with remapped endpoints get new endpoints
When multiple source vertices map to the same target (many-to-one), their data is merged

Extend produces the “most general” target instance consistent with the source data. Think of it as a UNION: everything from the source appears in the target, possibly combined.

Warning

Unlike restrict, extend is not reversible. There is no complement mechanism. Use extend when you want to push data forward without needing to come back.

13.3.3 Complete

For each target vertex, compute the Cartesian product of all source data that maps to it. In SQL terms, this is a JOIN across all the source tables that feed into a given target table.

Complete produces the “most conservative” target instance: one that contains only data provably consistent across all sources. Think of it as asking, “what data is consistent with every possible interpretation of the mapping?”

Warning

Complete can produce exponentially large instances (Cartesian products). panproto gates this with a configurable size limit.

These three operations are mathematically related. Restrict and extend are inverses in a precise sense. Restrict and complete form another inverse pair. This means extend(restrict(data)) always contains the original data (you never lose information), and restrict(complete(data)) is always contained in the original data (complete never invents data that wasn’t there).¹

13.4 Schema merge

Given two schemas that share some structure, panproto computes their merge: the minimal schema containing both, with shared elements unified.²

let (integrated, left_morph, right_morph) = schema_merge(
    &atproto_schema,
    &mastodon_schema,
    &overlap,
)?;

The overlap specifies which vertices and edges are identified between the two schemas. The merge produces the smallest schema that both original schemas map into.

13.4.1 Automatic overlap discovery

You don’t need to specify the overlap manually. discover_overlap uses the homomorphism search to find the largest shared sub-schema:

let overlap = discover_overlap(&atproto_schema, &mastodon_schema);
let (integrated, _, _) = schema_merge(&atproto_schema, &mastodon_schema, &overlap)?;

This finds the largest schema that both inputs share and uses it as the gluing point for the merge.

Exercise: Limitations of automatic discovery

discover_overlap finds the largest shared sub-schema by homomorphism search. If two schemas share structure under different names (e.g., author vs. creator), what might the search miss?

Answer

discover_overlap matches vertex kinds and edge patterns. If two schemas share structure under different names, the kind-based search may still find a match (both are object vertices with similar outgoing edges). But if the kinds also differ, the search misses it entirely. In that case, provide an explicit overlap mapping that identifies the corresponding vertices by hand.

13.5 Constraint enforcement

When extend pushes data forward, the target schema’s equational constraints may not be satisfied. panproto enforces constraints by iterating:³

Find active triggers: pattern matches in the instance that violate a constraint
Apply the consequence: add missing rows or merge values
Repeat until no triggers remain (fixpoint) or an iteration limit is reached

The enforcement terminates for panproto’s equational theories (purely algebraic, with no recursive constraints). For schemas with recursive types, an iteration bound prevents non-termination.

13.6 GAT-validated auto-migration

Auto-derived migrations are subject to the same equation verification as hand-written ones. When schema add computes a migration automatically, it checks that the migration preserves all equations in the theory.

If the source theory has an equation like src(id(v)) = v (from ThReflexiveGraph) and the migration maps src to src' and id to id', then the target theory must satisfy src'(id'(v)) = v. panproto verifies this by applying the migration mapping to both sides of every equation and checking that the results are equal in the target theory.

This catches a class of bugs that would otherwise be invisible: migrations that look correct at the vertex-and-edge level but silently break structural invariants. A migration that maps a reflexive graph to a non-reflexive one would pass a basic vertex/edge check but fail the equation preservation check, because the target has no id operation.

The verification is fast (linear in the number of equations) and runs automatically during schema add. Failure looks like:

error: migration violates equation
  equation: src(id(v)) = v
  mapped to: src'(???) (operation 'id' has no image under the migration)
  hint: the target schema may be missing a required operation

Exercise: Soundness vs. completeness

Equation preservation checking is sound: if it passes, the equations hold. But it only checks equations syntactically by substitution. Can two equivalent but differently written equations fail to be recognized as equal?

Answer

In principle, yes. The checker compares equation sides by syntactic equality after substitution. Two expressions that are semantically equal but syntactically different (e.g., mul(a, b) vs. mul(b, a) in a commutative theory) might not be recognized as equal. In practice, panproto’s normal-form representation of equations (which canonicalizes variable ordering and operation application) makes this rare. If you suspect a false negative, inspect the mapped equations with schema show --equations.

13.7 The common case is simple

For the common case (evolving a schema within the same protocol), you don’t need any of this. schema add in the VCS (Chapter 10) automatically derives migrations for additions, removals, and renames. The homomorphism search, the three migration strategies, and schema merge are for the hard cases: cross-protocol translation, schema integration from independent sources, and algebraic data integration patterns.

13.8 From migrations to protolenses

The migration discovery machinery produces one migration between two specific schemas. But the same structural change often applies to many schemas. Renaming createdAt to created_at is the same operation whether the schema has 5 fields or 50. Writing a separate migration for each schema is tedious and error-prone.

Chapter 17 shows how to lift the results of this chapter into protolens chains: schema-independent objects that can be instantiated against any schema satisfying a precondition. The pipeline is: migration discovery (this chapter) produces a mapping; factorization decomposes it into elementary operations; each operation becomes a protolens constructor. The result is a reusable chain that encodes the pattern of the migration, not a single instance of it.

In practice, ProtolensChainHandle.autoGenerate() (TypeScript) or ProtolensChainHandle.auto_generate() (Python) runs the full pipeline under the hood (migration search, quality scoring, constraint propagation) and then lifts the result into a protolens chain. Understanding the machinery here helps you diagnose failures and ambiguities in auto-generation, and guides you when providing manual hints or complement specs.

Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217: 31–51. https://arxiv.org/abs/1009.1166.

In the categorical literature (Spivak 2012), restrict is the pullback functor $\Delta_F$, extend is the left Kan extension $\Sigma_F$, and complete is the right Kan extension $\Pi_F$. They satisfy the adjunction $\Sigma_F \dashv \Delta_F \dashv \Pi_F$. See Appendix A.↩︎
The merge operation is formally a pushout in the category of schemas. The overlap specifies a shared sub-schema, and the merge is the universal construction that glues the two schemas together. See Appendix A.↩︎
This iterative enforcement is an instance of the chase algorithm from database theory, adapted to panproto’s equational theories. See Appendix A.↩︎

# Automatic Migration Discovery {#sec-auto-migration} You have two schema versions where `text` became `body` and `createdAt` became `publishedAt`. Writing out the migration by hand is tedious. Wouldn't it be nice if the system could figure out which fields correspond? That's what this chapter covers. A migration $F: S_1 \to S_2$ is a structure-preserving map (@sec-protocols-as-parameters for the definition). Finding one is a constraint satisfaction problem: - **Variables**: one per vertex in $S_1$ - **Domains**: for each variable, vertices in $S_2$ with compatible kind - **Constraints**: for every edge $e: v_1 \to v_2$ in $S_1$, there must exist an edge $F(e): F(v_1) \to F(v_2)$ in $S_2$ with compatible kind panproto solves this with backtracking search, optimized with two heuristics. ## Backtracking search The search uses **MRV** (Minimum Remaining Values) to assign the most constrained vertex first. A vertex with only one compatible target gets assigned immediately. A vertex with many options is deferred. **Forward checking** propagates constraints after each assignment. If any neighbor's domain becomes empty, backtrack immediately without exploring further. ```rust let results = find_morphisms(&old_schema, &new_schema, &SearchOptions { monic: true, // injective: no two source vertices share a target ..Default::default() }); ``` The search returns all valid migrations, sorted by quality score. Quality combines: - **Name similarity**: edit distance between mapped vertex IDs - **Edge name preservation**: fraction of edges that keep the same label The highest-quality migration is the most natural one. ## Worked example: renaming fields The old schema has vertices `{post, post.text, post.createdAt}`. The new has `{post, post.body, post.publishedAt}`. A manual migration would require specifying which field maps to which. With `find_best_morphism`, panproto discovers this automatically: - `post` maps to `post` (same kind, same structure) - `post.text` maps to `post.body` (both `string`, connected to `post` via `prop` edge) - `post.createdAt` maps to `post.publishedAt` (both `datetime`, same pattern) The quality score penalizes name mismatches. If multiple migrations exist, the one preserving the most names wins. :::{.callout-caution} ## Exercise: Breaking ties If two candidate migrations have the same quality score, which one does panproto pick? ::: ::: {.callout-tip collapse=true} ## Answer The search is deterministic: it iterates vertices in ID order, so ties are broken by vertex ordering. The first valid morphism found in iteration order wins. If you need a specific migration in ambiguous cases, provide manual hints via the `hints` parameter. ::: ## Three ways to migrate data along a discovered mapping {#sec-three-migrations} A migration mapping $F$ supports three operations for moving data. Each is mathematically related and captures a different perspective on what the target instance should contain. ### Restrict Project data backward from the new schema onto the old schema, keeping only the parts that the old schema expects. This is what @sec-lifting-data described in detail. The complement captures what was dropped, enabling round-trip recovery via lenses (@sec-lenses). Restrict is the workhorse of panproto. With the complement, it is lossless and reversible. It runs as a fused single-pass BFS. ### Extend Push data *forward* along the mapping. Where restrict drops unmapped nodes, extend keeps everything and remaps it: - Nodes with remapped types get new types - Edges with remapped endpoints get new endpoints - When multiple source vertices map to the same target (many-to-one), their data is merged Extend produces the "most general" target instance consistent with the source data. Think of it as a UNION: everything from the source appears in the target, possibly combined. :::{.callout-warning} Unlike restrict, extend is not reversible. There is no complement mechanism. Use extend when you want to push data forward without needing to come back. ::: ### Complete For each target vertex, compute the Cartesian product of all source data that maps to it. In SQL terms, this is a JOIN across all the source tables that feed into a given target table. Complete produces the "most conservative" target instance: one that contains only data provably consistent across all sources. Think of it as asking, "what data is consistent with *every* possible interpretation of the mapping?" :::{.callout-warning} Complete can produce exponentially large instances (Cartesian products). panproto gates this with a configurable size limit. ::: These three operations are mathematically related. Restrict and extend are inverses in a precise sense. Restrict and complete form another inverse pair. This means `extend(restrict(data))` always contains the original data (you never lose information), and `restrict(complete(data))` is always contained in the original data (complete never invents data that wasn't there).^[In the categorical literature [@spivak2012], restrict is the pullback functor $\Delta_F$, extend is the left Kan extension $\Sigma_F$, and complete is the right Kan extension $\Pi_F$. They satisfy the adjunction $\Sigma_F \dashv \Delta_F \dashv \Pi_F$. See [Appendix A](../appendices/A-formal-foundations.qmd).] ## Schema merge {#sec-schema-merge} Given two schemas that share some structure, panproto computes their **merge**: the minimal schema containing both, with shared elements unified.^[The merge operation is formally a *pushout* in the category of schemas. The overlap specifies a shared sub-schema, and the merge is the universal construction that glues the two schemas together. See [Appendix A](../appendices/A-formal-foundations.qmd).] ```rust let (integrated, left_morph, right_morph) = schema_merge( &atproto_schema, &mastodon_schema, &overlap, )?; ``` The overlap specifies which vertices and edges are identified between the two schemas. The merge produces the smallest schema that both original schemas map into. ### Automatic overlap discovery You don't need to specify the overlap manually. `discover_overlap` uses the homomorphism search to find the largest shared sub-schema: ```rust let overlap = discover_overlap(&atproto_schema, &mastodon_schema); let (integrated, _, _) = schema_merge(&atproto_schema, &mastodon_schema, &overlap)?; ``` This finds the largest schema that both inputs share and uses it as the gluing point for the merge. :::{.callout-caution} ## Exercise: Limitations of automatic discovery `discover_overlap` finds the *largest* shared sub-schema by homomorphism search. If two schemas share structure under different names (e.g., `author` vs. `creator`), what might the search miss? ::: ::: {.callout-tip collapse=true} ## Answer `discover_overlap` matches vertex kinds and edge patterns. If two schemas share structure under different names, the kind-based search may still find a match (both are `object` vertices with similar outgoing edges). But if the kinds also differ, the search misses it entirely. In that case, provide an explicit overlap mapping that identifies the corresponding vertices by hand. ::: ## Constraint enforcement {#sec-constraint-enforcement} When extend pushes data forward, the target schema's equational constraints may not be satisfied. panproto enforces constraints by iterating:^[This iterative enforcement is an instance of the *chase algorithm* from database theory, adapted to panproto's equational theories. See [Appendix A](../appendices/A-formal-foundations.qmd).] 1. Find active *triggers*: pattern matches in the instance that violate a constraint 2. Apply the *consequence*: add missing rows or merge values 3. Repeat until no triggers remain (fixpoint) or an iteration limit is reached The enforcement terminates for panproto's equational theories (purely algebraic, with no recursive constraints). For schemas with recursive types, an iteration bound prevents non-termination. ## GAT-validated auto-migration Auto-derived migrations are subject to the same equation verification as hand-written ones. When `schema add` computes a migration automatically, it checks that the migration **preserves all equations** in the theory. If the source theory has an equation like `src(id(v)) = v` (from ThReflexiveGraph) and the migration maps `src` to `src'` and `id` to `id'`, then the target theory must satisfy `src'(id'(v)) = v`. panproto verifies this by applying the migration mapping to both sides of every equation and checking that the results are equal in the target theory. This catches a class of bugs that would otherwise be invisible: migrations that look correct at the vertex-and-edge level but silently break structural invariants. A migration that maps a reflexive graph to a non-reflexive one would pass a basic vertex/edge check but fail the equation preservation check, because the target has no `id` operation. The verification is fast (linear in the number of equations) and runs automatically during `schema add`. Failure looks like: ``` error: migration violates equation equation: src(id(v)) = v mapped to: src'(???) (operation 'id' has no image under the migration) hint: the target schema may be missing a required operation ``` :::{.callout-caution} ## Exercise: Soundness vs. completeness Equation preservation checking is sound: if it passes, the equations hold. But it only checks equations *syntactically* by substitution. Can two equivalent but differently written equations fail to be recognized as equal? ::: ::: {.callout-tip collapse=true} ## Answer In principle, yes. The checker compares equation sides by syntactic equality after substitution. Two expressions that are semantically equal but syntactically different (e.g., `mul(a, b)` vs. `mul(b, a)` in a commutative theory) might not be recognized as equal. In practice, panproto's normal-form representation of equations (which canonicalizes variable ordering and operation application) makes this rare. If you suspect a false negative, inspect the mapped equations with `schema show --equations`. ::: ## The common case is simple For the common case (evolving a schema within the same protocol), you don't need any of this. `schema add` in the VCS (@sec-version-control) automatically derives migrations for additions, removals, and renames. The homomorphism search, the three migration strategies, and schema merge are for the hard cases: cross-protocol translation, schema integration from independent sources, and algebraic data integration patterns. ## From migrations to protolenses {#sec-migrations-to-protolenses} The migration discovery machinery produces *one migration between two specific schemas*. But the same structural change often applies to many schemas. Renaming `createdAt` to `created_at` is the same operation whether the schema has 5 fields or 50. Writing a separate migration for each schema is tedious and error-prone. @sec-auto-lens shows how to lift the results of this chapter into **protolens chains**: schema-independent objects that can be instantiated against any schema satisfying a precondition. The pipeline is: migration discovery (this chapter) produces a mapping; factorization decomposes it into elementary operations; each operation becomes a protolens constructor. The result is a reusable chain that encodes the *pattern* of the migration, not a single instance of it. In practice, `ProtolensChainHandle.autoGenerate()` (TypeScript) or `ProtolensChainHandle.auto_generate()` (Python) runs the full pipeline under the hood (migration search, quality scoring, constraint propagation) and then lifts the result into a protolens chain. Understanding the machinery here helps you diagnose failures and ambiguities in auto-generation, and guides you when providing manual hints or complement specs.