5 Your First Migration

Time to migrate some data. We will take the Bluesky tags change from Chapter 1—adding an optional array of strings to app.bsky.feed.post—and work through it end to end: defining the two schema versions, writing the migration morphism by hand, compiling it, and applying it to a concrete record. Along the way, we will see what the mathematical vocabulary from the previous chapters actually buys us in practice.

5.1 Two versions of a post

Version 1 of the post schema has two fields: text (the post body) and createdAt (a datetime timestamp).

import { Panproto } from '@panproto/core';

const panproto = await Panproto.init();
const atproto = panproto.protocol('atproto');

// Version 1: a post record whose body has two string properties.
const schemaV1 = atproto.schema()
  .vertex('post', 'record', { nsid: 'app.bsky.feed.post' })
  .vertex('post:body', 'object')
  .vertex('post:body.text', 'string')
  .vertex('post:body.createdAt', 'string')
  .edge('post', 'post:body', 'record-schema')
  .edge('post:body', 'post:body.text', 'prop', { name: 'text' })
  .edge('post:body', 'post:body.createdAt', 'prop', { name: 'createdAt' })
  .build();

Viewed through the lens of the ATProto schema theory \(\mathrm{colimit}(\text{ThGraph}, \text{ThConstraint}, \text{ThMulti})\), version 1 is a directed graph. In the simplified picture used for the rest of this chapter, which folds the intermediate post:body object vertex into Post, it has three vertices (Post, Text, CreatedAt, each annotated with a kind) and two edges (\(\text{text}\colon \text{Post} \to \text{Text}\) and \(\text{createdAt}\colon \text{Post} \to \text{CreatedAt}\), each annotated with multiplicity and constraints).

Version 2 adds tags:

// Version 2: version 1 plus a tags array whose items are strings.
const schemaV2 = atproto.schema()
  .vertex('post', 'record', { nsid: 'app.bsky.feed.post' })
  .vertex('post:body', 'object')
  .vertex('post:body.text', 'string')
  .vertex('post:body.createdAt', 'string')
  .vertex('post:body.tags', 'array')
  .vertex('post:body.tags:item', 'string')
  .edge('post', 'post:body', 'record-schema')
  .edge('post:body', 'post:body.text', 'prop', { name: 'text' })
  .edge('post:body', 'post:body.createdAt', 'prop', { name: 'createdAt' })
  .edge('post:body', 'post:body.tags', 'prop', { name: 'tags' })
  .edge('post:body.tags', 'post:body.tags:item', 'items')
  .build();

One new vertex (Tags, with kind array<string>) and one new edge (\(\text{tags}\colon \text{Post} \to \text{Tags}\), with multiplicity optional). Everything else is unchanged.

The question is: given a record conforming to v2, how do we produce a record conforming to v1?

Direction of migration

We are migrating backward: from v2 data to v1 compatibility. This is the common direction in practice. A client that only understands v1 receives a v2 record and needs to make sense of it. The forward direction (v1 data → v2) requires defaults or explicit construction of the new fields, which is a different operation (extend in the set-valued map world, a fill operation in the W-type world).¹

5.2 The migration morphism

To express the relationship between the two versions precisely, we need a schema morphism: a pair of maps \((f_V\colon V_1 \to V_2,\; f_E\colon E_1 \to E_2)\) that preserves connectivity. If edge \(e\) connects vertices \(u\) and \(v\) in \(S_1\), then \(f_E(e)\) must connect \(f_V(u)\) and \(f_V(v)\) in \(S_2\).

const migration = panproto.migration(schemaV1, schemaV2)
  .map('post', 'post')
  .map('post:body', 'post:body')
  .map('post:body.text', 'post:body.text')
  .map('post:body.createdAt', 'post:body.createdAt')
  .compile();

The vertex map \(f_V\) sends each v1 vertex to its v2 counterpart: \(\text{Post} \mapsto \text{Post}\), \(\text{Text} \mapsto \text{Text}\), \(\text{CreatedAt} \mapsto \text{CreatedAt}\).

The edge map \(f_E\) does the same for edges: \(\text{text} \mapsto \text{text}\), \(\text{createdAt} \mapsto \text{createdAt}\).

The Tags vertex and tags edge do not appear in the migration. They exist in \(S_2\) but have no preimage in \(S_1\). That is the structural encoding of “v2 added a new field that v1 does not know about.”

This particular migration morphism is an injection: every v1 element maps to a distinct v2 element. Not all migrations are injections. You can merge vertices, collapse edges, remap types. But the “add a field” pattern always produces one.

Exercise

The migration morphism above is injective. Give an example of a non-injective migration morphism. What schema change would it represent?

Solution

Suppose v1 has two string types, FirstName and LastName, each with its own edge from Person. v2 merges them into a single FullName vertex. The vertex map sends both FirstName and LastName to FullName. This represents collapsing two fields into one, which requires a merge function to combine the data.
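Injectivity is easy to test mechanically. The sketch below is independent of panproto: the `SketchVertexMap` type and the `isInjective` helper are hypothetical names introduced only for illustration, and the vertex names follow the simplified labels used in the text.

```typescript
// Hypothetical encoding, not panproto's API: a vertex map as a plain
// object from source-vertex name to target-vertex name.
type SketchVertexMap = { [sourceVertex: string]: string };

// A map is injective when no two source vertices share a target.
function isInjective(vertexMap: SketchVertexMap): boolean {
  const seenTargets: string[] = [];
  for (const source of Object.keys(vertexMap)) {
    const target = vertexMap[source];
    if (seenTargets.indexOf(target) !== -1) {
      return false; // two sources collapse onto the same target
    }
    seenTargets.push(target);
  }
  return true;
}

// The "add a field" vertex map from this section.
const addTagsMap: SketchVertexMap = {
  Post: 'Post',
  Text: 'Text',
  CreatedAt: 'CreatedAt',
};

// The merge described in the solution: two name vertices become one.
const mergeNamesMap: SketchVertexMap = {
  Person: 'Person',
  FirstName: 'FullName',
  LastName: 'FullName',
};

console.log(isInjective(addTagsMap)); // true
console.log(isInjective(mergeNamesMap)); // false
```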

5.2.1 What the maps mean

The vertex map says: “for each type in the old schema, here is the corresponding type in the new schema.” If a v1 vertex maps to a v2 vertex, the engine knows that data anchored to that vertex type can survive the migration. If a v2 vertex has no preimage, data anchored to it must be introduced (going forward) or discarded (going backward).

The edge map says: “for each field in the old schema, here is the corresponding field in the new schema.” This map must be compatible with the vertex map: if edge \(e\) goes from \(u\) to \(v\), then \(f_E(e)\) must go from \(f_V(u)\) to \(f_V(v)\). This compatibility condition ensures the migration preserves graph structure.²
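The compatibility condition can be checked edge by edge. The following is a minimal sketch, assuming a plain-object encoding of the simplified graphs. The types `SketchEdge`, `SketchSchema`, and `SketchMorphism` and the function `incompatibleEdges` are hypothetical and do not reflect how panproto represents or validates morphisms internally.

```typescript
// Hypothetical graph encoding for illustration; not panproto's API.
interface SketchEdge {
  source: string;
  target: string;
}
interface SketchSchema {
  edges: { [edgeName: string]: SketchEdge };
}
interface SketchMorphism {
  vertexMap: { [v1Vertex: string]: string };
  edgeMap: { [v1Edge: string]: string };
}

// Own-property lookup, so inherited keys such as "toString" never match.
function lookup<T>(table: { [key: string]: T }, key: string): T | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Names of the v1 edges whose image breaks connectivity. This sketch
// assumes a total morphism, so an unmapped edge also counts as a failure.
function incompatibleEdges(s1: SketchSchema, s2: SketchSchema, f: SketchMorphism): string[] {
  return Object.keys(s1.edges).filter((edgeName) => {
    const edge = s1.edges[edgeName];
    const imageName = lookup(f.edgeMap, edgeName);
    const image = imageName === undefined ? undefined : lookup(s2.edges, imageName);
    if (image === undefined) {
      return true;
    }
    return (
      image.source !== lookup(f.vertexMap, edge.source) ||
      image.target !== lookup(f.vertexMap, edge.target)
    );
  });
}

const v1Graph: SketchSchema = {
  edges: {
    text: { source: 'Post', target: 'Text' },
    createdAt: { source: 'Post', target: 'CreatedAt' },
  },
};
const v2Graph: SketchSchema = {
  edges: {
    text: { source: 'Post', target: 'Text' },
    createdAt: { source: 'Post', target: 'CreatedAt' },
    tags: { source: 'Post', target: 'Tags' },
  },
};

const identityVertices = { Post: 'Post', Text: 'Text', CreatedAt: 'CreatedAt' };
const addTags: SketchMorphism = {
  vertexMap: identityVertices,
  edgeMap: { text: 'text', createdAt: 'createdAt' },
};
// Deliberately wrong: sends the text edge to tags, whose target is Tags, not Text.
const broken: SketchMorphism = {
  vertexMap: identityVertices,
  edgeMap: { text: 'tags', createdAt: 'createdAt' },
};

console.log(incompatibleEdges(v1Graph, v2Graph, addTags)); // []
console.log(incompatibleEdges(v1Graph, v2Graph, broken)); // [ 'text' ]
```

An empty result means every edge image runs between the images of its endpoints, which is the condition stated above.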

5.3 Compiling and lifting

A migration morphism is a declaration of intent. Turning it into an executable operation requires one more step: compilation, where the engine pre-computes the data structures needed for efficient per-record migration.

const inputRecord = {
  text: 'Hello, world!',
  createdAt: '2024-01-15T12:00:00Z',
};

const result = migration.lift(inputRecord);
// result.data: { text: 'Hello, world!', createdAt: '2024-01-15T12:00:00Z' }
// No tags field appears in the output: the tags vertex has no preimage in v1.

Compilation produces a surviving vertex set (which v2 vertices are in the image of \(f_V\)), a surviving edge set, remap tables for translating v2 names back to v1 names, and pre-checked constraint compatibility data. The lift function takes this compiled migration and a v2 record and produces a v1 record.
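To make the compiled artifacts concrete, here is a minimal sketch that derives the surviving sets and the edge remap table from the two maps. The `CompiledSketch` shape and the `compileSketch` function are hypothetical. The real compiled form may differ, and this sketch omits the constraint compatibility data entirely.

```typescript
// Hypothetical shapes for illustration; panproto's compiled form may differ.
type NameMap = { [v1Name: string]: string };

interface CompiledSketch {
  survivingVertices: string[]; // v2 vertices in the image of f_V
  survivingEdges: string[]; // v2 edges in the image of f_E
  edgeRemap: { [v2Edge: string]: string }; // v2 edge name -> v1 edge name
}

// Collects the distinct values of a map, in first-seen order.
function imageOf(map: NameMap): string[] {
  const image: string[] = [];
  Object.keys(map).forEach((key) => {
    if (image.indexOf(map[key]) === -1) {
      image.push(map[key]);
    }
  });
  return image;
}

// Assumes f_E is injective, so inverting it gives a well-defined remap.
function compileSketch(vertexMap: NameMap, edgeMap: NameMap): CompiledSketch {
  const edgeRemap: { [v2Edge: string]: string } = {};
  Object.keys(edgeMap).forEach((v1Edge) => {
    edgeRemap[edgeMap[v1Edge]] = v1Edge;
  });
  return {
    survivingVertices: imageOf(vertexMap),
    survivingEdges: imageOf(edgeMap),
    edgeRemap,
  };
}

const compiled = compileSketch(
  { Post: 'Post', Text: 'Text', CreatedAt: 'CreatedAt' },
  { text: 'text', createdAt: 'createdAt' }
);
console.log(compiled.survivingVertices); // [ 'Post', 'Text', 'CreatedAt' ]
console.log(compiled.survivingEdges); // [ 'text', 'createdAt' ]
```

Tags is missing from both surviving sets, which is all the lift needs to know in order to drop it.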

5.4 Tracing the lift step by step

Suppose we have a v2 record with the new tags field populated:

{
  "text": "Hello, world!",
  "createdAt": "2025-01-15T12:00:00Z",
  "tags": ["greeting", "test"]
}

As a W-type instance, this is a tree with a root node anchored to Post and three children, each anchored to a different vertex and reached via a different edge:

graph TD
    R["Post (root)"]
    T["Text: 'Hello, world!'"]
    C["CreatedAt: '2025-01-15T12:00:00Z'"]
    G["Tags: ['greeting', 'test']"]
    R -->|text| T
    R -->|createdAt| C
    R -->|tags| G

Figure 5.1: The v2 record as a W-type tree.

Because ATProto uses W-type instances, the engine runs the tree surgery pipeline. For this simple example, only the first step changes the tree, and the second merely confirms that the remaining nodes are still reachable:

Step 1 (signature restriction): Walk every node and check whether its anchor vertex is in the surviving set \(\{\text{Post}, \text{Text}, \text{CreatedAt}\}\). The Tags node is not, so it is removed.

Step 2 (reachability BFS): Both remaining children are direct children of the root, so they are trivially reachable.

Steps 3 through 5 (ancestor contraction, edge resolution, fan reconstruction) are no-ops here because the removed Tags node was a leaf with no surviving descendants, the edge labels are identity-mapped, and the fan structure is already valid. The chapter on data lifting covers each step in full detail.

The result:

{
  "text": "Hello, world!",
  "createdAt": "2025-01-15T12:00:00Z"
}

The tags field is gone. The surviving fields are untouched.
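The pruning traced above can be imitated in a few lines. This is a sketch under strong assumptions: the `SketchNode` encoding and the helpers `restrictTree`, `toRecord`, and `leaf` are hypothetical, the root is assumed to survive, and the whole subtree under a removed node is dropped, so ancestor contraction, edge resolution, and fan reconstruction are not modeled.

```typescript
// Hypothetical tree encoding for illustration; not panproto's instance format.
interface SketchNode {
  anchor: string; // schema vertex the node is anchored to
  value: string | string[] | null; // leaf payload; null for the root
  children: { edge: string; node: SketchNode }[];
}

// Keeps only children anchored to a surviving vertex. Dropping the whole
// subtree under a removed node means ancestor contraction is not modeled.
function restrictTree(node: SketchNode, surviving: string[]): SketchNode {
  return {
    anchor: node.anchor,
    value: node.value,
    children: node.children
      .filter((child) => surviving.indexOf(child.node.anchor) !== -1)
      .map((child) => ({ edge: child.edge, node: restrictTree(child.node, surviving) })),
  };
}

// Flattens a one-level tree back into a JSON-style record.
function toRecord(root: SketchNode): { [field: string]: string | string[] | null } {
  const record: { [field: string]: string | string[] | null } = {};
  root.children.forEach((child) => {
    record[child.edge] = child.node.value;
  });
  return record;
}

function leaf(anchor: string, value: string | string[]): SketchNode {
  return { anchor, value, children: [] };
}

const v2Tree: SketchNode = {
  anchor: 'Post',
  value: null,
  children: [
    { edge: 'text', node: leaf('Text', 'Hello, world!') },
    { edge: 'createdAt', node: leaf('CreatedAt', '2025-01-15T12:00:00Z') },
    { edge: 'tags', node: leaf('Tags', ['greeting', 'test']) },
  ],
};

const survivingAnchors = ['Post', 'Text', 'CreatedAt'];
const v1Record = toRecord(restrictTree(v2Tree, survivingAnchors));
console.log(JSON.stringify(v1Record));
// {"text":"Hello, world!","createdAt":"2025-01-15T12:00:00Z"}
```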

Exercise

What happens if you lift a v2 record with tags: [] (empty array)? Does the output differ from lifting a record with no tags field at all?

Solution

The output is the same either way. The lift operation prunes the Tags node regardless of whether the array is empty or populated, because v1’s schema has no vertex corresponding to Tags. The distinction between “field present but empty” and “field absent” exists in v2 but is erased by the migration.

5.5 The same pattern in SQL

To see that the framework is not specific to tree-shaped data, here is the same “add a column” migration in SQL, where the instance theory is \(\text{ThFunctor}\).

The v1 schema has a posts table with columns id, text, and created_at. The v2 schema adds a tags column:

ALTER TABLE posts ADD COLUMN tags text[] DEFAULT '{}';

The migration morphism maps posts → posts and all v1 columns to their v2 counterparts.

A v2 instance (a set of rows) looks like this:

id	text	created_at	tags
1	“Hello, world!”	2025-01-15	{“greeting”, “test”}
2	“Another post”	2025-01-16	{}

The lift (restriction via precomposition) produces the same set of rows, viewed through v1’s schema, which has no tags column:

id	text	created_at
1	“Hello, world!”	2025-01-15
2	“Another post”	2025-01-16

The tags column vanishes because v1’s schema has no morphism corresponding to it.
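For the row-shaped case, restriction amounts to keeping only the columns that v1 knows about. The sketch below models rows as plain objects and involves no database. `SketchRow` and `projectRows` are hypothetical names, and the data is the table shown above.

```typescript
// Hypothetical row encoding for illustration; no database is involved.
type SketchRow = { [column: string]: string | number | string[] };

// Keeps exactly the surviving columns of every row, in the given order.
function projectRows(rows: SketchRow[], survivingColumns: string[]): SketchRow[] {
  return rows.map((row) => {
    const projected: SketchRow = {};
    survivingColumns.forEach((column) => {
      projected[column] = row[column];
    });
    return projected;
  });
}

const v2Rows: SketchRow[] = [
  { id: 1, text: 'Hello, world!', created_at: '2025-01-15', tags: ['greeting', 'test'] },
  { id: 2, text: 'Another post', created_at: '2025-01-16', tags: [] },
];

const v1Rows = projectRows(v2Rows, ['id', 'text', 'created_at']);
console.log(JSON.stringify(v1Rows));
// [{"id":1,"text":"Hello, world!","created_at":"2025-01-15"},{"id":2,"text":"Another post","created_at":"2025-01-16"}]
```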

Same migration, different mechanics

The ATProto and SQL versions of “add a field” produce the same observable result: the new field is dropped when migrating backward. The mechanism differs. ATProto prunes a tree node (W-type restriction). SQL projects away a column (functor restriction via precomposition). The schema morphism is the same kind of mathematical object in both cases, but the data-level operation is determined by the instance theory.

5.6 Composing migrations

Schemas rarely change just once. If you have a migration \(m_{12}\colon S_1 \to S_2\) and another \(m_{23}\colon S_2 \to S_3\), their composite \(m_{13} = m_{23} \circ m_{12}\colon S_1 \to S_3\) is also a valid migration.

For our running example, suppose v3 adds a langs field (language codes) on top of v2. Then:

\(m_{12}\): v1 \(\to\) v2 (adds tags)
\(m_{23}\): v2 \(\to\) v3 (adds langs)
\(m_{13} = m_{23} \circ m_{12}\): v1 \(\to\) v3 (adds both tags and langs)

Lifting through the composite is equivalent to lifting through each step in sequence. For set-valued map instances, this is a theorem (precomposition is contravariantly compositional). For W-type instances, the engine verifies the property for each migration pair, but in practice it holds whenever both migrations pass the existence checks described in the next chapter.
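For the set-valued case, the claim can be checked directly on a single row. The sketch below is only an illustration of precomposition on plain objects: `composeMaps` and `liftRow` are hypothetical helpers, and the v3 row with its langs value is made-up sample data.

```typescript
// Hypothetical encodings for illustration; not panproto's API.
type ColumnMap = { [sourceColumn: string]: string };
type SketchRow = { [column: string]: string | string[] };

// Composite of first: A -> B and second: B -> C, as a map A -> C.
function composeMaps(first: ColumnMap, second: ColumnMap): ColumnMap {
  const composite: ColumnMap = {};
  Object.keys(first).forEach((source) => {
    composite[source] = second[first[source]];
  });
  return composite;
}

// Lift by precomposition: each source column reads the target column it maps to.
function liftRow(map: ColumnMap, targetRow: SketchRow): SketchRow {
  const sourceRow: SketchRow = {};
  Object.keys(map).forEach((sourceColumn) => {
    sourceRow[sourceColumn] = targetRow[map[sourceColumn]];
  });
  return sourceRow;
}

const m12: ColumnMap = { text: 'text', createdAt: 'createdAt' };
const m23: ColumnMap = { text: 'text', createdAt: 'createdAt', tags: 'tags' };
const m13 = composeMaps(m12, m23);

const v3Row: SketchRow = {
  text: 'Hello, world!',
  createdAt: '2025-01-15T12:00:00Z',
  tags: ['greeting'],
  langs: ['en'],
};

// Lift v3 -> v2 -> v1 step by step, then v3 -> v1 through the composite.
const viaSteps = liftRow(m12, liftRow(m23, v3Row));
const viaComposite = liftRow(m13, v3Row);
console.log(JSON.stringify(viaSteps) === JSON.stringify(viaComposite)); // true
```

Note the reversal of order: the maps compose from v1 toward v3, while the lifts run from v3 back toward v1.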

Composability is what makes migration scalable. You maintain a chain \(S_1 \to S_2 \to \cdots \to S_n\), and the engine composes as needed.

5.7 Migrations beyond adding fields

Adding a field is the simplest case. The same framework handles much more complex structural changes.

Renaming a field. Suppose v2 renames text to body. The vertex map still sends \(\text{Text} \mapsto \text{Text}\) (the type has not changed), but the edge map sends \(\text{text} \mapsto \text{body}\). During edge resolution, the engine relabels body back to text, producing a record with the old field name. The data is untouched; only the label changes.

Extracting a nested object. Suppose v1 has a flat post with text, createdAt, and authorName. v2 extracts authorName into a nested author object with a name field. The vertex map sends v1’s AuthorName to v2’s Name (inside the Author object). The edge map sends v1’s authorName to the composed path author.name. During ancestor contraction, the engine recognizes that Author is an intermediate vertex not present in v1 and contracts the path Post → Author → Name to Post → AuthorName.

Removing a required field. Suppose v2 drops createdAt entirely. The vertex map does not include CreatedAt, and the edge map does not include createdAt. If v1 requires createdAt, this migration may not exist: there is no valid way to produce a v1 record without it. The engine detects this during compilation and reports an existence failure. The next chapter covers the precise conditions.
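Of the three cases above, renaming is the easiest to sketch. The code below inverts an edge map into a remap table and relabels a flat v2 record back to v1 field names. `invertEdgeMap` and `relabelRecord` are hypothetical helpers, the edge map is assumed to be injective, and nothing here reflects how the engine actually performs edge resolution.

```typescript
// Hypothetical encodings for illustration; not panproto's API.
type FieldMap = { [fromField: string]: string };
type SketchRecord = { [field: string]: string };

// Inverts a v1 -> v2 edge map (assumed injective) into a v2 -> v1 remap table.
function invertEdgeMap(edgeMap: FieldMap): FieldMap {
  const remap: FieldMap = {};
  Object.keys(edgeMap).forEach((v1Field) => {
    remap[edgeMap[v1Field]] = v1Field;
  });
  return remap;
}

// Relabels every v2 field that has a v1 preimage and drops the rest.
function relabelRecord(v2Record: SketchRecord, edgeMap: FieldMap): SketchRecord {
  const remap = invertEdgeMap(edgeMap);
  const v1Record: SketchRecord = {};
  Object.keys(v2Record).forEach((v2Field) => {
    if (Object.prototype.hasOwnProperty.call(remap, v2Field)) {
      v1Record[remap[v2Field]] = v2Record[v2Field];
    }
  });
  return v1Record;
}

// v2 renamed text to body, so the edge map sends text to body.
const renameTextToBody: FieldMap = { text: 'body', createdAt: 'createdAt' };

const lifted = relabelRecord(
  { body: 'Hello, world!', createdAt: '2025-01-15T12:00:00Z' },
  renameTextToBody
);
console.log(JSON.stringify(lifted));
// {"text":"Hello, world!","createdAt":"2025-01-15T12:00:00Z"}
```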

5.8 Cross-format migration

The same machinery supports migration across protocols. Because panproto represents every schema as a graph regardless of origin format, you can migrate between formats by composing three steps: parse a Protobuf schema into a schema graph, build a migration morphism mapping Protobuf concepts to Avro concepts (vertex by vertex and edge by edge), and emit the result as an Avro schema by applying the Avro emitter to the target graph.

The migration morphism in the second step is the same kind of object as the intra-protocol migrations above. panproto validates that the morphism respects both theories and that source constraints are satisfiable in the target. Chapter 11 covers cross-format conversion in depth.

5.9 Or, skip all that

Everything in this chapter—\(f_V\), \(f_E\), resolvers—can be computed automatically. panproto’s homomorphism search finds the best migration between two schemas without manual specification. For the common case of adding or removing fields, schema add in the VCS derives migrations automatically. For complex structural changes and cross-protocol translation, the full machinery is in Chapter 13.

5.10 Terminology

Table 5.1: How panproto’s vocabulary relates to other communities.

panproto	CQL / Spivak	Cambria	SQL
Migration morphism	Schema morphism	Lens	ALTER TABLE + mapping
Compile	(implicit)	Compile lens	Query planning
Lift (backward)	`restrict`	Convert	View / projection
Fill (forward)	`extend` / `complete`	(manual)	INSERT with defaults

panproto’s terminology is closest to the Cambria project (Litt et al. 2022), which was an important practical inspiration. The key difference is that Cambria hardcodes its schema language and migration semantics, while panproto parameterizes both by theories.

Every migration in this chapter succeeded. Not every proposed migration is valid. The next chapter examines what goes wrong and how the engine detects it.

Litt, Geoffrey, Martin Kleppmann, and Marc Shapiro. 2022. Project Cambria: Schema Evolution for CRDTs. Ink & Switch research essay. https://www.inkandswitch.com/cambria/.

Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217: 31–51. https://arxiv.org/abs/1009.1166.

¹ In categorical language, the lift operation is the restriction \(\Delta_f\) along a theory morphism \(f\). See Spivak (2012) and Appendix A.
² When the schema theory is richer than a plain graph (for example, SQL with foreign keys), the morphism generalizes to a theory morphism that additionally preserves equations.