21  Versioning Data with Schemas

You have a repository with three commits, each changing the schema. Your records/ directory still holds data from the first version. It's stale. For three commits, manual migration is manageable. For thirty commits across five branches with merges, it's not.

The schema data migrate command walks the commit DAG, derives a protolens chain for each step, and applies the composed chain to every record. Given a data set \(D\) conforming to schema \(S_i\) and a target schema \(S_j\), it computes the DAG path \(S_i \to S_{i+1} \to \cdots \to S_j\), constructs the composed protolens chain \(\pi = \pi_{j-1} \circ \cdots \circ \pi_i\), and applies the resulting lens \(\ell = \text{inst}(\pi, S_i)\) to each record \(d \in D\). Complements are stored for reversibility.
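The composition can be modeled concretely. The following is a toy Python sketch, not the panproto implementation: the step constructors are hypothetical stand-ins for derived protolens steps, and records are plain dicts.

```python
# Toy model of composing per-commit lens steps and applying them to records.
# Step names (rename_vertex, add_vertex) are illustrative stand-ins for
# derived protolens steps, not the panproto API.

def rename_vertex(old, new):
    def step(record):
        record = dict(record)
        record[new] = record.pop(old)
        return record
    return step

def add_vertex(name, default):
    def step(record):
        record = dict(record)
        record.setdefault(name, default)
        return record
    return step

def compose(steps):
    """Compose steps left-to-right: the first step in the list runs first."""
    def chain(record):
        for step in steps:
            record = step(record)
        return record
    return chain

# Path S_1 -> S_2 -> S_3: rename userName, then add a version field.
pi = compose([rename_vertex("userName", "handle"),
              add_vertex("version", 1)])

print(pi({"userName": "alice"}))  # {'handle': 'alice', 'version': 1}
```

Note that the mathematical notation \(\pi_{j-1} \circ \cdots \circ \pi_i\) reads right-to-left; the list passed to compose is the same chain written in application order.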

21.1 Staging data alongside schema changes

The --data flag on schema add stages data files alongside the schema change:

schema add schema-v2.json --data records/

This does two things:

  1. Stages schema-v2.json as the next schema version (the existing behavior).
  2. Snapshots the contents of records/ as a single DataSetObject, content-addressed and bound to the current schema version (the one being replaced).

The snapshot is a DataSetObject: a blob of MessagePack-encoded instances with a schema_id pointing to the schema they conform to and a record_count for quick inspection.

schema data status records/

Output:

On branch main
Changes to be committed:
  modified: schema (schema-v2.json)
  new data: records/ (42 records, bound to schema abc123)

The data snapshot is immutable. Once committed, it serves as the canonical “before” image for forward migration and as the restore target for backward migration.

Caution

What if you stage data without a schema change?

schema add --data records/ snapshots data bound to the current schema version. If you haven’t changed the schema, does this produce a redundant snapshot, or does content-addressing deduplicate it?

Content-addressing deduplicates it. schema add --data records/ computes the blake3 hash of the data snapshot. If an identical snapshot already exists in the object store (because the data hasn’t changed since the last commit), no new object is created. The commit records the same data_id as before. Storage cost is zero for an unchanged data set.
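The deduplication argument is easy to model. A minimal sketch, using a dict as the object store and sha256 as a stdlib stand-in for blake3 (the choice of hash doesn't change the argument):

```python
import hashlib

# Toy content-addressed object store. panproto uses blake3; sha256 is a
# stdlib stand-in here and doesn't affect the deduplication behavior.
store = {}

def put_snapshot(snapshot_bytes):
    object_id = hashlib.sha256(snapshot_bytes).hexdigest()
    created = object_id not in store
    store.setdefault(object_id, snapshot_bytes)
    return object_id, created

data = b"serialized records"            # stand-in for MessagePack instances
id1, created1 = put_snapshot(data)      # first commit: new object stored
id2, created2 = put_snapshot(data)      # data unchanged: same object_id
assert id1 == id2 and created1 and not created2
assert len(store) == 1                  # zero additional storage cost
```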

21.2 Running migrations

schema data migrate records/

This command:

  1. Reads each record in records/ and determines which schema version it conforms to (via the schema_id stored in the data snapshot).
  2. Finds the path in the commit DAG from that schema version to HEAD.
  3. Derives a protolens chain for each step along the path.
  4. Composes the chains and applies the result to every record.
  5. Stores complements for each step as ComplementObject values in the object store.
  6. Writes the migrated records back to records/.
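Step 5 is where lossy steps leave a trace. A toy sketch of a lossy RemoveVertex step that captures the removed values as a complement (names are illustrative, not the panproto implementation):

```python
# Toy forward migration through a lossy RemoveVertex step: the removed
# field's values are captured in a complement keyed by record index, so
# the step can later be reversed.

def remove_vertex(name, records):
    complement = {}
    migrated = []
    for i, record in enumerate(records):
        record = dict(record)
        if name in record:
            complement[i] = record.pop(name)
        migrated.append(record)
    return migrated, complement

records = [{"handle": "alice", "email": "a@example.com"},
           {"handle": "bob", "email": "b@example.com"}]
migrated, complement = remove_vertex("email", records)
assert all("email" not in r for r in migrated)
assert complement == {0: "a@example.com", 1: "b@example.com"}
```

A lossless step (a rename, say) would produce an empty complement, which is why the dry-run output above can report "0 bytes (all steps lossless)".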

The --dry-run flag shows what would happen without writing:

schema data migrate records/ --dry-run
Dry run: 42 records in records/
  Path: abc123 -> def456 -> HEAD (2 steps)
  Step 1: RenameVertex("userName" -> "handle") : lossless
  Step 2: AddVertex("version", default: 1) : lossless
  Complement: 0 bytes (all steps lossless)
  Result: 42 records migrated, 0 bytes complement stored

The --range flag restricts migration to a specific commit range:

schema data migrate records/ --range HEAD~3..HEAD

The --backward flag migrates data in the reverse direction, restoring from stored complements:

schema data migrate records/ --backward

The -o flag writes migrated records to a different directory instead of overwriting:

schema data migrate records/ -o migrated/

21.3 Complement storage and reversibility

Forward migration through a RemoveVertex step is lossy: the removed vertex’s data disappears from the migrated record. The complement captures that data.

panproto stores complements as ComplementObject values in the same content-addressed object store used for schemas and commits. A ComplementObject contains:

  • migration_id: the ObjectId of the migration that produced the complement.
  • data_id: the ObjectId of the DataSetObject the complement was computed from.
  • The complement data itself: serialized as MessagePack.

When you run schema data migrate records/ --backward, the system:

  1. Finds the complement objects associated with the current data snapshot.
  2. Applies the inverse migration, feeding complement data back into the restoration.
  3. Produces records conforming to the earlier schema version.

Backward migration fails if no complement exists for a given step. This happens when the forward migration was lossless (no complement needed) or when the complement was garbage-collected. Lossless steps are always reversible without complements.
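The restore direction consumes the complement produced by the forward step. A toy counterpart to the RemoveVertex sketch above (illustrative, not the panproto API):

```python
# Toy backward migration: feed the stored complement back into the record
# to reconstruct the pre-migration state exactly.

def restore_vertex(name, migrated, complement):
    restored = []
    for i, record in enumerate(migrated):
        record = dict(record)
        if i in complement:
            record[name] = complement[i]
        restored.append(record)
    return restored

original   = [{"handle": "alice", "email": "a@example.com"}]
migrated   = [{"handle": "alice"}]
complement = {0: "a@example.com"}
assert restore_vertex("email", migrated, complement) == original
```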

Tip

Complements are content-addressed. Two migrations that produce identical complement data share the same object. This deduplication keeps storage costs proportional to the actual data lost during migration, not the number of migration operations.

21.4 Detecting stale data

schema data status records/
On branch main
Data staleness:
  records/ : 42 records at schema abc123 (3 commits behind HEAD)
    Path: abc123 -> def456 -> ghi789 -> HEAD
    Steps: RenameVertex, AddVertex, RemoveEdge
    Complement required: yes (RemoveEdge is lossy)

The detect_staleness function compares each data snapshot’s schema_id against HEAD. If they differ, it computes the DAG path and reports the number of steps, the transform types, and whether complements will be generated.
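The core of the check is an ancestor walk. A toy sketch under the assumption of a linear parent chain (the hash values mirror the output above; the real DAG traversal is more general):

```python
# Toy staleness check: walk parent links from HEAD back to the snapshot's
# schema_id and count the steps. The parent map is hypothetical.
parents = {"def456": "abc123", "ghi789": "def456", "HEAD": "ghi789"}

def staleness(snapshot_schema_id, head="HEAD"):
    steps, cur = 0, head
    while cur != snapshot_schema_id:
        cur = parents[cur]   # a KeyError here would mean: not an ancestor
        steps += 1
    return steps

assert staleness("abc123") == 3   # matches "3 commits behind HEAD"
assert staleness("HEAD") == 0     # up to date
```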

This integrates with CI. A non-zero staleness count can fail a pipeline, ensuring data is always migrated before deployment.

Caution

How does staleness interact with branches?

If records/ was migrated to HEAD on main, and you switch to a feature branch whose HEAD is three commits ahead of main, does schema data status report staleness relative to feature’s HEAD or main’s?

Staleness is always relative to the current branch’s HEAD. If records/ was migrated to HEAD on main and you switch to feature (whose HEAD is three commits ahead), schema data status reports staleness relative to feature’s HEAD. The schema_id stored in the data snapshot is compared against the current HEAD’s schema, regardless of which branch created the snapshot.

21.5 Switching branches with data

schema checkout feature --migrate records/

This calls Repository::checkout_with_data(target, data_dir) under the hood. The sequence:

  1. Resolve the target branch’s HEAD commit.
  2. Find the DAG path from the current commit to the target commit (this may go “up” to a common ancestor and “down” to the target).
  3. Migrate data along that path, storing complements for each forward step and consuming complements for each backward step.
  4. Switch HEAD to the target branch.
  5. Write migrated records to data_dir.
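The up-then-down path in step 2 can be sketched as follows. This is a toy model on a hypothetical two-branch DAG, not panproto's internal representation:

```python
# Toy up-then-down path through the commit DAG: climb from the current
# commit to the common ancestor, then descend to the target.
parents = {"b1": "base", "b2": "b1",     # main:    base -> b1 -> b2
           "f1": "base", "f2": "f1"}     # feature: base -> f1 -> f2

def ancestors(commit):
    chain = [commit]
    while commit in parents:
        commit = parents[commit]
        chain.append(commit)
    return chain

def checkout_path(current, target):
    up = ancestors(current)
    down = ancestors(target)
    common = next(c for c in up if c in down)
    # Backward steps first (consuming complements), then forward steps.
    return up[:up.index(common) + 1] + list(reversed(down[:down.index(common)]))

assert checkout_path("b2", "f2") == ["b2", "b1", "base", "f1", "f2"]
```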

Without --migrate, schema checkout switches the schema but leaves data untouched; a subsequent schema data status will then report the data as stale.

21.6 Merging with data

schema merge feature --migrate records/

This calls Repository::merge_with_data(branch, author, data_dir). The merge proceeds as usual (three-way pushout on schemas), and then data is migrated from both branches to the merged schema:

  1. Compute the merge result (schema pushout + conflict detection).
  2. If the merge succeeds, derive protolens chains from each branch head to the merged schema.
  3. Apply the chain from the current branch’s schema to the merged schema on records/.
  4. Store complements for each step.

If the merge has conflicts, data migration is deferred until the conflicts are resolved.
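After a successful pushout, each branch's records only need the steps contributed by the other branch. A toy illustration under that assumption (step names are illustrative, not the panproto API):

```python
# Toy merge-time migration: main renamed userName -> handle, feature added
# a version field. Records already at main's schema reach the merged
# schema by applying feature's contribution alone.

def add_vertex(name, default):
    return lambda r: {**r, name: r.get(name, default)}

records_on_main = [{"handle": "alice"}, {"handle": "bob"}]
to_merged = add_vertex("version", 1)
merged_records = [to_merged(r) for r in records_on_main]
assert merged_records == [{"handle": "alice", "version": 1},
                          {"handle": "bob", "version": 1}]
```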

21.7 Protocol versioning

Protocols (the theories that define what a valid schema looks like) can themselves evolve. A new version of ATProto might add a sort, change an equation, or rename an operation. panproto stores protocol definitions as first-class Object::Protocol values in the commit DAG.

schema add schema.json --protocol atproto-v2.json

This stages both the schema and the protocol definition. The resulting CommitObject stores a protocol_id field pointing to the protocol object, pinning the commit to a specific protocol version.
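The shape of such a commit can be sketched as a record with one extra field. The field names schema_id and protocol_id follow the text; the rest of the structure is illustrative:

```python
# Toy shape of a commit pinned to a protocol version.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CommitObject:
    schema_id: str
    protocol_id: str       # pins this commit to a protocol version
    parent: Optional[str]
    message: str

c = CommitObject(schema_id="def456", protocol_id="atproto-v2",
                 parent="abc123", message="upgrade to ATProto v2")
assert c.protocol_id == "atproto-v2"
```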

Repository::add_protocol(protocol) is the library-level equivalent:

repo.add_protocol(&atproto_v2)?;
repo.add(&schema)?;
repo.commit("upgrade to ATProto v2", "alice")?;

Protocol versioning matters for long-lived repositories where the metaschema itself changes over time. Without it, there’s no record of which protocol a historical commit was validated against.

Caution

What happens during migration across a protocol change?

If commit \(C_1\) uses ATProto v1 and commit \(C_2\) uses ATProto v2, does schema data migrate compose protolens chains that span the protocol change? Or does it require separate migration steps?

schema data migrate composes protolens chains that span the protocol change. The DAG path from \(C_1\) to \(C_2\) includes the commit that introduced the new protocol, and that commit’s migration morphism encodes the theory-level mapping between ATProto v1 and ATProto v2. The protolens chain includes both the protocol upgrade step (a theory morphism application) and any schema-level changes. No separate migration steps are required; the composed chain handles both levels.

21.8 Full data versioning workflow

A worked example combining all the pieces:

# Initialize and commit initial schema + data
schema init
schema add schema-v1.json --data records/
schema commit -m "initial schema with data"

# Branch and evolve
schema branch feature
schema checkout feature --migrate records/
# Edit schema-v2.json (rename a field, add a vertex)
schema add schema-v2.json --data records/
schema commit -m "rename userName to handle, add version field"

# Check staleness on main
schema checkout main --migrate records/
schema data status records/
# records/ is back at v1; no staleness

# Merge feature branch with data migration
schema merge feature --migrate records/
# records/ now conforms to the merged schema

# Verify: no staleness
schema data status records/
# records/ : 42 records at schema HEAD (up to date)

# Backward migration if needed
schema data migrate records/ --backward
# records/ restored to pre-merge state using stored complements

21.9 Incremental migration

Batch migration (schema data migrate) re-processes every record. For live systems where edits arrive one at a time, schema data sync provides incremental migration:

schema data sync records/

With the --edits flag, the command records an EditLogObject in the VCS, capturing the translated edit sequence:

schema data sync records/ --edits

The EditLogObject stores the schema ID, data set ID, the MessagePack-encoded edits, the edit count, and the final complement state. This enables replaying the incremental migration or auditing which edits were applied.
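Replay is then a fold over the log. A toy sketch with a simplified edit encoding (tuples instead of MessagePack; the edit vocabulary here is illustrative, not the TreeEdit monoid of Chapter 25):

```python
# Toy replay of an edit log: each recorded edit is applied in order,
# reproducing the incremental migration deterministically.

def apply_edit(record, edit):
    op = edit[0]
    if op == "rename":
        _, old, new = edit
        record[new] = record.pop(old)
    elif op == "set":
        _, key, value = edit
        record[key] = value
    return record

def replay(record, edit_log):
    for edit in edit_log:
        record = apply_edit(dict(record), edit)
    return record

log = [("rename", "userName", "handle"), ("set", "version", 1)]
assert replay({"userName": "alice"}, log) == {"handle": "alice", "version": 1}
```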

For the theory behind incremental migration (the EditLens, the TreeEdit monoid, and the edit lens laws), see Chapter 25.

21.10 Exercises

  1. Create a repository with two schema versions where the second removes a vertex. Commit data at v1, migrate forward, and verify the complement is stored. Then migrate backward and confirm the original data is restored byte-for-byte.

  2. Create a branch that renames a field and another branch that adds a field. Merge them with --migrate and inspect the resulting records. What does the complement contain?

  3. Use schema data migrate --dry-run to estimate complement sizes for a chain of five schema changes. Which steps are lossless? Which produce complements?

  4. Stage a protocol definition with schema add --protocol and inspect the resulting commit with schema show. Where does the protocol_id appear?

  5. Write a CI script that runs schema data status records/ and fails the build if any data is stale.