9 Lifting Data Through Schema Changes
The previous chapters described migration abstractly: a schema morphism maps vertices to vertices and edges to edges, and the engine “lifts” data from one schema to another. This chapter opens the hood. We will trace the W-type restrict pipeline step by step on a real recursive data structure, then briefly cover functor restrict (the SQL path), and close with the performance characteristics that make it practical.
9.1 A worked example: threadViewPost
The ATProto threadViewPost type is a post that contains an array of replies, each of which is itself a threadViewPost. The recursion means the instance is a tree of arbitrary depth.
Here is a simplified v1 schema:
threadViewPost
├── post: postView
│ ├── uri: string
│ ├── author: profileViewBasic
│ │ ├── did: string
│ │ ├── handle: string
│ │ └── displayName: string
│ ├── text: string
│ └── likeCount: integer
└── replies: array<threadViewPost>
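As a minimal sketch, the v1 shape can be mirrored in Rust types (the struct and field names below simply transliterate the tree above; the `thread_depth` helper is a hypothetical illustration, not part of panproto):

```rust
// hypothetical Rust mirror of the v1 schema tree above
#[derive(Debug, Clone)]
struct ProfileViewBasic {
    did: String,
    handle: String,
    display_name: String,
}

#[derive(Debug, Clone)]
struct PostView {
    uri: String,
    author: ProfileViewBasic,
    text: String,
    like_count: i64,
}

#[derive(Debug, Clone)]
struct ThreadViewPost {
    post: PostView,
    replies: Vec<ThreadViewPost>, // recursion: a tree of arbitrary depth
}

// depth of the reply tree, to make the recursion tangible
fn thread_depth(t: &ThreadViewPost) -> usize {
    1 + t.replies.iter().map(thread_depth).max().unwrap_or(0)
}
```

The recursive `replies` field is what makes instances trees of arbitrary depth rather than flat records.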
Version 2 drops likeCount from postView and displayName from profileViewBasic. A migration \(m\colon S_2 \to S_1\) exists (both changes are field removals, which are always backward-compatible).
Now consider a concrete instance: a thread with two replies, one of which has a nested reply.
root (threadViewPost)
├── post (postView)
│ ├── uri = "at://did:plc:abc/post/1"
│ ├── author (profileViewBasic)
│ │ ├── did = "did:plc:abc"
│ │ ├── handle = "alice.bsky.social"
│ │ └── displayName = "Alice"
│ ├── text = "Hello world"
│ └── likeCount = 42
└── replies
├── reply-1 (threadViewPost)
│ ├── post (postView)
│ │ ├── uri = "at://did:plc:def/post/2"
│ │ ├── author (profileViewBasic)
│ │ │ ├── did = "did:plc:def"
│ │ │ ├── handle = "bob.bsky.social"
│ │ │ └── displayName = "Bob"
│ │ ├── text = "Nice post"
│ │ └── likeCount = 7
│ └── replies
│ └── reply-1-1 (threadViewPost)
│ └── ...
└── reply-2 (threadViewPost)
└── ...
This is the instance we will push through the five-step pipeline.
9.2 The five-step pipeline
The W-type restrict algorithm transforms an instance tree in five sequential steps.1
9.2.1 Step 1: anchor survival
Every node in the instance sits over a vertex in the schema. That vertex is the node’s anchor: it determines what children the node can have, what edges connect it to its parent, and what data it carries.
Step 1 walks every node and checks whether its anchor vertex is in the surviving set (the image of the migration’s vertex map). If not, the node is marked for removal.
/// mark each node as surviving or pruned based on whether
/// its anchor vertex is in the migration's vertex map.
fn anchor_surviving(
    instance: &WTree,
    vertex_map: &HashMap<VertexId, VertexId>,
) -> Vec<bool> {
    instance.nodes.iter().map(|node| {
        vertex_map.values().any(|v| *v == node.anchor)
    }).collect()
}

In our example, the vertices likeCount and displayName are not in the image of the vertex map. Every node anchored to either is marked for removal: root.post.likeCount, root.post.author.displayName, and their counterparts in every nested reply. Everything else survives.
9.2.2 Step 2: reachability
Step 1 marks individual nodes, but some survivors may be stranded behind pruned parents.
Step 2 runs a breadth-first search from the root, traversing only through surviving nodes. Any surviving node not reached by the BFS is marked as unreachable.
/// BFS from root through surviving nodes.
fn reachable_from_root(
    instance: &WTree,
    surviving: &[bool],
) -> HashSet<u32> {
    let mut visited = HashSet::new();
    let mut queue = VecDeque::new();
    if surviving[instance.root as usize] {
        queue.push_back(instance.root);
    }
    while let Some(idx) = queue.pop_front() {
        if !visited.insert(idx) {
            continue;
        }
        for child in instance.children(idx) {
            if surviving[child as usize] {
                queue.push_back(child);
            }
        }
    }
    visited
}

In our example, this step is a no-op: the pruned nodes (likeCount, displayName) are all leaves, so removing them does not disconnect anything. But consider a different migration that removes postView while keeping text. The text nodes would survive step 1 but become unreachable in step 2, because their parent (postView) was pruned and no path from the root reaches them through survivors alone. Step 3 rescues them.
9.2.3 Step 3: ancestor contraction
When intermediate nodes are pruned, surviving children need new parents. Step 3 re-parents each surviving child to its nearest surviving ancestor.
/// for each surviving node, find its nearest surviving ancestor.
fn ancestor_contraction(
    instance: &WTree,
    surviving: &HashSet<u32>,
) -> HashMap<u32, u32> {
    let mut new_parent = HashMap::new();
    for &idx in surviving {
        if idx == instance.root {
            continue;
        }
        let mut ancestor = instance.parent(idx);
        while !surviving.contains(&ancestor) {
            ancestor = instance.parent(ancestor);
        }
        new_parent.insert(idx, ancestor);
    }
    new_parent
}

Note that step 3 iterates over the nodes that survived step 1, not the reachable set from step 2: this is exactly how survivors stranded behind pruned parents are rescued. The visual intuition, for a tree where B and D are pruned while C survives:
graph TD
A["A (root)"] --> B["B (removed)"]
B --> C["C (surviving)"]
B --> D["D (removed)"]
A --> E["E (surviving)"]
After contraction, C becomes a direct child of A:
graph TD
A["A (root)"] --> C["C (re-parented)"]
A --> E["E (surviving)"]
In our threadViewPost example, the pruned nodes are all leaves, so no contraction occurs. But imagine a migration that removes the postView wrapper while keeping uri, author, and text as direct children of threadViewPost:
Before:

threadViewPost
├── postView (removed)
│   ├── uri
│   ├── author
│   │   ├── did
│   │   ├── handle
│   │   └── displayName
│   ├── text
│   └── likeCount
└── replies
    └── ...

After:

threadViewPost
├── uri
├── author
│   ├── did
│   └── handle
├── text
└── replies
    └── ...
Contraction never pushes nodes deeper. A node’s depth in the contracted tree is always less than or equal to its depth in the original.
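The depth claim is easy to see on a toy encoding. The sketch below assumes a parent-array representation invented for this example (parent[i] is i's parent, with the root its own parent), not panproto's actual data structure:

```rust
use std::collections::HashSet;

// depth of node i in a parent-array tree where parent[root] == root
fn depth(parent: &[usize], mut i: usize) -> usize {
    let mut d = 0;
    while parent[i] != i {
        i = parent[i];
        d += 1;
    }
    d
}

// step-3 sketch: walk up from i's parent to its nearest surviving ancestor
fn contracted_parent(parent: &[usize], surviving: &HashSet<usize>, i: usize) -> usize {
    let mut a = parent[i];
    while !surviving.contains(&a) && parent[a] != a {
        a = parent[a];
    }
    a
}
```

Because the contracted parent is always a (strict) ancestor of the original parent, every hop removed from the chain can only shrink the node's depth.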
9.2.4 Step 4: edge resolution
Contraction gave every surviving child a new parent, but the edge connecting them still needs a label from the target schema. In the original tree, the path from A to C went through B: \(A \xrightarrow{e_1} B \xrightarrow{e_2} C\). After removing B, we need a single edge label for the direct \(A \to C\) connection.
In a simple graph, there is at most one edge between any pair of vertices, so the choice is trivial. But schemas are often multigraphs (an object with two fields of the same type has parallel edges), and the choice becomes ambiguous. The resolver table, part of the migration specification, provides the deterministic answer.
/// given a contracted parent-child pair, look up the correct
/// edge label from the resolver table.
fn resolve_edge(
    parent_anchor: VertexId,
    child_anchor: VertexId,
    contracted_path: &[VertexId],
    resolver: &ResolverTable,
) -> EdgeId {
    resolver
        .lookup(parent_anchor, child_anchor, contracted_path)
        .expect("migration guarantees a resolution exists")
}

The existence checker from the previous chapter guarantees that a valid resolution exists for every contracted path, so resolve_edge never panics on a well-formed migration.
Consider a schema in which threadViewPost has two fields of type profileViewBasic: author and moderator. A migration removes the wrapper object that contained both. After ancestor contraction, both profileViewBasic nodes are re-parented to threadViewPost. How does the resolver disambiguate?
The resolver table is keyed by the full contracted path, not just the endpoint pair. The first node’s path threadViewPost → wrapper → author → profileViewBasic maps to the author edge. The second node’s path threadViewPost → wrapper → moderator → profileViewBasic maps to the moderator edge. The intermediate path distinguishes them even though the endpoints are the same.
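A minimal sketch of path-keyed resolution, with hypothetical numeric ids standing in for vertices and edges (this is illustration only, not panproto's resolver):

```rust
use std::collections::HashMap;

type VertexId = u32;
type EdgeId = u32;

/// resolver table keyed by the full contracted path, not the endpoint pair
struct PathResolver {
    by_path: HashMap<Vec<VertexId>, EdgeId>,
}

impl PathResolver {
    fn lookup(&self, path: &[VertexId]) -> Option<EdgeId> {
        // Vec<u32> borrows as [u32], so a slice key works directly
        self.by_path.get(path).copied()
    }
}
```

With thread = 0, wrapper = 1, author = 2, moderator = 3, profileViewBasic = 4, the two contracted paths share endpoints (0, 4) but map to different edge labels, which is exactly the ambiguity the endpoint pair alone cannot resolve.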
9.2.5 Step 5: fan reconstruction
The final step handles hypergraph schemas, where a single edge connects a parent to multiple children simultaneously. In practice this means SQL: a table row is a hyperedge connecting a primary key to all its column values.
When some columns are pruned, the surviving columns must be collected into a new, smaller hyperedge.
/// reconstruct fans (hyperedges) after pruning some children.
fn reconstruct_fans(
    instance: &WTree,
    reachable: &HashSet<u32>,
    fan_map: &HashMap<FanId, FanId>,
) -> Vec<Fan> {
    instance.fans.iter().filter_map(|fan| {
        let new_children: Vec<u32> = fan.children.iter()
            .copied()
            .filter(|c| reachable.contains(c))
            .collect();
        if new_children.is_empty() {
            return None;
        }
        let new_fan_id = fan_map.get(&fan.id)?;
        Some(Fan {
            id: *new_fan_id,
            parent: fan.parent,
            children: new_children,
        })
    }).collect()
}

For pure tree schemas (JSON, CBOR, Protobuf), step 5 is a no-op. It exists because panproto's data model handles both trees and hypergraphs uniformly.
9.2.6 Summary
| Step | Operation | When it matters |
|---|---|---|
| 1 | `anchor_surviving` | Always |
| 2 | `reachable_from_root` | Intermediate nodes pruned |
| 3 | `ancestor_contraction` | Children of pruned nodes survive |
| 4 | `resolve_edge` | Multigraph schemas |
| 5 | `reconstruct_fans` | Hypergraph (SQL) schemas |
For a typical JSON schema migration that adds or removes fields, only step 1 does meaningful work. Steps 2 through 5 are structural no-ops. The common case is fast by design.
9.3 Tracing the full pipeline
For the threadViewPost example:
Step 1 marks likeCount and displayName nodes as pruned. Our example thread contains four posts (the root, two replies, and one nested reply), so this marks eight nodes for removal (one likeCount and one displayName per post). Everything else survives.
Step 2 runs BFS from the root. All surviving nodes are reachable, because the pruned nodes are leaves. No additional removals.
Steps 3 and 4 are no-ops: every surviving node’s parent also survives, so no contraction or resolution is needed.
Step 5 is a no-op: the schema is a tree, not a hypergraph.
The output is the original instance tree, minus the likeCount and displayName leaf nodes. Every other node, edge, and value is preserved exactly.
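Steps 1 and 2 for this leaf-only pruning can be fused into one short traversal. The sketch below uses minimal stand-ins for the chapter's WTree (the `Node`/`Tree` shapes are hypothetical, for illustration):

```rust
use std::collections::{HashSet, VecDeque};

// minimal stand-ins for the instance tree
struct Node {
    anchor: u32,
    children: Vec<usize>,
}
struct Tree {
    nodes: Vec<Node>,
    root: usize,
}

/// steps 1 + 2: keep nodes whose anchor survives, then BFS from the
/// root through survivors; returns the set of kept node indices.
fn restrict_leaves(tree: &Tree, surviving_anchors: &HashSet<u32>) -> HashSet<usize> {
    let mut kept = HashSet::new();
    let mut queue = VecDeque::new();
    if surviving_anchors.contains(&tree.nodes[tree.root].anchor) {
        queue.push_back(tree.root);
    }
    while let Some(i) = queue.pop_front() {
        if !kept.insert(i) {
            continue;
        }
        for &c in &tree.nodes[i].children {
            if surviving_anchors.contains(&tree.nodes[c].anchor) {
                queue.push_back(c);
            }
        }
    }
    kept
}
```

On a three-node chain threadViewPost → postView → likeCount with likeCount's anchor pruned, the kept set is exactly the first two nodes, matching the trace above.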
Construct a schema migration where all five steps do nontrivial work. What properties must the schema have for step 5 to fire?
Start with a SQL-like schema that has hypergraph structure: a Person vertex connected to Address and Phone via a compound foreign key (a hyperedge). Add a ContactInfo wrapper vertex between Person and Address. The migration removes ContactInfo and drops Phone. Step 1 prunes ContactInfo and Phone nodes. Step 2 marks Address nodes as unreachable. Step 3 re-parents Address to Person. Step 4 resolves the new Person → Address edge label. Step 5 reconstructs the compound foreign key fan without the Phone column. The schema must be a hypergraph with an intermediate wrapper and a multigraph edge for all five steps to fire.
9.4 Functor restrict (the SQL path)
For tabular data, restriction takes a simpler form. Given a schema morphism \(F\colon S_1 \to S_2\) and an instance \(I\) over \(S_2\), the restricted instance is the precomposition \(\Delta_F(I)\).2
In concrete terms: if \(F\) drops a table, its rows vanish. If \(F\) drops a column, that column disappears from every row. If \(F\) renames a table, the rows are relabeled. No data is computed; everything is rearranged.
/// functor restriction: precompose an instance along a schema morphism.
fn restrict_functor<F, I>(morphism: &F, instance: &I) -> RestrictedInstance
where
    F: SchemaMorphism,
    I: Instance,
{
    let mut result = RestrictedInstance::new(morphism.source());
    for table in morphism.source().tables() {
        let target_table = morphism.map_table(table);
        for row in instance.rows(target_table) {
            let restricted_row = row.project(morphism.column_map(table));
            result.insert(table, restricted_row);
        }
    }
    result
}

Precomposition is always total: if \(F\) is well-typed and \(I\) is well-typed, the result is well-typed. There are no existence conditions to check.
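To make precomposition concrete, here is a toy, self-contained version over string-keyed tables. The shapes (`table_map`, `column_map`, string-valued rows) are simplifications invented for this sketch, not panproto's API:

```rust
use std::collections::HashMap;

type Row = HashMap<String, String>;

/// toy precomposition: for each source table, pull rows from the mapped
/// target table and keep only the mapped columns.
fn restrict(
    table_map: &HashMap<String, String>,       // source table -> target table
    column_map: &HashMap<String, Vec<String>>, // source table -> kept columns
    instance: &HashMap<String, Vec<Row>>,      // target instance
) -> HashMap<String, Vec<Row>> {
    let mut out = HashMap::new();
    for (src, tgt) in table_map {
        let cols = &column_map[src];
        let rows = instance.get(tgt).cloned().unwrap_or_default();
        let projected: Vec<Row> = rows
            .into_iter()
            .map(|r| {
                cols.iter()
                    .filter_map(|c| r.get(c).map(|v| (c.clone(), v.clone())))
                    .collect::<Row>()
            })
            .collect();
        out.insert(src.clone(), projected);
    }
    out
}
```

Dropping a column is just a projection per row; no row is computed, only rearranged, which is why totality is automatic.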
9.4.1 Adjoints: extend and product
The restrict functor sits in the middle of a triple.3
- **extend** (left adjoint \(\Sigma_F\)) pushes data forward from the old schema to the new, filling in new fields by taking unions across fibers. In SQL terms, this is roughly UNION across related rows.
- **restrict** (pullback \(\Delta_F\)) is the primary migration operator. Lossless with complement, reversible via lenses.
- **product** (right adjoint \(\Pi_F\)) computes the most conservative approximation by intersecting across fibers. In SQL terms, roughly JOIN across related rows.
panproto implements all three. restrict is the default for schema version migration. extend and product are available for cross-protocol data integration and algebraic data integration patterns.
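The union-versus-intersection intuition can be shown on toy set-valued instances over finite index sets, with a map f between indices given as a HashMap. Everything here is invented for illustration (and \(\Pi\) is only computed for indices in the image of f); it is not panproto's implementation:

```rust
use std::collections::{HashMap, HashSet};

type Inst = HashMap<u32, HashSet<String>>;

/// Δ_f: precompose — the value at a is the value at f(a)
fn delta(f: &HashMap<u32, u32>, i: &Inst) -> Inst {
    f.iter()
        .map(|(&a, b)| (a, i.get(b).cloned().unwrap_or_default()))
        .collect()
}

/// Σ_f: union over each fiber f⁻¹(b)
fn sigma(f: &HashMap<u32, u32>, j: &Inst) -> Inst {
    let mut out = Inst::new();
    for (a, &b) in f {
        let cell = out.entry(b).or_default();
        cell.extend(j.get(a).cloned().unwrap_or_default());
    }
    out
}

/// Π_f: intersection over each fiber f⁻¹(b)
fn pi(f: &HashMap<u32, u32>, j: &Inst) -> Inst {
    let mut fibers: HashMap<u32, Vec<u32>> = HashMap::new();
    for (&a, &b) in f {
        fibers.entry(b).or_default().push(a);
    }
    fibers
        .into_iter()
        .map(|(b, fiber)| {
            let mut it = fiber.iter().map(|a| j.get(a).cloned().unwrap_or_default());
            let first = it.next().unwrap_or_default();
            let meet = it.fold(first, |acc, s| acc.intersection(&s).cloned().collect());
            (b, meet)
        })
        .collect()
}
```

With two indices mapped to the same target, Σ collects everything either contributes (the UNION flavor) while Π keeps only what both agree on (the JOIN flavor).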
9.5 Performance
The pipeline targets more than one million records per second for simple projections (field additions and removals) running in WASM.
9.5.1 Compiled migrations
When a migration is loaded, panproto pre-computes everything it can. The surviving set becomes a bitset (step 1 reduces to a single lookup per node). Edge resolution becomes a flat lookup table keyed by the contracted path rather than just the (source_anchor, child_anchor) endpoint pair. Fan reconstruction becomes a pre-allocated mapping from old fan IDs to new fan IDs. The migration specification is compiled into a small program that the restrict pipeline executes.
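A surviving-set bitset is straightforward: one bit per schema vertex, with membership reduced to a shift and a mask. The type and field names below are a hypothetical sketch of the idea:

```rust
/// one bit per schema vertex; membership is a single word load + mask
struct SurvivingSet {
    bits: Vec<u64>,
}

impl SurvivingSet {
    fn new(num_vertices: usize) -> Self {
        SurvivingSet { bits: vec![0; (num_vertices + 63) / 64] }
    }
    fn insert(&mut self, v: u32) {
        self.bits[(v / 64) as usize] |= 1u64 << (v % 64);
    }
    fn contains(&self, v: u32) -> bool {
        (self.bits[(v / 64) as usize] & (1u64 << (v % 64))) != 0
    }
}
```

Compared with hashing an anchor id per node, the bitset keeps step 1's per-node cost to a couple of ALU operations on a word that is almost always in cache.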
9.5.2 Arena allocation
Instance trees are allocated in a bumpalo arena. Allocation is O(1) (a pointer increment, no free-list management), and deallocation is O(1) (the entire arena is freed at once). The output tree is allocated in a fresh arena, and surviving nodes are written sequentially, giving excellent cache locality.
9.5.3 Zero-copy fast path
For the common case—no contraction, no fan reconstruction, trivial edge resolution—the pipeline reduces to walking the input tree, checking the surviving bitset per node, and copying structural metadata (anchor, parent pointer, child list) to the output arena. The node’s payload (strings, integers, booleans) is not copied; it is referenced by pointer into the input arena.
In benchmarks on an M2 MacBook Air, this fast path processes approximately 3.2 million nodes per second for a threadViewPost-like schema with eight fields per post. The WASM target is somewhat slower (approximately 1.8M nodes/sec due to bounds checking and lack of SIMD) but still well above the million-node target.
9.5.4 Fused single-pass pipeline
The production implementation fuses steps 1 through 4 into a single BFS traversal. When we visit a node, its parent’s nearest surviving ancestor was already computed (BFS visits parents before children). If the current node survives, its ancestor is its parent’s ancestor (or the parent itself, if the parent survives). This reduces four passes to one, cutting the constant factor substantially. The individual step functions are retained for unit testing; the fused version must produce identical output for every input, which is verified by the property test suite.
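The fused traversal can be sketched in a few lines. The node shape is a hypothetical flattened form (survival already looked up from the bitset), and the sketch assumes the root survives; it covers steps 1 through 3, with step 4's lookup folded in at the same point in the real pipeline:

```rust
use std::collections::VecDeque;

// flattened node: survival precomputed, children by index
struct FlatNode {
    surviving: bool,
    children: Vec<usize>,
}

/// single BFS: new_parent[i] = Some(nearest surviving ancestor) for each
/// kept node; parents are visited before children, so the ancestor of the
/// current node's parent is already known when we reach it.
fn fused_restrict(nodes: &[FlatNode], root: usize) -> Vec<Option<usize>> {
    let mut new_parent = vec![None; nodes.len()];
    let mut queue = VecDeque::new();
    if nodes[root].surviving {
        new_parent[root] = Some(root); // root is its own parent
        queue.push_back((root, root)); // (node, nearest surviving ancestor)
    }
    while let Some((i, anc)) = queue.pop_front() {
        // a surviving node becomes the ancestor for its subtree;
        // a pruned node passes its own ancestor through
        let next_anc = if nodes[i].surviving { i } else { anc };
        for &c in &nodes[i].children {
            if nodes[c].surviving {
                new_parent[c] = Some(next_anc);
            }
            queue.push_back((c, next_anc));
        }
    }
    new_parent
}
```

Note that the traversal descends through pruned nodes too; that is what re-parents a stranded survivor past its pruned parent in the same pass.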
9.5.5 When it gets expensive
The pipeline slows down when contraction chains are deep (walking up to find the nearest surviving ancestor is O(depth)), when the resolver table is large (many parallel edges), or when fans are wide (hundreds of SQL columns). Pre-computation during migration compilation amortizes the per-instance cost in all three cases.
9.6 Further reading
The W-type perspective on tree-shaped data originates with Gambino and Hyland (2004). Abbott et al. (2005) developed the container framework that panproto uses to determine how a schema’s structure constrains its data’s tree shape. The graph-transformation approach to steps 3 and 4 follows Ehrig et al. (2006), and Drewes et al. (1997) extends it to hypergraph structures. The framework of migrating data by composing schema maps is due to Spivak (2012). Appendix A has the full formal treatment.
The next chapter shows how to define a new protocol from scratch, wiring up a custom pair of theories and registering it with the engine.
Each step has a precise algebraic meaning in the theory of polynomial functor algebras. The pipeline implements restriction of a polynomial functor algebra along a morphism of polynomial functors. The instance internals chapter in the developer guide covers the formal treatment.↩︎
Categorically, this is the pullback functor \(\Delta_F\) from Spivak (2012).↩︎