8  panproto-inst: Instances and the Restrict Pipeline

A W-type (wellfounded tree) instance is a tree whose branching structure is prescribed by a schema. Positions are schema vertices; arities are determined by outgoing edges of each vertex. For a schema with \(\mathrm{Vertex} = \{A, B\}\) and \(\mathrm{Edge} = \{A \xrightarrow{e} B\}\), an instance is a tree where \(A\)-nodes have one child slot (via \(e\)) and \(B\)-nodes are leaves.1

panproto-inst is a Level 1 crate. For the user-facing walkthrough of the restrict pipeline, see the tutorial.

8.1 CompiledMigration

The CompiledMigration struct bridges panproto-mig (which compiles migration specifications) and panproto-inst (which executes them). It contains everything the restrict pipeline needs.

use crate::value::Value;

/// A compiled migration specification (minimal version for panproto-inst).
///
/// The full `CompiledMigration` lives in `panproto-mig`. This type provides
/// the subset of fields that `wtype_restrict` and `functor_restrict` need.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct CompiledMigration {
    /// Vertices that survive the migration.
    pub surviving_verts: HashSet<Name>,
    /// Edges that survive the migration.
    pub surviving_edges: HashSet<Edge>,
    /// Vertex remapping: source vertex ID to target vertex ID.
    pub vertex_remap: HashMap<Name, Name>,
    /// Edge remapping: source edge to target edge.
    pub edge_remap: HashMap<Edge, Edge>,
    /// Binary contraction resolver: (`src_anchor`, `tgt_anchor`) to resolved edge.
    pub resolver: HashMap<(Name, Name), Edge>,
    /// Hyper-edge contraction resolver.

The fields serve distinct roles:

  • surviving_verts and surviving_edges drive Step 1 (anchor survival).
  • vertex_remap maps source vertex IDs to target vertex IDs, used when building the restricted instance’s node set.
  • edge_remap maps source edges to target edges for edges that survive with potentially different metadata.
  • resolver handles the ambiguous case in Step 4: when ancestor contraction produces a (src, tgt) pair that has multiple edges in the target schema, the resolver provides the correct choice.
  • hyper_resolver provides analogous resolution for hyper-edge fans in Step 5.
Important

CompiledMigration in panproto-inst is a minimal subset of the full CompiledMigration in panproto-mig. The inst crate only imports the fields it needs, keeping the Level 1 dependency footprint small. If you need the full compilation pipeline, use panproto-mig directly.

8.2 WInstance

A WInstance is a tree-shaped data instance conforming to a schema. Nodes are keyed by numeric IDs, connected by arcs that carry the schema edge they correspond to.

    /// Value-level field transforms applied to surviving nodes' `extra_fields`.
    ///
    /// Keyed by source vertex anchor. Each entry is a list of field operations
    /// applied in order after the node survives and is remapped.
    #[serde(default, skip_serializing_if = "HashMap::is_empty")]
    pub field_transforms: HashMap<Name, Vec<FieldTransform>>,
    /// Value-dependent survival predicates.
    ///
    /// During `wtype_restrict`, after checking that a node's anchor vertex
    /// is in `surviving_verts`, the conditional survival predicate (if any)
    /// is evaluated with the node's `extra_fields` bound as variables.
    /// If the predicate evaluates to `false`, the node is dropped despite
    /// its anchor surviving.
    ///
    /// This enables value-dependent filtering: "keep this vertex only if
    /// attrs.level == 2" (matchAttrs), or "keep this vertex only if
    /// class contains 'u-url'" (matchAttrsAll).
    ///
    /// Categorically, this is a refinement of the survival predicate
    /// from a structural predicate (vertex set membership) to a
    /// value-dependent predicate (vertex set membership AND value predicate).
    #[serde(default, skip_serializing_if = "HashMap::is_empty")]

8.2.1 Node structure

Each Node (defined in inst/src/metadata.rs) carries:

  • id: u32: unique within the instance.
  • anchor: String: the schema vertex this node is an instance of.
  • value: Option<FieldPresence>: the data payload for leaf nodes (strings, integers, etc.). Interior nodes (objects, arrays) have None.
  • discriminator: Option<String>: for union-typed nodes, the $type tag.
  • extra_fields: HashMap<String, Value>: unrecognized JSON fields preserved for round-trip fidelity.

8.2.2 Arcs and trees

An arc (parent_id, child_id, Edge) connects a parent node to a child node along a schema edge. The tree invariant is enforced structurally: every non-root node has exactly one parent. The parent_map and children_map are precomputed in WInstance::new() from the arc list.

Tip

The children_map uses SmallVec<u32, 4> because most instance nodes have between 1 and 4 children. This avoids heap allocation for the common case while still supporting arbitrarily large fan-outs for array nodes.

8.3 RestrictState and the arena allocator

The restrict pipeline uses a RestrictState struct to carry intermediate results between steps without re-allocating:

  • surviving_nodes: Vec<u32>: node IDs that passed anchor filtering (Step 1) and reachability (Step 2).
  • ancestor_map: FxHashMap<u32, u32>: maps each surviving node to its nearest surviving ancestor (Step 3).
  • resolved_arcs: Vec<(u32, u32, Edge)>: the final arc list after edge resolution (Step 4).

All three collections are allocated once at pipeline entry and reused across steps. The surviving_nodes vec is sized to instance.node_count() as an upper bound; the ancestor map is sized to the surviving count from Step 2. This avoids per-step allocations that would otherwise dominate runtime on large instances.

CautionWhy not just allocate in each step?

For small instances (fewer than 100 nodes), per-step allocation is negligible. For large instances (100k+ nodes from format-specific parsers like HTML or CoNLL-U), the allocator overhead of five separate Vec/HashMap constructions becomes measurable. The fused single-pass design keeps allocation count constant regardless of instance size.

8.4 The 5-step restrict pipeline (fused single-pass)

The wtype_restrict function projects an instance along a compiled migration. Conceptually it’s five steps; the implementation fuses Steps 1 and 2 into a single BFS pass and Steps 3 and 4 into a single ancestor-walk pass.

flowchart TD
    I["Source WInstance"] --> S1
    S1["Step 1: anchor_surviving<br>(keep nodes whose anchor survives)"] --> S2
    S2["Step 2: reachable_from_root<br>(BFS from root in surviving subgraph)"] --> S3
    S3["Step 3: ancestor_contraction<br>(find nearest surviving ancestor)"] --> S4
    S4["Step 4: resolve_edge<br>(remap edges for contracted arcs)"] --> S5
    S5["Step 5: reconstruct_fans<br>(rebuild hyper-edge fans)"] --> O
    O["Target WInstance"]

    style S1 fill:#e8f0fe,stroke:#1a73e8
    style S2 fill:#e8f0fe,stroke:#1a73e8
    style S3 fill:#e8f0fe,stroke:#1a73e8
    style S4 fill:#e8f0fe,stroke:#1a73e8
    style S5 fill:#e8f0fe,stroke:#1a73e8
Figure 8.1: The 5-step wtype_restrict pipeline. Each step filters or transforms the instance, producing a progressively restricted version.

8.4.1 Step 1: signature restriction (anchor_surviving)

The first step selects the subset of instance nodes whose schema anchor vertex is retained by the migration. This is a pure filter: if a node’s anchor is in CompiledMigration::surviving_verts, it’s a candidate; otherwise it’s dropped.

    /// Unlike `ApplyExpr` which binds a single field, `ComputeField` binds
    /// all `extra_fields`, nested attrs, AND scalar values from immediate
    /// child nodes (the dependent-sum projection) as variables, evaluates
    /// the expression, and stores the result in the target field.
    ///
    /// This means `ComputeField` can access any scalar property of the
    /// parent object, whether it was parsed as an extra field or as a
    /// schema-defined child vertex (e.g., a string field with a `"format"`
    /// annotation like `"at-uri"`). Computed results are always written to
    /// `extra_fields`, making them available to subsequent transforms and

After this step, some surviving nodes may be orphaned (their parent was dropped) or disconnected (reachable only through dropped intermediaries). Steps 2 and 3 handle these cases.

8.4.2 Step 2: reachability BFS (reachable_from_root)

Not all anchor-surviving nodes are necessarily reachable from the root. Step 2 performs a breadth-first search starting from the root (if it survived Step 1) through children that are also in the candidate set.

    ///   `PutGet` holds for modifications to the computed field.
    /// - `Opaque`: no inverse exists; the complement stores the entire
    ///   original value. Modifications to the computed field in the view
    ///   are not independently round-trippable. This is analogous to SQL
    ///   computed columns: the lens law holds for the independent
    ///   (non-derived) components of the view, and the derived components
    ///   are re-computed deterministically.
    ///
    /// This enables template name computation like
    ///   `target_key`: "name",
    ///   `expr`: `(concat "h" (int_to_str attrs.level))`
    /// which computes "h1", "h2", etc. from the level attribute, as well
    /// as AT-URI decomposition where the `repo` field is a schema-defined
    /// child vertex.
    ComputeField {
        /// The field to store the computed result in.
        target_key: String,
        /// The expression, with all `extra_fields` bound as variables.
        expr: panproto_expr::Expr,
        /// Optional inverse expression for round-tripping.
        inverse: Option<panproto_expr::Expr>,
        /// Round-trip classification of this transformation.

If the root itself doesn’t survive anchor restriction, the entire instance is pruned and wtype_restrict returns RestrictError::RootPruned. An instance without a root isn’t a valid tree.

Note

The BFS traversal follows the instance tree’s parent-child arcs, not the schema graph’s edges. A node is reachable if there’s a path of surviving nodes from the root to it through the instance tree.

8.4.3 Step 3: ancestor contraction (ancestor_contraction)

When intermediate nodes are pruned, their children need to be re-attached to the nearest surviving ancestor. Step 3 walks up the parent chain for each surviving non-root node to find that ancestor.

    /// `Π(x : Value). FieldTransform`, a transform that depends on the
    /// runtime value of the node, not just its schema vertex. It composes
    /// naturally with all other transform variants (including nesting
    /// inside `PathTransform`).
    ///
    /// Use cases:
    /// - `matchAttrs`: "if `level == 1` then rename to `h1`, if `level == 2`
    ///   then rename to `h2`", where each heading level is a branch.
    /// - Conditional attribute injection: "if `list == 'ordered'` then add
    ///   `type: ol`, else add `type: ul`".
    Case {
        /// Ordered branches: first matching predicate wins.
        branches: Vec<CaseBranch>,
    },
    /// Update string values that reference vertex names.
    ///
    /// When vertices are renamed or dropped during migration, string fields
    /// that reference those vertices by name must be updated to reflect the
    /// new names. This is the functorial action of the vertex rename map
    /// on the name-reference algebra.
    ///
    /// For each field value:
    /// - If the value is a `Value::Str` matching a key in `rename_map`,
    ///   it is replaced with the mapped value (or removed if mapped to None).
    /// - If the value is a `Value::Unknown` containing an `__array_len`
    ///   sentinel (encoded array), each string element is checked.
    ///
    /// This handles parent reference arrays, cross-annotation links,
    /// and any other string fields that carry vertex identity.
    MapReferences {
        /// The field containing references (e.g., "parents").
        field: String,
        /// Map from old name to new name (None = remove the reference).
        rename_map: HashMap<String, Option<String>>,
    },
}

impl FieldTransform {
    /// Compute the coercion class of this field transform.
    ///
    /// The class describes the round-trip properties: whether the transform
    /// is lossless (`Iso`), has a left inverse (`Retraction`), is a
    /// deterministic derivation (`Projection`), or has no structural
    /// round-trip property (`Opaque`).

The result is a map from each surviving node to its nearest surviving ancestor. This map drives the arc construction in Step 4: instead of the original (parent, child) arc, the restricted instance will have an (ancestor, child) arc.

The function includes cycle detection as a safety measure. In a well-formed tree instance, cycles in the parent chain can’t occur (the tree invariant forbids them), but the guard prevents infinite loops if an invalid instance is passed in.

8.4.4 Step 4: edge resolution (resolve_edge)

Each contracted arc needs a schema edge in the target schema. Step 4 resolves the edge for each (ancestor, child) pair by:

  1. Consulting the CompiledMigration::resolver map first (for cases where the migration author explicitly specified which edge to use).
  2. Falling back to a unique-edge heuristic: if there’s exactly one edge between the ancestor’s and child’s remapped anchor vertices in the target schema, use it.
  3. Failing with RestrictError::AmbiguousEdge if multiple edges exist and no resolver entry was provided.
                .fold(panproto_gat::CoercionClass::Iso, |acc, t| {
                    acc.compose(t.coercion_class())
                }),
        }
    }
}

/// A branch in a [`FieldTransform::Case`] analysis.
///
/// Contains a predicate expression and a sequence of transforms to apply
/// if the predicate evaluates to `true`.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct CaseBranch {
    /// Predicate evaluated with the node's `extra_fields` as variables.
    pub predicate: panproto_expr::Expr,
    /// Transforms to apply if the predicate is true.
    pub transforms: Vec<FieldTransform>,
}

impl CompiledMigration {
    /// Compute the composite coercion class of all field transforms in this migration.
    ///
    /// Folds over every transform across all vertices using `CoercionClass::compose`,
    /// starting from `Iso` (the identity element).
    #[must_use]
    pub fn coercion_class(&self) -> panproto_gat::CoercionClass {
        self.field_transforms
            .values()

The resolver is populated during migration compilation by panproto-mig. In practice, ambiguity is rare. Most vertex pairs in a schema are connected by at most one edge. When it does occur (e.g., a vertex has both a "prop" and a "ref" edge to the same target), the migration specification must include an explicit resolution.

8.4.5 Step 5: fan reconstruction (reconstruct_fans)

For schemas that use hyper-edges, the final step rebuilds the hyper-edge fans. A fan is an instance-level manifestation of a hyper-edge: it connects a parent node to multiple child nodes via labeled slots.

When some children of a fan are pruned, the fan must be rebuilt with only the surviving children. If the hyper_resolver provides a mapping, the fan is remapped to a new hyper-edge ID with relabeled slots. Otherwise, the original fan is kept with its surviving children.

        self.field_transforms
            .entry(Name::from(vertex))
            .or_default()
            .push(FieldTransform::RenameField {
                old_key: old_key.to_owned(),
                new_key: new_key.to_owned(),
            });
    }

    /// Add a field drop transform for a vertex.
    ///
    /// The field `key` is removed from the node's `extra_fields`.
    pub fn add_field_drop(&mut self, vertex: &str, key: &str) {
        self.field_transforms
            .entry(Name::from(vertex))
            .or_default()
            .push(FieldTransform::DropField {
                key: key.to_owned(),
            });
    }

    /// Add a field with a default value for a vertex.
    ///
    /// The field `key` is added to `extra_fields` with the given value
    /// if it does not already exist.
    pub fn add_field_default(&mut self, vertex: &str, key: &str, value: Value) {
        self.field_transforms
            .entry(Name::from(vertex))
            .or_default()
            .push(FieldTransform::AddField {
                key: key.to_owned(),
                value,
            });
    }

    /// Add a keep-fields transform for a vertex.
    ///
    /// Only the specified fields are retained in `extra_fields`;
    /// all others are dropped.
    pub fn add_field_keep(&mut self, vertex: &str, keys: &[&str]) {
        self.field_transforms
            .entry(Name::from(vertex))
            .or_default()
            .push(FieldTransform::KeepFields {
                keys: keys.iter().map(|k| (*k).to_owned()).collect(),
            });
    }

    /// Add an expression transform for a field on a vertex.
    ///

Fans whose parent was pruned are dropped entirely. Fans with no surviving children are also dropped.

CautionWhen does fan reconstruction actually run?

Only for protocols whose schema theory includes the HyperEdge sort (Group B protocols: SQL, Parquet, Cassandra, etc.). For Group A protocols (JSON Schema, AT Protocol, OpenAPI), there are no hyper-edges and Step 5 is a no-op. The cost is a single emptiness check on the fan list.

8.5 Functor instances (functor_restrict)

panproto-inst also provides an alternative instance representation for relational (tabular) data: FInstance, a set-valued table. Each schema vertex maps to a table (a Vec of row maps), and each edge maps to foreign-key relationships.

The restrict operation for functor instances is simpler than the W-type pipeline: for each vertex in the target schema, pull the corresponding table from the source instance via the migration’s vertex map.2

Tip

Use FInstance and functor_restrict for relational protocols (SQL, MongoDB). Use WInstance and wtype_restrict for document protocols (AT Protocol, JSON Schema, OpenAPI). The choice is determined by the protocol’s instance_theory field.

8.6 Json round-trip (parse_json / to_json)

The parse module provides schema-guided JSON parsing and serialization:

  • parse_json(schema, root_vertex, json_value): recursively walks a JSON value, creating a WInstance tree. At each level, it consults the schema’s outgoing edges to determine which JSON fields correspond to child nodes and which are leaf values.
  • to_json(schema, instance): reconstructs JSON from a WInstance by walking the tree from the root, using edge names as property keys.

The round-trip preserves:

  • Structural fidelity: every schema-recognized field is correctly nested.
  • Extra fields: JSON properties not recognized by the schema are stored in Node::extra_fields and re-emitted on serialization.
  • Discriminators: $type tags on union-typed objects are preserved via Node::discriminator.

8.7 Validation axioms (i1 through i7)

The validate module checks that a WInstance satisfies seven structural axioms:

Table 8.1: The seven validation axioms.
Axiom Check Description
I1 check_anchors All node anchors exist in the schema
I2 check_edges All arc edges exist in the schema
I3 check_root Root node exists in the node set
I4 check_reachability All nodes reachable from root via arcs
I5 check_required_edges Required edges present for each node
I6 check_parent_map Parent map consistent with arc list
I7 check_fans Fan parent and children exist; hyper-edge in schema

validate_wtype(schema, instance) returns a Vec<ValidationError>. An empty vector means the instance is valid. Collecting all violations (rather than failing on the first) supports diagnostic tooling: the CLI can report all problems at once.

Important

The restrict pipeline is designed to preserve axioms I1 through I7. If the source instance is valid and the compiled migration is well-formed, the restricted instance is guaranteed to be valid with respect to the target schema.

8.7.1 Axiom preservation through the pipeline

Each step of the restrict pipeline maintains or establishes specific axioms:

  • Step 1 (anchor_surviving) ensures that all surviving nodes have valid anchors in the target schema (establishing I1 for the restricted set).
  • Step 2 (reachable_from_root) establishes I4: only nodes reachable from the root survive.
  • Step 3 (ancestor_contraction) re-establishes the tree structure after intermediate nodes are pruned.
  • Step 4 (resolve_edge) establishes I2: all new arcs correspond to edges in the target schema.
  • Step 5 (reconstruct_fans) establishes I7: all fans reference valid hyper-edges and nodes.

I3 (root exists) is checked explicitly. If the root is pruned, the pipeline returns an error rather than producing an invalid instance.

8.8 The acsetops trait

The AcsetOps trait provides a unified interface for operations across all three instance shapes: WInstance, FInstance, and GInstance.

8.8.1 Trait definition

pub trait AcsetOps: Clone + std::fmt::Debug {
    fn restrict(
        &self,
        src_schema: &Schema,
        tgt_schema: &Schema,
        migration: &CompiledMigration,
    ) -> Result<Self, RestrictError>;

    fn extend(
        &self,
        tgt_schema: &Schema,
        migration: &CompiledMigration,
    ) -> Result<Self, RestrictError>;

    fn element_count(&self) -> usize;

    fn shape_name(&self) -> &'static str;
}

The four methods cover the core operations:

  • restrict: project an instance along a compiled migration. Each shape delegates to its specialized implementation (wtype_restrict, functor_restrict, graph_restrict).3
  • extend: push an instance forward along a migration, filling in new fields with defaults or computed values.4
  • element_count: returns the number of top-level elements (nodes for tree and graph instances, tables for functor instances).
  • shape_name: returns a static string identifying the shape ("wtype", "functor", or "graph").

8.8.2 Implementations

Table 8.2: AcsetOps delegation by instance shape.
Shape restrict delegates to extend delegates to element_count returns
WInstance wtype_restrict wtype_extend node_count()
FInstance functor_restrict functor_extend table_count()
GInstance graph_restrict graph_extend node_count()

Note that FInstance::restrict and GInstance::restrict don’t use the src_schema and tgt_schema parameters. Their restrict operations depend only on the compiled migration. The schema parameters are required by WInstance::restrict for the ancestor contraction and fan reconstruction steps. The trait uses the most general signature to accommodate all shapes.

8.8.3 The instance Enum

pub enum Instance {
    WType(WInstance),
    Functor(FInstance),
    Graph(GInstance),
}

Instance dispatches restrict, extend, element_count, and shape_name to the inner type’s AcsetOps implementation. This is the type used by the CLI’s schema lift command and by the VCS layer when storing instances in the object store.

8.8.4 Ginstance: graph-shaped instances

The GInstance type handles the most general instance shape: a directed graph of nodes and edges with no distinguished root and cycles allowed. Both WInstance (trees) and FInstance (tables) are special cases.

GInstance is the natural instance theory for knowledge graphs (RDF, OWL, JSON-LD), property graphs (Neo4j), and dependency graphs. Its graph_restrict operation is simpler than the 5-step tree pipeline: it keeps nodes whose anchor survives the migration (remapping the anchor) and keeps edges whose endpoints both survive and whose schema edge survives or has a remap entry.

Tip

Use GInstance when your protocol’s data is inherently graph-shaped (no natural root node, cycles expected). For tree-structured data (JSON, XML), use WInstance. For tabular data (SQL, CSV), use FInstance. The AcsetOps trait ensures that the migration engine works identically regardless of which shape you choose.

8.9 Polynomial operations and fibers

panproto-inst also provides polynomial functor operations in poly.rs and hom.rs. These include fiber decomposition (preimage of a migration at a target anchor), section construction (enriching a base instance with annotation data), group-by (partitioning by migration image), join (pullback of two instances along shared projections), and the internal hom (a schema whose instances represent migrations). The restrict_with_complement function in poly.rs extends the restrict pipeline to track dropped nodes and arcs, enabling fiber_at_node queries. For full details, see Chapter 25.

8.10 Design rationale

8.10.1 Why separate node ids from anchors

Node IDs are u32 values. Anchors are String schema vertex IDs. Multiple nodes can share the same anchor (e.g., an array of strings all anchored to the same "items" vertex). Keeping IDs numeric and anchors semantic gives \(O(1)\) node lookup by ID while preserving the schema-level meaning through the anchor.

8.10.2 Why precomputed parent and children maps

The parent_map and children_map are computed once in WInstance::new() and never updated. This is safe because WInstance is immutable after construction. The restrict pipeline creates a new WInstance rather than mutating the source.

8.10.3 Why five steps instead of one

Decomposing the restrict algorithm into five independent functions serves two purposes:

  1. Testability: each step can be unit-tested in isolation with small, purpose-built instances. The test suite in wtype.rs tests each step individually before testing the composed pipeline.
  2. Debuggability: when a migration produces unexpected results, you can inspect the output of each step to locate the problem. The CLI’s --verbose mode logs intermediate results at each step.

The five-step decomposition also mirrors the mathematical structure: signature restriction (Step 1) selects surviving elements, reachability (Step 2) enforces the wellfoundedness condition, contraction (Step 3) computes re-parenting, edge resolution (Step 4) assigns edges to contracted arcs, and fan reconstruction (Step 5) handles multi-arity structure.5

Abbott, Michael, Thorsten Altenkirch, and Neil Ghani. 2005. “Containers: Constructing Strictly Positive Types.” Theoretical Computer Science 342 (1): 3–27. https://doi.org/10.1016/j.tcs.2005.06.002.
Gambino, Nicola, and J. Martin E. Hyland. 2004. “Wellfounded Trees and Dependent Polynomial Functors.” Types for Proofs and Programs (TYPES 2003), Lecture notes in computer science, vol. 3085: 210–25. https://doi.org/10.1007/978-3-540-24849-1_14.
Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217: 31–51. https://arxiv.org/abs/1009.1166.

  1. The tree representation is formally a W-type from dependent type theory (Gambino and Hyland 2004). The restrict pipeline corresponds to a morphism of containers (Abbott et al. 2005), using the polynomial functor framework.↩︎

  2. Categorically, this is the pullback functor \(\Delta_F\) (Spivak 2012). It composes correctly by construction: restricting along a composed migration is the same as restricting twice.↩︎

  3. Categorically, this is the pullback functor \(\Delta_F\) (Spivak 2012).↩︎

  4. Categorically, this is the left Kan extension \(\Sigma_F\) (Spivak 2012).↩︎

  5. In categorical terms, Step 1 is the pullback on objects, Step 3 is the pushforward on the tree, and Step 4 lifts the morphism to arcs.↩︎