17 VCS Engine

The panproto-vcs crate implements schematic version control. Schemas are content-addressed objects stored in a commit DAG, with branches, three-way merge, and data lifting through history.

17.1 Module layout

The VCS engine is organized into focused modules:

crates/panproto-vcs/src/
  lib.rs           -- public API, re-exports
  error.rs         -- VcsError (thiserror, non_exhaustive)
  hash.rs          -- canonical serialization + blake3 content addressing
  object.rs        -- ObjectId, CommitObject, Object enum
  store.rs         -- Store trait, HeadState, ReflogEntry
  fs_store.rs      -- FsStore (.panproto/ on disk)
  mem_store.rs     -- MemStore (HashMap-backed, for tests + WASM)
  refs.rs          -- branches, tags, resolve_ref
  reflog.rs        -- append-only audit log of ref changes
  index.rs         -- staging area + validation pipeline
  dag.rs           -- merge base, path finding, log walk, compose_path
  merge.rs         -- three-way schema merge + conflict detection
  rebase.rs        -- replay commits onto a new base
  cherry_pick.rs   -- apply single commit's migration
  reset.rs         -- soft/mixed/hard HEAD reset
  stash.rs         -- save/restore working state
  bisect.rs        -- binary search for breaking commit
  blame.rs         -- schema element attribution
  auto_mig.rs      -- derive Migration from SchemaDiff
  rename_detect.rs -- heuristic rename detection between versions
  status.rs        -- HEAD vs index vs working file
  repo.rs          -- Repository orchestration (porcelain)
  gc.rs            -- garbage collection

17.2 Object store

ObjectId is a blake3 hash of the canonical MessagePack serialization. Canonical forms use BTreeMap instead of HashMap and exclude precomputed indices (derived data). The store contains:

Object::Schema(Schema): a schema snapshot
Object::Migration { src, tgt, mapping }: a morphism between two schemas
Object::Commit(CommitObject): a point in the DAG
Object::Tag(TagObject): an annotated tag storing target object ID, tagger name, timestamp, and message
Object::DataSet(DataSetObject): a content-addressed data snapshot bound to a schema version
Object::Complement(ComplementObject): complement from data migration for backward restoration
Object::Protocol(Protocol): a protocol definition
Object::Expr(Expr): a standalone expression
Object::EditLog(EditLogObject): a sequence of edits for incremental migration

A CommitObject stores the core schema history:

pub struct CommitObject {
    pub schema_id: ObjectId,
    pub parents: Vec<ObjectId>,       // 0=root, 1=normal, 2=merge
    pub migration_id: Option<ObjectId>,
    pub protocol: String,
    pub author: String,
    pub timestamp: u64,
    pub message: String,
    pub renames: Vec<SiteRename>,
    pub protocol_id: Option<ObjectId>,
    pub data_ids: Vec<ObjectId>,
    pub complement_ids: Vec<ObjectId>,
    pub edit_log_ids: Vec<ObjectId>,  // references to EditLogObject values
}

17.3 Configuration and options

MergeOptions controls merge behavior:

pub struct MergeOptions {
    pub no_commit: bool,   // stage result without committing
    pub ff_only: bool,     // fail if not fast-forwardable
    pub no_ff: bool,       // force merge commit even on fast-forward
    pub squash: bool,      // collapse into a single non-merge commit
    pub message: Option<String>,  // custom commit message
}

CherryPickOptions controls cherry-pick behavior: no_commit stages without committing, record_origin appends the source commit ID to the message.

GcOptions has a single dry_run field. When true, gc::gc reports unreachable objects without deleting them.

17.4 Dag algorithms

All DAG algorithms operate on the commit graph $G = (V, E)$ where $V$ is the set of commits and $E$ is the parent relation.

Merge base: Two-frontier BFS. Alternately expand from each side; the first vertex visited by both frontiers is the lowest common ancestor. For commits $a$ and $b$, this finds $\text{LCA}(a, b)$ in $O(|V|)$ worst case, typically much less.
Path finding: BFS backwards from target, reconstruct via parent map. Returns the sequence of commits $(c_0, c_1, \ldots, c_k)$ where $c_0$ is the source and $c_k$ is the target.
Log walk: Max-heap ordered by timestamp for topological-chronological order. Each commit is visited exactly once (tracked via a HashSet). Merge commits push both parents onto the heap. Complexity: $O(n \log n)$ for $n$ commits.
Compose path: Given a path $(c_0, \ldots, c_k)$, chain migrations via panproto_mig::compose. The composed migration $M_{0 \to k} = M_{k-1 \to k} \circ \cdots \circ M_{0 \to 1}$ transforms data from schema $S_0$ to schema $S_k$.
Is ancestor: BFS reachability check from candidate to target. Returns true if there exists a path in $G$ from the candidate to the target following parent edges.

Max-Heap Over DFS for Log Walk

DFS visits one branch fully before starting the other, producing non-chronological output for interleaved branches. The timestamp-ordered heap interleaves branches chronologically, matching what users expect from git log.

17.5 Repository operations

Beyond core commits, repo.rs exposes additional operations:

Repository::amend(): creates a commit with the same parent(s) as current HEAD but updated schema, migration, and message. The previous HEAD becomes unreachable (collectible by gc).
refs::force_delete_branch: deletes a branch regardless of merge status
refs::rename_branch: renames a branch ref, updating HEAD if the current branch is renamed
refs::create_and_checkout_branch: creates a branch at HEAD and switches to it atomically
refs::create_annotated_tag: creates a TagObject in the store and points a tag ref at it
refs::create_tag_force: overwrites an existing tag ref
stash::stash_apply: re-applies the top stash entry without removing it from the stack
stash::stash_show: returns the schema diff between the stash entry and its parent
stash::stash_clear: drops all stash entries and removes the refs/stash ref

17.6 Error handling

VcsError is #[non_exhaustive] and includes:

ObjectNotFound, RefNotFound, BranchNotFound, TagNotFound
BranchAlreadyExists, TagAlreadyExists
NotOnBranch, DetachedHead, IndexEmpty, IndexNotEmpty
MergeConflict, RebaseConflict, CherryPickConflict
FastForwardRequired, NothingToAmend
StashEmpty, InvalidRef, CorruptObject
RemoteNotSupported

17.7 Three-way merge

Given schemas base, ours, theirs, the merge computes a combined result. Categorically, this is a pushout (colimit) over the shared base.

The algorithm:

diff(base, ours) and diff(base, theirs): the extended diff covers all 13 schema fields (vertices, edges, constraints, hyper-edges, required, NSIDs, variants, orderings, recursion points, usage modes, spans, nominal)
For every element in every field, classify its fate on each side: unchanged, added, removed, or modified
Apply the merge rule: one-sided changes accepted; identical changes deduplicated; incompatible changes flagged as typed conflicts with the base value retained
No tie-breaking: the merge is commutative, meaning merge(base, A, B) == merge(base, B, A)
Rebuild precomputed indices
Derive migrations from ours to merged and theirs to merged via auto_mig::derive_migration

Commutativity in Merge

Commutativity means the merge result doesn’t depend on which branch is “ours” vs “theirs.” This eliminates bugs where merge order affects the outcome. It’s a stronger guarantee than git’s text merge, which can produce different conflict markers depending on merge direction.

17.8 Rename detection

The rename_detect module detects renames between schema versions. When auto_mig::derive_migration encounters a vertex absent in one schema and a structurally similar vertex in the other, the detector identifies this as a rename rather than a delete-and-add.

17.8.1 Algorithm

For vertex rename detection:

Collect removed vertices (in old, not in new) and added vertices (in new, not in old)
For each (removed, added) pair, compute structural similarity:
- Same vertex kind: +0.3
- Same outgoing edge names: +0.3
- Same incoming edge names: +0.2
- Edit distance of vertex ID names ≤ 3: +0.2
Greedy bipartite matching: take highest-scoring pairs first
Return pairs ≥ threshold

For edge label rename detection, the algorithm compares removed and added edges between the same surviving vertex pair, scoring by name edit distance normalized to string length.

17.8.2 Commitobject integration

Detected renames are stored in CommitObject.renames as Vec<SiteRename>. This field is #[serde(default, skip_serializing_if = "Vec::is_empty")], so old commits without renames deserialize correctly.

See Section 18.2 for the full naming and identity architecture.

# VCS Engine {#sec-vcs-engine} The `panproto-vcs` crate implements schematic version control. Schemas are content-addressed objects stored in a commit DAG, with branches, three-way merge, and data lifting through history. ## Module layout The VCS engine is organized into focused modules: ``` crates/panproto-vcs/src/ lib.rs -- public API, re-exports error.rs -- VcsError (thiserror, non_exhaustive) hash.rs -- canonical serialization + blake3 content addressing object.rs -- ObjectId, CommitObject, Object enum store.rs -- Store trait, HeadState, ReflogEntry fs_store.rs -- FsStore (.panproto/ on disk) mem_store.rs -- MemStore (HashMap-backed, for tests + WASM) refs.rs -- branches, tags, resolve_ref reflog.rs -- append-only audit log of ref changes index.rs -- staging area + validation pipeline dag.rs -- merge base, path finding, log walk, compose_path merge.rs -- three-way schema merge + conflict detection rebase.rs -- replay commits onto a new base cherry_pick.rs -- apply single commit's migration reset.rs -- soft/mixed/hard HEAD reset stash.rs -- save/restore working state bisect.rs -- binary search for breaking commit blame.rs -- schema element attribution auto_mig.rs -- derive Migration from SchemaDiff rename_detect.rs -- heuristic rename detection between versions status.rs -- HEAD vs index vs working file repo.rs -- Repository orchestration (porcelain) gc.rs -- garbage collection ``` ## Object store `ObjectId` is a blake3 hash of the canonical MessagePack serialization. Canonical forms use `BTreeMap` instead of `HashMap` and exclude precomputed indices (derived data). The store contains: - `Object::Schema(Schema)`: a schema snapshot - `Object::Migration { src, tgt, mapping }`: a morphism between two schemas - `Object::Commit(CommitObject)`: a point in the DAG - `Object::Tag(TagObject)`: an annotated tag storing target object ID, tagger name, timestamp, and message - `Object::DataSet(DataSetObject)`: a content-addressed data snapshot bound to a schema version - `Object::Complement(ComplementObject)`: complement from data migration for backward restoration - `Object::Protocol(Protocol)`: a protocol definition - `Object::Expr(Expr)`: a standalone expression - `Object::EditLog(EditLogObject)`: a sequence of edits for incremental migration A `CommitObject` stores the core schema history: ```rust pub struct CommitObject { pub schema_id: ObjectId, pub parents: Vec<ObjectId>, // 0=root, 1=normal, 2=merge pub migration_id: Option<ObjectId>, pub protocol: String, pub author: String, pub timestamp: u64, pub message: String, pub renames: Vec<SiteRename>, pub protocol_id: Option<ObjectId>, pub data_ids: Vec<ObjectId>, pub complement_ids: Vec<ObjectId>, pub edit_log_ids: Vec<ObjectId>, // references to EditLogObject values } ``` ## Configuration and options `MergeOptions` controls merge behavior: ```rust pub struct MergeOptions { pub no_commit: bool, // stage result without committing pub ff_only: bool, // fail if not fast-forwardable pub no_ff: bool, // force merge commit even on fast-forward pub squash: bool, // collapse into a single non-merge commit pub message: Option<String>, // custom commit message } ``` `CherryPickOptions` controls cherry-pick behavior: `no_commit` stages without committing, `record_origin` appends the source commit ID to the message. `GcOptions` has a single `dry_run` field. When true, `gc::gc` reports unreachable objects without deleting them. ## Dag algorithms All DAG algorithms operate on the commit graph $G = (V, E)$ where $V$ is the set of commits and $E$ is the parent relation. - **Merge base**: Two-frontier BFS. Alternately expand from each side; the first vertex visited by both frontiers is the lowest common ancestor. For commits $a$ and $b$, this finds $\text{LCA}(a, b)$ in $O(|V|)$ worst case, typically much less. - **Path finding**: BFS backwards from target, reconstruct via parent map. Returns the sequence of commits $(c_0, c_1, \ldots, c_k)$ where $c_0$ is the source and $c_k$ is the target. - **Log walk**: Max-heap ordered by timestamp for topological-chronological order. Each commit is visited exactly once (tracked via a `HashSet`). Merge commits push both parents onto the heap. Complexity: $O(n \log n)$ for $n$ commits. - **Compose path**: Given a path $(c_0, \ldots, c_k)$, chain migrations via `panproto_mig::compose`. The composed migration $M_{0 \to k} = M_{k-1 \to k} \circ \cdots \circ M_{0 \to 1}$ transforms data from schema $S_0$ to schema $S_k$. - **Is ancestor**: BFS reachability check from candidate to target. Returns true if there exists a path in $G$ from the candidate to the target following parent edges. ::: {.callout-note} ## Max-Heap Over DFS for Log Walk DFS visits one branch fully before starting the other, producing non-chronological output for interleaved branches. The timestamp-ordered heap interleaves branches chronologically, matching what users expect from `git log`. ::: ## Repository operations Beyond core commits, `repo.rs` exposes additional operations: - `Repository::amend()`: creates a commit with the same parent(s) as current HEAD but updated schema, migration, and message. The previous HEAD becomes unreachable (collectible by `gc`). - `refs::force_delete_branch`: deletes a branch regardless of merge status - `refs::rename_branch`: renames a branch ref, updating HEAD if the current branch is renamed - `refs::create_and_checkout_branch`: creates a branch at HEAD and switches to it atomically - `refs::create_annotated_tag`: creates a `TagObject` in the store and points a tag ref at it - `refs::create_tag_force`: overwrites an existing tag ref - `stash::stash_apply`: re-applies the top stash entry without removing it from the stack - `stash::stash_show`: returns the schema diff between the stash entry and its parent - `stash::stash_clear`: drops all stash entries and removes the `refs/stash` ref ## Error handling `VcsError` is `#[non_exhaustive]` and includes: - `ObjectNotFound`, `RefNotFound`, `BranchNotFound`, `TagNotFound` - `BranchAlreadyExists`, `TagAlreadyExists` - `NotOnBranch`, `DetachedHead`, `IndexEmpty`, `IndexNotEmpty` - `MergeConflict`, `RebaseConflict`, `CherryPickConflict` - `FastForwardRequired`, `NothingToAmend` - `StashEmpty`, `InvalidRef`, `CorruptObject` - `RemoteNotSupported` ## Three-way merge Given schemas `base`, `ours`, `theirs`, the merge computes a combined result. Categorically, this is a pushout (colimit) over the shared base. The algorithm: 1. `diff(base, ours)` and `diff(base, theirs)`: the extended diff covers all 13 schema fields (vertices, edges, constraints, hyper-edges, required, NSIDs, variants, orderings, recursion points, usage modes, spans, nominal) 2. For every element in every field, classify its fate on each side: unchanged, added, removed, or modified 3. Apply the merge rule: one-sided changes accepted; identical changes deduplicated; incompatible changes flagged as typed conflicts with the base value retained 4. No tie-breaking: the merge is commutative, meaning `merge(base, A, B) == merge(base, B, A)` 5. Rebuild precomputed indices 6. Derive migrations from ours to merged and theirs to merged via `auto_mig::derive_migration` ::: {.callout-tip} ## Commutativity in Merge Commutativity means the merge result doesn't depend on which branch is "ours" vs "theirs." This eliminates bugs where merge order affects the outcome. It's a stronger guarantee than git's text merge, which can produce different conflict markers depending on merge direction. ::: ## Rename detection The `rename_detect` module detects renames between schema versions. When `auto_mig::derive_migration` encounters a vertex absent in one schema and a structurally similar vertex in the other, the detector identifies this as a rename rather than a delete-and-add. ### Algorithm For vertex rename detection: 1. Collect removed vertices (in old, not in new) and added vertices (in new, not in old) 2. For each (removed, added) pair, compute structural similarity: - Same vertex kind: +0.3 - Same outgoing edge names: +0.3 - Same incoming edge names: +0.2 - Edit distance of vertex ID names ≤ 3: +0.2 3. Greedy bipartite matching: take highest-scoring pairs first 4. Return pairs ≥ threshold For edge label rename detection, the algorithm compares removed and added edges between the same surviving vertex pair, scoring by name edit distance normalized to string length. ### Commitobject integration Detected renames are stored in `CommitObject.renames` as `Vec<SiteRename>`. This field is `#[serde(default, skip_serializing_if = "Vec::is_empty")]`, so old commits without renames deserialize correctly. See @sec-naming-identity-section for the full naming and identity architecture. ## Differentiation from related work panproto-vcs distinguishes itself from semantic version control tools (AST-aware diffing for source code) and schema registries (which store versions linearly without branching). ### Versus semantic VCS tools (sem, cow, difftastic, mergiraf) These tools understand syntax but not the semantics of the data being versioned. sem and weave provide entity-level merge for source code but sit on top of git, not replacing it. Cow works on parse trees but can't understand typed schema structure. Mergiraf falls back to line-based merging when AST merge fails. panproto-vcs's merge always produces a result with explicit typed conflict descriptors. ### Versus algebraic VCS (darcs, pijul) Darcs uses syntactic patches (line insertions/deletions). Pijul uses byte-chunk graphs. Both lack the type information that panproto uses to distinguish "both branches renamed vertex X" (conflict) from "both branches added unrelated vertices" (clean merge). panproto can migrate instance data through history via lifting, which requires understanding the semantics of the objects being versioned. ### Versus schema migration tools (cambria, buf, atlas, flyway, liquibase) These manage schema evolution but aren't version control systems. They lack content-addressing, commit DAGs, branching, and structural merge. Cambria operates only on JSON; panproto works across 76 protocols. Atlas generates imperative migration scripts; panproto provides declarative, semantics-preserving lenses with lens law checking. ### Versus schema registries (confluent, schemaver, apicurio) Schema registries store versions linearly with no branching or merge. Versions are identified by sequential numbers rather than by their structure. panproto uses content-addressing: identity is derived from structure. ### The capability difference panproto-vcs combines: - **Structure understanding**: GAT models with protocol-specific sorts, operations, equations, and constraints - **Content-addressing**: blake3 hashing for structural identity - **Commit DAG**: full version control with branching and merging - **Structural merge**: pushout in the category of GATs, not heuristic - **Instance data migration**: lifting through history with round-trip fidelity via lens complements - **Cross-protocol**: the same merge algorithm works across 76 data formats This combination is unique. No other tool provides all six capabilities.

17.1 Module layout

17.2 Object store

17.3 Configuration and options

17.4 Dag algorithms

17.5 Repository operations

17.6 Error handling

17.7 Three-way merge

17.8 Rename detection

17.8.1 Algorithm

17.8.2 Commitobject integration

17.9 Differentiation from related work

17.9.1 Versus semantic VCS tools (sem, cow, difftastic, mergiraf)

17.9.2 Versus algebraic VCS (darcs, pijul)

17.9.3 Versus schema migration tools (cambria, buf, atlas, flyway, liquibase)

17.9.4 Versus schema registries (confluent, schemaver, apicurio)

17.9.5 The capability difference