22 Data Versioning Internals
The panproto-vcs object model extends beyond schemas and migrations to include data snapshots, complements, and protocol definitions as first-class versioned objects. For the user-facing workflow, see the tutorial’s data versioning chapter.
22.1 Object model: DataSetObject, ComplementObject, Protocol
File: crates/panproto-vcs/src/objects.rs
Three new variants join the Object enum:
pub enum Object {
Schema(Schema),
Migration(CompiledMigration),
Commit(CommitObject),
Tag(TagObject),
DataSet(DataSetObject),
Complement(ComplementObject),
Protocol(Protocol),
}

22.1.1 DataSetObject
A DataSetObject is a content-addressed snapshot of instance data bound to a specific schema version:
pub struct DataSetObject {
pub schema_id: ObjectId,
pub record_count: usize,
pub data: Vec<u8>, // MessagePack-encoded instances
}

The data field stores all instances as a single MessagePack array. Each element is a serialized WInstance or FInstance. The schema_id links the snapshot to the schema these instances conform to.
Content addressing means two snapshots with identical data and schema produce the same ObjectId. This provides deduplication: if the same records are staged multiple times against the same schema, only one object is stored.
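The deduplication property can be sketched with stand-in types; here a std hasher and u64 ids stand in for blake3 and ObjectId, and the struct fields mirror DataSetObject, so only the dedup behavior is illustrative, not the real hashing:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for the real DataSetObject; the u64 id stands in for ObjectId.
#[derive(Hash)]
struct DataSetObject {
    schema_id: u64,
    record_count: usize,
    data: Vec<u8>,
}

// Derive a content address from the snapshot's fields.
fn object_id(ds: &DataSetObject) -> u64 {
    let mut h = DefaultHasher::new();
    ds.hash(&mut h);
    h.finish()
}

// Stage a snapshot: identical content hashes to the same key, so
// re-staging the same records against the same schema stores nothing new.
fn put(store: &mut HashMap<u64, Vec<u8>>, ds: &DataSetObject) -> u64 {
    let id = object_id(ds);
    store.entry(id).or_insert_with(|| ds.data.clone());
    id
}

fn main() {
    let mut store = HashMap::new();
    let a = DataSetObject { schema_id: 1, record_count: 2, data: vec![9, 9] };
    let b = DataSetObject { schema_id: 1, record_count: 2, data: vec![9, 9] };
    let id_a = put(&mut store, &a);
    let id_b = put(&mut store, &b);
    assert_eq!(id_a, id_b); // same data + schema => same id
    assert_eq!(store.len(), 1); // only one object stored
}
```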
22.1.2 ComplementObject
A ComplementObject stores the data lost during a forward migration step:
pub struct ComplementObject {
pub migration_id: ObjectId,
pub data_id: ObjectId,
pub complement: Vec<u8>, // MessagePack-encoded complement
}

- migration_id: the ObjectId of the Migration object that produced this complement.
- data_id: the ObjectId of the DataSetObject the complement was extracted from.
- complement: the serialized complement data (field values removed by RemoveVertex, edge data removed by RemoveEdge, etc.).
The complement’s content hash depends on all three fields. Two different migrations applied to the same data produce different complement objects even if the complement bytes happen to be identical, because the migration_id differs.
22.1.3 Protocol variant
The Protocol variant stores a complete panproto_protocols::Protocol value. This is the same struct used throughout the codebase; no wrapper or subset. Serialization uses the serde implementation with MessagePack encoding.
22.2 CommitObject extensions
File: crates/panproto-vcs/src/objects.rs
CommitObject gains three fields:
pub struct CommitObject {
pub schema_id: ObjectId,
pub parent_ids: Vec<ObjectId>,
pub migration_id: Option<ObjectId>,
pub message: String,
pub author: String,
pub timestamp: u64,
pub renames: Vec<SiteRename>,
// New fields:
pub protocol_id: Option<ObjectId>,
pub data_ids: Vec<ObjectId>,
pub complement_ids: Vec<ObjectId>,
}

- protocol_id: optional reference to a Protocol object. When present, this commit is pinned to a specific protocol version. Validation uses this protocol rather than the ambient protocol.
- data_ids: references to DataSetObject values staged with this commit. A commit can have zero or more data snapshots (one per data directory).
- complement_ids: references to ComplementObject values generated during data migration. These are stored at commit time so that backward migration can retrieve them.
These fields are not marked serde(default), so existing serialized commits will fail to deserialize. This is a breaking change: repositories created before this version must be re-initialized.
22.3 Hash functions: canonical forms for data types
File: crates/panproto-vcs/src/hash.rs
Content addressing requires a canonical serialization. The hash_object function dispatches on the Object variant. Three arms handle these types:
Object::DataSet(ds) => {
let mut buf = Vec::new();
buf.extend_from_slice(b"dataset\0");
buf.extend_from_slice(&ds.schema_id.0);
rmp_serde::encode::write(&mut buf, &ds.data)?;
blake3::hash(&buf)
}
Object::Complement(c) => {
let mut buf = Vec::new();
buf.extend_from_slice(b"complement\0");
buf.extend_from_slice(&c.migration_id.0);
buf.extend_from_slice(&c.data_id.0);
rmp_serde::encode::write(&mut buf, &c.complement)?;
blake3::hash(&buf)
}
Object::Protocol(p) => {
let mut buf = Vec::new();
buf.extend_from_slice(b"protocol\0");
rmp_serde::encode::write(&mut buf, p)?;
blake3::hash(&buf)
}

Each canonical form starts with a type tag (dataset\0, complement\0, protocol\0) to prevent hash collisions between objects of different types that happen to have the same byte content. This follows the same pattern used by Schema, Migration, and Commit.
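The effect of the type tag can be demonstrated with a std hasher standing in for blake3 (the tagged_hash helper is illustrative, not part of hash.rs):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Domain-separated hashing: prefix the payload with a type tag before
// hashing, so equal payload bytes under different tags get different ids.
fn tagged_hash(tag: &[u8], payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(tag);
    h.write(payload);
    h.finish()
}

fn main() {
    let payload = b"same bytes";
    let as_dataset = tagged_hash(b"dataset\0", payload);
    let as_protocol = tagged_hash(b"protocol\0", payload);
    // Identical payload, different object types: the tag keeps ids distinct.
    assert_ne!(as_dataset, as_protocol);
}
```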
22.4 data_mig.rs: migrate_forward/backward algorithms
File: crates/panproto-vcs/src/data_mig.rs
22.4.1 migrate_forward
migrate_forward takes a DataSetObject, a CompiledMigration, and the store, and returns a new DataSetObject plus a ComplementObject:
- Deserialize the MessagePack instances from data.data.
- For each instance, apply the migration's lift_wtype (or lift_functor for functor instances), capturing the complement.
- Collect all complements into a single MessagePack array.
- Construct the new DataSetObject with the target schema's ObjectId and the migrated instances.
- Construct the ComplementObject with the migration's ObjectId, the original data's ObjectId, and the complement bytes.
- Store both objects and return their ObjectId values.
If the migration is lossless (complement is empty for every instance), no ComplementObject is created. The function returns None for the complement ID.
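The per-instance loop can be sketched with stand-in types: an instance here is a list of (field, value) pairs and the migration is a single "remove field" step, with the removed values captured as the complement. The names Instance and migrate_forward's shape are illustrative, not the real data_mig API:

```rust
// Toy instance model: a flat list of (field, value) pairs.
type Instance = Vec<(String, i64)>;

// Apply a lossy "remove field" step to every instance; return the
// migrated instances and one complement per instance holding the
// removed data.
fn migrate_forward(
    instances: &[Instance],
    field: &str,
) -> (Vec<Instance>, Vec<Vec<(String, i64)>>) {
    let mut migrated = Vec::new();
    let mut complements = Vec::new();
    for inst in instances {
        let (removed, kept): (Vec<_>, Vec<_>) =
            inst.iter().cloned().partition(|(f, _)| f.as_str() == field);
        migrated.push(kept);
        complements.push(removed); // data lost by this step
    }
    // A lossless step would leave every complement empty; the real code
    // then skips creating a ComplementObject entirely.
    (migrated, complements)
}

fn main() {
    let data: Vec<Instance> =
        vec![vec![("id".to_string(), 1), ("age".to_string(), 30)]];
    let (migrated, complements) = migrate_forward(&data, "age");
    assert_eq!(migrated[0], vec![("id".to_string(), 1)]);
    assert_eq!(complements[0], vec![("age".to_string(), 30)]);
}
```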
22.4.2 migrate_backward
migrate_backward takes a DataSetObject, a ComplementObject, and a CompiledMigration:
- Invert the migration via panproto_mig::invert.
- Deserialize both the data instances and the complement instances.
- For each (instance, complement) pair, apply the inverted migration's lift_wtype with the complement data restored.
- Construct a new DataSetObject bound to the source schema.
Backward migration fails with Error::MissingComplement if the complement object can’t be found in the store. Lossless steps don’t need complements; the inverted migration can reconstruct the original data without external input.
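The restore step can be sketched under the same toy model as the forward direction: the inverted step merges each stored complement back into its instance, and a missing complement surfaces as an error (modeled here as an Option; the names are illustrative, not the real data_mig API):

```rust
// Toy instance model: a flat list of (field, value) pairs.
type Instance = Vec<(String, i64)>;

// Restore each instance by merging its stored complement back in.
// A missing complement for a lossy step maps to Error::MissingComplement
// in the real code; here it is a plain error string.
fn migrate_backward(
    migrated: &[Instance],
    complements: Option<&[Vec<(String, i64)>]>,
) -> Result<Vec<Instance>, &'static str> {
    let complements = complements.ok_or("MissingComplement")?;
    Ok(migrated
        .iter()
        .zip(complements)
        .map(|(inst, comp)| {
            let mut restored = inst.clone();
            restored.extend(comp.iter().cloned()); // put the lost data back
            restored
        })
        .collect())
}

fn main() {
    let migrated: Vec<Instance> = vec![vec![("id".to_string(), 1)]];
    let complements = vec![vec![("age".to_string(), 30)]];
    let restored = migrate_backward(&migrated, Some(&complements)).unwrap();
    assert_eq!(restored[0].len(), 2); // the removed field is back
    assert!(migrate_backward(&migrated, None).is_err());
}
```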
22.4.3 detect_staleness
detect_staleness takes a list of DataSetObject IDs and the current HEAD commit’s schema ID:
- For each data set, compare data.schema_id against the HEAD schema ID.
- If they differ, find the DAG path from data.schema_id to the HEAD schema.
- Return a StalenessReport for each stale data set: the path length, the transform types along the path, and whether any step requires a complement.
The function doesn’t migrate data. It’s a read-only diagnostic.
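The comparison step can be sketched as a pure function over schema ids; the StalenessReport here is reduced to the mismatched schema pair, where the real report also carries the path details:

```rust
// Reduced report: just the mismatched ids (u64 stands in for ObjectId).
#[derive(Debug, PartialEq)]
struct StalenessReport {
    data_schema: u64,
    head_schema: u64,
}

// Read-only diagnostic: report every data set whose schema id differs
// from the HEAD schema id, without touching any data.
fn detect_staleness(data_schemas: &[u64], head_schema: u64) -> Vec<StalenessReport> {
    data_schemas
        .iter()
        .filter(|&&s| s != head_schema)
        .map(|&s| StalenessReport { data_schema: s, head_schema })
        .collect()
}

fn main() {
    let reports = detect_staleness(&[3, 7, 7], 7);
    assert_eq!(reports.len(), 1); // only the schema-3 data set is stale
}
```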
22.4.4 migrate_through_path
migrate_through_path composes migrate_forward along a DAG path:
- Walk the commit path from the data's schema version to the target commit.
- At each step, extract the migration from the commit and call migrate_forward.
- Chain the resulting DataSetObject into the next step's input.
- Collect all ComplementObject IDs for storage in the final commit.
For backward segments of a path (when the target is an ancestor), the function uses migrate_backward with the stored complements.
If a required complement can't be found for a backward step, the function returns Error::MissingComplement with the migration ID and data set ID. This can happen if someone ran schema gc and the complement was collected. The remedy is to re-migrate forward from an older snapshot that still has its complements.
22.5 Complement lifecycle: creation, storage, retrieval, gc
Complements follow the same lifecycle as other objects in the store:
- Creation: migrate_forward produces a ComplementObject and writes it to the store.
- Storage: the ComplementObject's ObjectId is recorded in the CommitObject's complement_ids field.
- Retrieval: migrate_backward looks up the complement by walking the commit's complement_ids and matching on migration_id and data_id.
- GC: gc::mark_sweep marks complements reachable from any commit in the reflog. Unreachable complements are deleted.
Complements are only generated for lossy migrations. A chain of five rename steps produces zero complement objects. A single RemoveVertex step produces one complement object per data set. Storage cost is proportional to the data actually lost, not the number of migration steps.
The retrieval algorithm in migrate_backward is:
- Find the commit whose data_ids contains the current data set's ObjectId.
- Scan that commit's complement_ids for a ComplementObject whose migration_id matches the step's migration and whose data_id matches the input data set.
- If found, deserialize and use. If not found and the step is lossy, return Error::MissingComplement.
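The matching scan can be sketched as follows, with u64 ids standing in for ObjectId and find_complement as an illustrative helper name:

```rust
// Reduced complement record: ids are u64s instead of ObjectIds.
struct ComplementObject {
    migration_id: u64,
    data_id: u64,
    complement: Vec<u8>,
}

// Scan a commit's complements for the one matching BOTH the step's
// migration and the input data set. Matching on migration_id alone
// would be wrong: the same migration can apply to several data sets.
fn find_complement<'a>(
    commit_complements: &'a [ComplementObject],
    migration_id: u64,
    data_id: u64,
) -> Option<&'a ComplementObject> {
    commit_complements
        .iter()
        .find(|c| c.migration_id == migration_id && c.data_id == data_id)
}

fn main() {
    let stored = vec![
        ComplementObject { migration_id: 1, data_id: 10, complement: vec![1] },
        ComplementObject { migration_id: 2, data_id: 10, complement: vec![2] },
    ];
    assert_eq!(find_complement(&stored, 2, 10).unwrap().complement, vec![2]);
    // No match for a lossy step would surface as Error::MissingComplement.
    assert!(find_complement(&stored, 2, 11).is_none());
}
```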
22.6 DAG path migration: multi-step data migration through commit history
migrate_through_path handles three cases:
Linear path (ancestor to descendant): walk forward through each commit, applying migrate_forward at each step. Each step’s output DataSetObject becomes the next step’s input.
Linear path (descendant to ancestor): walk backward through each commit, applying migrate_backward with stored complements. This is the --backward flag in the CLI.
Non-linear path (across branches): find the LCA (lowest common ancestor) of the source and target commits. Walk backward from the source to the LCA, then forward from the LCA to the target. The backward leg consumes complements; the forward leg produces new ones.
The path is computed by dag::find_path, which returns a sequence of (CommitObject, Direction) pairs. Direction::Forward means the migration at this commit should be applied as-is. Direction::Backward means the migration should be inverted.
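Consuming that (commit, Direction) sequence can be sketched as a fold; here a "migration" is reduced to an integer delta so the apply/invert dispatch is visible, and the Direction enum and migrate_through_path shape are illustrative stand-ins:

```rust
// Direction tag returned alongside each commit on the path.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Direction {
    Forward,
    Backward,
}

// Fold a value through the path: Forward applies the step's migration
// (here: add the delta), Backward applies its inverse (subtract it).
fn migrate_through_path(start: i64, path: &[(i64, Direction)]) -> i64 {
    path.iter().fold(start, |acc, &(delta, dir)| match dir {
        Direction::Forward => acc + delta,
        Direction::Backward => acc - delta,
    })
}

fn main() {
    // Cross-branch path: backward to the LCA, then forward to the target.
    let path = [(3, Direction::Backward), (5, Direction::Forward)];
    assert_eq!(migrate_through_path(10, &path), 12);
}
```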
22.7 schema data sync pipeline
The schema data sync command (implemented in crates/panproto-cli/src/cmd/data.rs) synchronizes data files to match a target schema version:
- Open the repository and resolve the old and new schema versions (from HEAD's parent and HEAD, or a specific target ref).
- If the schemas are identical, report “already in sync” and return.
- Load both schemas and the protocol. Call auto_generate to derive a lens.
- Read JSON files from the data directory.
- Apply the lens to each file (forward direction).
- If --edits is set, construct an EditLogObject with the new schema ID, the old schema ID as data reference, an empty edits vector (the batch lens was used, not edit-by-edit), the migrated file count, and a zero complement ID. Store the object in the VCS.
The schema data status command reads the HEAD commit, counts JSON files in the data directory, and reports the HEAD schema ID and number of tracked data sets.
Both commands are in the cmd::data module and use shared helpers from cmd::helpers and cmd::migrate.
22.8 checkout_with_data / merge_with_data integration
File: crates/panproto-vcs/src/repo.rs
22.8.1 checkout_with_data
pub fn checkout_with_data(
&mut self,
target: &str,
data_dir: &Path,
) -> Result<Vec<ObjectId>>

- Resolve target to a commit ID.
- Read all data files from data_dir and construct a DataSetObject bound to the current HEAD schema.
- Find the DAG path from the current HEAD to the target commit.
- Call migrate_through_path with the data set and the path.
- Write migrated records back to data_dir.
- Perform the normal checkout (update HEAD, update index).
- Return the ObjectId values of any complements generated.
22.8.2 merge_with_data
pub fn merge_with_data(
&mut self,
branch: &str,
author: &str,
data_dir: &Path,
) -> Result<MergeResult>

- Perform the normal merge (three-way pushout, conflict detection).
- If the merge succeeds (no conflicts), derive the protolens chain from the current branch's schema to the merged schema.
- Read data from data_dir, construct a DataSetObject, and apply migrate_forward with the derived migration.
- Store complements.
- Write migrated records to data_dir.
- Include the data and complement ObjectId values in the merge commit's data_ids and complement_ids.
If the merge has conflicts, data migration is skipped. The MergeResult contains the conflicts, and the caller must resolve them before migrating data.
22.9 Index staging: staged_data, staged_protocol
File: crates/panproto-vcs/src/objects.rs
The Index struct gains two fields:
pub struct Index {
pub staged_schema: Option<StagedSchema>,
pub staged_data: Vec<StagedData>,
pub staged_protocol: Option<Protocol>,
}

22.9.1 StagedData
pub struct StagedData {
pub path: PathBuf,
pub data: DataSetObject,
}

StagedData records the filesystem path and the constructed DataSetObject. Multiple data directories can be staged simultaneously (one StagedData per directory).
Repository::add_data(path) reads all files in path, serializes them as instances, constructs a DataSetObject bound to the current HEAD schema, and appends to staged_data.
22.9.2 staged_protocol
Repository::add_protocol(protocol) sets staged_protocol to the provided Protocol value. At commit time, the protocol is stored as an Object::Protocol and its ObjectId is written to the CommitObject’s protocol_id field.
Only one protocol can be staged per commit. Calling add_protocol twice overwrites the first.
The Index is serialized as JSON to .panproto/index.json. The staged_data field uses serde(default) (defaults to empty vec) and staged_protocol is Option<Protocol> (defaults to None).