15  Testing and Simplification

How do you test whether a migration preserves the equations it claims to preserve? You need concrete data that actually exercises those equations—data that makes them active, not trivially true. schema scaffold generates exactly that: the simplest valid instance for any schema, the minimal dataset that touches every part of the theory.

Formally, the simplest valid instance is the smallest model \(M\) of a theory \(T\) such that every sort \(s \in T\) has at least one element and every equation \(e \in T\) is exercised by at least one tuple.

15.1 Generating test data with schema scaffold

Every schema theory has equations (see Chapter 3 for details). A migration must preserve those equations. The scaffolder generates minimal data that exercises all of them.

Think of scaffolded data as the “hello world” of your schema: the minimal dataset that exercises every sort, every operation, and every equation at least once. The key property: if a migration correctly handles the simplest valid data, it will handle the structural aspects of any valid data. Simplest valid data is a universal test case for structural correctness.

15.1.1 Basic usage

schema scaffold --protocol atproto --schema my-schema.json

This reads your schema, determines which theory it uses, and generates a minimal instance that satisfies all equations. The output is JSON conforming to your schema.

For a blog post schema with vertices {post, author, tag} and edges {written_by, tagged_with}, the scaffolded data might look like:

{
  "post": [{"id": "post_0"}],
  "author": [{"id": "author_0"}],
  "tag": [{"id": "tag_0"}],
  "written_by": [{"src": "post_0", "tgt": "author_0"}],
  "tagged_with": [{"src": "post_0", "tgt": "tag_0"}]
}

One element per sort, one mapping per operation. This populates every part of the schema with minimal cardinality.
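The generation rule is easy to picture in code. Below is a minimal sketch, not panproto's implementation: it assumes a schema represented as a list of vertex names plus a dict mapping each edge name to a (source, target) pair, and produces JSON-shaped output like the example above.

```python
# Hypothetical schema representation: vertices are sort names;
# edges map an edge name to a (source sort, target sort) pair.

def scaffold(vertices, edges):
    """Return one element per sort and one tuple per edge."""
    # One element per sort, named "<sort>_0".
    instance = {v: [{"id": f"{v}_0"}] for v in vertices}
    # One mapping per operation, connecting the single elements.
    for name, (src, tgt) in edges.items():
        instance[name] = [{"src": f"{src}_0", "tgt": f"{tgt}_0"}]
    return instance

blog = scaffold(
    vertices=["post", "author", "tag"],
    edges={"written_by": ("post", "author"), "tagged_with": ("post", "tag")},
)
# blog["written_by"] == [{"src": "post_0", "tgt": "author_0"}]
```

The point of the sketch is the invariant, not the code: every sort and every operation appears exactly once in the output, which is what makes the instance minimal.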

15.1.2 Controlling generation size

The minimal instance is useful for smoke tests, but you often want richer data that exercises more combinations. Two flags control this.

--depth: Controls how many layers of structure to generate. At depth 1 (the default), you get one element per sort. At depth 2, you get enough elements to exercise every pair of operations. At depth 3, every triple, and so on.

schema scaffold --protocol atproto --schema my-schema.json --depth 2

At depth 2, the blog post schema might produce multiple posts with different authors, exercising both written_by and tagged_with edges in various combinations.

--max-terms: Caps the total number of generated elements across all sorts. Useful when higher depths would produce more data than you want.

schema scaffold --protocol atproto --schema my-schema.json --depth 3 --max-terms 50

Exercise: Existence and termination

Does the simplest valid instance always exist for panproto’s theories? When does construction terminate?

Yes, for panproto’s equational theories. The theories are purely algebraic (no recursive constraints), so the construction terminates: start with one element per sort, then add elements required by operations, then verify equations. For theories with recursive types, an unfolding bound prevents non-termination, and the scaffolder produces the smallest instance within that bound.
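The bounded construction described in the answer can be sketched as a fixed-point loop. This is an illustration under an assumed theory representation (sorts plus total operations op: src → tgt), not panproto's actual algorithm; the unfolding bound `max_unfold` is the hypothetical knob mentioned above.

```python
# Hypothetical theory representation: a list of sorts, and operations
# mapping an op name to (source sort, target sort). Every source element
# must have an image under every operation.

def build_minimal(sorts, operations, max_unfold=10):
    elems = {s: [f"{s}_0"] for s in sorts}          # one element per sort
    mappings = {op: {} for op in operations}
    for _ in range(max_unfold):                      # unfolding bound
        changed = False
        for op, (src, tgt) in operations.items():
            for e in list(elems[src]):
                if e not in mappings[op]:
                    # Reuse an existing target element rather than creating
                    # a fresh one: this keeps the model minimal and reaches
                    # a fixed point for non-recursive theories.
                    mappings[op][e] = elems[tgt][0]
                    changed = True
        if not changed:                              # fixed point reached
            break
    return elems, mappings
```

Reusing an existing target element is what guarantees termination here; a construction that always created fresh elements would loop forever on a recursive operation, which is exactly why the unfolding bound exists.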

15.2 Protocol-specific generation

schema scaffold works with any protocol panproto supports. The generated data format matches the protocol’s instance theory:

# ATProto / JSON Schema: generates JSON objects
schema scaffold --protocol atproto --schema lexicon.json

# SQL DDL: generates INSERT statements
schema scaffold --protocol sql --schema schema.sql

# Protobuf: generates text-format proto messages
schema scaffold --protocol protobuf --schema messages.proto

# GraphQL: generates JSON matching the type system
schema scaffold --protocol graphql --schema schema.graphql

The structural properties are the same regardless of protocol. Simplest valid data exercises every equation in the theory, and migrations that handle it correctly are structurally sound.

15.3 Testing migrations with scaffolded data

The real power of schema scaffold is in migration testing. Here’s the workflow.

Step 1: Scaffold data for the old schema.

schema scaffold --protocol atproto --schema v1.json -o test-data-v1.json

Step 2: Run the migration.

schema lift --migration mig.json \
  --src-schema v1.json --tgt-schema v2.json \
  test-data-v1.json -o test-data-v2.json

Step 3: Scaffold data for the new schema and compare.

schema scaffold --protocol atproto --schema v2.json -o expected-v2.json

The migrated data (test-data-v2.json) should be a valid instance of the new schema. If the migration is correct, it’ll have the same structural shape as the freshly scaffolded data (expected-v2.json), though element names may differ.

Step 4: Validate.

schema verify --schema v2.json test-data-v2.json

This checks that the migrated data actually conforms to the new schema. If the migration broke an equation (say, it mapped a reflexive graph to a structure missing identity edges), this step catches it.
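The shape comparison from step 3 can also be scripted. A sketch, assuming the JSON instance layout shown earlier in this chapter: two instances have the same structural shape when every sort and operation has the same cardinality, regardless of element names.

```python
import json

def shape(instance):
    """Cardinality of every sort and operation, ignoring element names."""
    return {name: len(rows) for name, rows in instance.items()}

# Migrated data and freshly scaffolded data differ in names but not shape.
migrated = json.loads('{"post": [{"id": "p_7"}], "author": [{"id": "a_3"}]}')
expected = json.loads('{"post": [{"id": "post_0"}], "author": [{"id": "author_0"}]}')
assert shape(migrated) == shape(expected)  # same shape, different names
```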

15.3.1 When scaffolding is not enough

Scaffolded data tests structural correctness: do the sorts, operations, and equations survive the migration? It doesn’t test validation correctness: do the maxLength constraints, regex patterns, and numeric bounds survive?

For validation testing, you need data that exercises the boundary conditions of each constraint. panproto doesn’t generate this automatically. Constraint boundaries are domain-specific and often require human judgment about what constitutes a meaningful test case.

The recommended approach: use schema scaffold for structural regression tests (automated, run in CI), and write hand-crafted test cases for validation boundary tests (manual, reviewed by domain experts). Together, they cover the two layers of schema correctness: the algebraic structure (sorts, operations, equations) and the validation layer (constraints).
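A hand-crafted boundary test looks like this. The field and the 300-character maxLength are invented examples, not part of any schema in this chapter; the pattern is what matters: test exactly at the limit, one past it, and the degenerate case.

```python
# Hypothetical validation constraint: a post body capped at 300 characters.
MAX_BODY_LENGTH = 300

def validate_body(body: str) -> bool:
    return len(body) <= MAX_BODY_LENGTH

# Exercise the boundary conditions explicitly.
assert validate_body("x" * 300)        # exactly at the limit: valid
assert not validate_body("x" * 301)    # one over the limit: invalid
assert validate_body("")               # empty string: valid
```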

Exercise: Limitations of structural testing

Scaffolded data is a universal test case for structural correctness. What categories of bugs can it miss?

Scaffolded data is a universal test case for structural correctness: it exercises every sort, operation, and equation. But it cannot catch validation bugs (a maxLength set to the wrong value), semantic bugs (mapping email to username), or boundary condition bugs (data at exactly the constraint limit). Structural tests and hand-crafted boundary tests cover different failure modes and are complementary.

15.4 Simplifying schemas with schema normalize

Schemas accumulate cruft. A field gets copied instead of referenced. Two teams independently define the same concept under different names. A refactor leaves behind duplicate type definitions that are structurally identical but syntactically distinct. schema normalize identifies equivalent definitions, merges them, and produces a simplified schema that is structurally equivalent to the original but free of the duplication.

15.4.1 When schemas have duplicates

Consider an ATProto schema where two contributors independently defined type nodes for timestamps: DateTime and Timestamp. Both have exactly the same structure: same sort, same constraints, same edges. The schema has two names for the same concept.

This isn’t a bug. Both definitions are valid, and instances using either one are well-formed. But it’s a problem for migrations. When you diff two versions, changes to DateTime and changes to Timestamp show up as separate modifications, even though they’re logically the same thing. Merge conflicts between the two are spurious. Migration mappings have to map both, doubling the work.

15.4.2 Basic usage

The simplest invocation identifies two elements as equivalent:

schema normalize --protocol atproto --schema my-schema.json --identify DateTime=Timestamp

This tells panproto: “treat DateTime and Timestamp as the same thing.” The output is a new schema where:

  1. One definition survives. By default, the first-named element (DateTime) is kept and the second (Timestamp) is removed.
  2. All references are updated. Every edge that pointed to Timestamp points to DateTime instead. Every constraint attached to Timestamp is merged into DateTime’s constraint set.
  3. Equations are re-verified. The simplified schema’s theory is type-checked to ensure that merging didn’t break any equations.

You can choose which name survives with --prefer:

schema normalize --protocol atproto --schema my-schema.json \
  --identify DateTime=Timestamp --prefer Timestamp

If the two elements have different kinds (one is string and the other is integer), normalization fails. You can only identify elements that are structurally compatible.

15.4.3 What happens during normalization

When you identify two elements, panproto performs three steps.

Merge sorts. If DateTime and Timestamp are both vertices of the same kind, they’re merged into a single vertex.

Merge operations. All operations (edges) that reference the removed element are rewritten to reference the surviving one. If both elements had an outgoing edge with the same label and target, those edges are merged into one. If they had edges with the same label but different targets, both edges are kept (attached to the surviving vertex), and a warning is emitted suggesting you resolve the discrepancy.

Re-verify equations. After merging, the theory’s equations are type-checked in the simplified schema. This catches cases where the identification breaks an invariant.
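The three steps can be sketched on a toy representation. This is an illustration, not panproto's merge logic: each edge is a (label, source, target) triple, and identifying `removed` with `kept` rewrites references, deduplicates identical edges, and flags same-label edges that disagree on the target.

```python
def identify(vertices, edges, kept, removed):
    """Merge `removed` into `kept` over a hypothetical edge-list schema."""
    vertices = [v for v in vertices if v != removed]   # merge sorts
    rewritten, seen, warnings = [], set(), []
    for label, src, tgt in edges:
        # Rewrite references to the removed element.
        src = kept if src == removed else src
        tgt = kept if tgt == removed else tgt
        key = (label, src, tgt)
        if key in seen:
            continue                                   # merge identical edges
        # Same label out of the same source but a different target:
        # keep both edges and flag the discrepancy for manual resolution.
        if any(l == label and s == src and t != tgt for l, s, t in rewritten):
            warnings.append(f"conflicting targets for edge '{label}' on {src}")
        seen.add(key)
        rewritten.append(key)
    return vertices, rewritten, warnings
```

Equation re-verification is deliberately omitted from the sketch; it depends on the theory, which is exactly why the real tool type-checks the result rather than trusting the merge.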

15.4.4 Multiple identifications

You can identify more than two elements in a single command:

schema normalize --protocol atproto --schema my-schema.json \
  --identify DateTime=Timestamp \
  --identify UserID=AuthorID \
  --identify PostBody=ContentText

Identifications are applied in order, left to right. Each one simplifies the schema before the next is applied.
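Left-to-right application means a later identification can refer to a name that an earlier one kept. A sketch of that resolution, using a hypothetical rename map where each pair is (kept, removed):

```python
def apply_identifications(names, pairs):
    """Fold (kept, removed) pairs left to right into a rename map."""
    renames = {}
    for kept, removed in pairs:
        # Resolve chains: if `kept` was itself removed earlier, follow it.
        while kept in renames:
            kept = renames[kept]
        renames[removed] = kept
    return [renames.get(n, n) for n in names]
```

With the three identifications from the command above, any reference to Timestamp, AuthorID, or ContentText resolves to DateTime, UserID, or PostBody respectively.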

15.4.5 Discovering candidates for identification

For larger schemas, schema normalize has a discovery mode:

schema normalize --protocol atproto --schema my-schema.json --suggest

This analyzes the schema and reports pairs of elements that are structurally similar: same sort kind, same or similar edge patterns, same or similar constraints. Each suggestion includes a similarity score and explanation.

Suggested identifications:
  DateTime = Timestamp     (score: 1.00, identical structure)
  UserID = AuthorID        (score: 0.95, differ in: edge label "display_name" vs "name")
  PostBody = ContentText   (score: 0.88, differ in: constraint maxLength 300 vs 500)

A score of 1.00 means the elements are structurally identical; they can be merged with no loss. Lower scores indicate partial similarity where merging would require choosing which constraints or edges to keep.
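One plausible way to compute such a score, purely for intuition: Jaccard similarity over edge signatures and over constraint entries. The 50/50 weighting and the element representation here are invented, not panproto's scoring formula.

```python
def similarity(a_edges, b_edges, a_constraints, b_constraints):
    """Score in [0, 1]: 1.0 means structurally identical elements."""
    def jaccard(x, y):
        if not x and not y:
            return 1.0
        return len(x & y) / len(x | y)
    edge_score = jaccard(set(a_edges), set(b_edges))
    con_score = jaccard(set(a_constraints.items()), set(b_constraints.items()))
    # Hypothetical weighting: edges and constraints count equally.
    return round(0.5 * edge_score + 0.5 * con_score, 2)
```

Under this scoring, two elements with identical edges but a maxLength of 300 versus 500 would score 0.5 from edges alone, which matches the intuition that the report above attaches lower scores to constraint mismatches.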

Exercise: When not to normalize

Two elements might be structurally identical today but semantically different. When should you resist the temptation to normalize?

Two elements might be structurally identical today but semantically different. For example, EmailAddress and Username might both be string vertices with a maxLength constraint. Merging them would lose the distinction. A future schema change might add a pattern constraint to EmailAddress (requiring an @ sign) that shouldn’t apply to Username. Normalize only when the elements are genuinely the same concept under different names.

15.5 The full workflow: generate, migrate, simplify

The testing and simplification tools combine into a natural workflow for schema evolution:

# 1. Generate test data for the current schema
schema scaffold --protocol atproto --schema v1.json -o test-v1.json

# 2. Run the migration
schema lift --migration mig.json \
  --src-schema v1.json --tgt-schema v2.json \
  test-v1.json -o test-v2.json

# 3. Validate the migrated data
schema verify --schema v2.json test-v2.json

# 4. Simplify the new schema if needed
schema normalize --protocol atproto --schema v2.json --suggest

# 5. Commit the simplified version
schema add normalized.json
schema commit -m "normalize: merge DateTime and Timestamp"

The commit’s migration mapping captures exactly what changed: Timestamp vertices are mapped to DateTime vertices, and all downstream edges are remapped accordingly. This migration is invertible (with complement data), so you can reconstruct the pre-normalization schema if needed.

Normalization commits are especially useful before a merge. If two branches independently added similar definitions, normalizing one or both before merging reduces the chance of spurious conflicts.