2 The Schema Migration Problem

Every schema language has the same problem: data outlives the schema it was written against. The moment a schema changes, a gap opens between old data and new expectations, and something must close it. That something is a migration—a transformation derived from the relationship between the two schema versions.

panproto starts from the observation that schema languages, despite their surface differences, all describe the same kind of structural object: named containers holding typed, constrained members connected by edges. Because the structure is shared, the migration machinery can be shared too.

2.1 Four schemas for a post

To make the pattern concrete, here is the same entity—a social media post—described in four different schema languages.

2.1.1 ATProto Lexicon

Lexicon is a JSON-based schema language with protocol-specific concepts: namespaced identifiers (NSIDs), record types, and grapheme-cluster counting.

{
  "lexicon": 1,
  "id": "app.bsky.feed.post",
  "defs": {
    "main": {
      "type": "record",
      "description": "A post in a social feed.",
      "key": "tid",
      "record": {
        "type": "object",
        "required": ["text", "createdAt"],
        "properties": {
          "text": {
            "type": "string",
            "maxLength": 3000,
            "maxGraphemes": 300
          },
          "createdAt": {
            "type": "string",
            "format": "datetime"
          },
          "tags": {
            "type": "array",
            "items": {
              "type": "string",
              "maxLength": 640,
              "maxGraphemes": 64
            },
            "maxLength": 8
          }
        }
      }
    }
  }
}

The NSID app.bsky.feed.post identifies this record type across the AT Protocol network. Inside the record block, the shape language looks a lot like JSON Schema: a required array, a properties object, type annotations on each property. The one thing that is distinctly Lexicon is maxGraphemes, which counts Unicode grapheme clusters rather than bytes or code points.

2.1.2 SQL DDL (PostgreSQL)

CREATE TABLE posts (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    text        VARCHAR(3000)   NOT NULL,
    created_at  TIMESTAMPTZ     NOT NULL DEFAULT now(),
    tags        TEXT[]
);

SQL brings concepts the other formats lack: identity (PRIMARY KEY), server-side defaults (DEFAULT now()), and write-time enforcement (NOT NULL). The TEXT[] type is PostgreSQL’s array syntax; most other databases would model tags as a join table, which is itself a schema design decision with migration consequences.
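The join-table alternative mentioned above might look like this — a hedged sketch assuming PostgreSQL and the posts table defined earlier; the table and column names here are illustrative:

```sql
-- Each tag becomes a row rather than an array element.
CREATE TABLE post_tags (
    post_id  BIGINT NOT NULL REFERENCES posts (id),
    tag      TEXT   NOT NULL,
    PRIMARY KEY (post_id, tag)
);
```

Note the migration consequence: moving between TEXT[] and a join table changes the number of tables, not just a column type.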

2.1.3 Protocol Buffers

syntax = "proto3";

package social.feed;

import "google/protobuf/timestamp.proto";

message Post {
  string text = 1;
  google.protobuf.Timestamp created_at = 2;
  repeated string tags = 3;
}

Protobuf’s distinguishing feature is field numbers (= 1, = 2, = 3). They decouple the wire encoding from field names, which means you can rename a field without breaking existing consumers. The trade-off is that you can never reuse a field number. There is no built-in constraint language; if you want maxLength validation, you reach for an external tool like Buf and its protovalidate toolchain.
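Protobuf does give you a way to enforce the no-reuse rule at compile time. As a sketch, if created_at were ever removed, its number (and optionally its name) could be retired with reserved, making accidental reuse a compiler error:

```protobuf
syntax = "proto3";

package social.feed;

message Post {
  // Retire the removed field's number and name so no future
  // edit can reassign them to a different field.
  reserved 2;
  reserved "created_at";

  string text = 1;
  repeated string tags = 3;
}
```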

2.1.4 GraphQL SDL

"""A post in a social feed."""
type Post {
  text: String!
  createdAt: DateTime!
  tags: [String!]
}

GraphQL describes API surfaces, not storage. The ! suffix means non-nullable (GraphQL’s analog of required). [String!] is a nullable list of non-nullable strings; nullability is a type-level concern here, not a separate constraint annotation. There is no maxLength in the SDL. Validation happens in resolvers or custom directives.
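One common pattern is to carry constraint information as a custom directive, which the SDL permits but does not itself enforce. A hedged sketch (the directive name is illustrative, not a GraphQL built-in):

```graphql
scalar DateTime

directive @maxLength(value: Int!) on FIELD_DEFINITION

type Post {
  text: String! @maxLength(value: 3000)
  createdAt: DateTime!
  tags: [String!]
}
```

The directive is metadata; it only becomes validation if a server or tooling layer chooses to interpret it.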

2.2 What these four descriptions share

The syntax is different. The terminology is different. The operational semantics are different. But look at the structure:

Concept           | ATProto Lexicon | SQL DDL       | Protobuf     | GraphQL SDL
------------------|-----------------|---------------|--------------|------------
Container         | record > object | table         | message      | type
Named member      | property        | column        | field        | field
Member type       | type            | data type     | type         | return type
Required marker   | required[]      | NOT NULL      | (always set) | ! suffix
Length constraint | maxLength       | VARCHAR(n)    | (external)   | (external)
List marker       | type: array     | ARRAY / table | repeated     | [T] syntax

The table has one row per structural concept and one column per language. Every column is filled. The differences are real—foreign keys, field numbers, nullability operators, grapheme counting—but they are extensions to a shared core, not alternative foundations. If the differences were superficial, a format converter would suffice. If they were fundamental, no general tool could help. The reality is in between, and that middle ground is what panproto formalizes.

Exercise

Which of these four languages can express that tags has at most 8 elements? Which cannot? What does this tell you about how migration between these languages must handle constraints?

ATProto Lexicon can express it directly (maxLength: 8 on the array). SQL can approximate it with a CHECK constraint but has no native array-length syntax in the DDL. Protobuf and GraphQL cannot; they have no constraint vocabulary in their schema languages. Cross-protocol migration must either silently drop the constraint, carry it as an out-of-band annotation, or reject the translation as lossy.
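The PostgreSQL approximation could be sketched as follows — a hedged example, with an invented constraint name; cardinality() counts the elements of an array:

```sql
-- Approximate "at most 8 tags" on the TEXT[] column.
-- NULL arrays pass the check, as CHECK treats NULL as satisfied.
ALTER TABLE posts
    ADD CONSTRAINT posts_tags_max_8 CHECK (cardinality(tags) <= 8);
```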

Structurally, the post schema is a directed graph: three vertices (Post, String, DateTime) and three edges (text, createdAt, tags). The next chapter makes that precise.
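That graph can be written down as plain data — an illustrative sketch with hypothetical names, not the panproto representation:

```typescript
// The post schema as a directed graph: vertices are types,
// labeled edges are the fields connecting them.
interface Edge { from: string; to: string; label: string }

const vertices: string[] = ["Post", "String", "DateTime"];
const edges: Edge[] = [
  { from: "Post", to: "String",   label: "text" },
  { from: "Post", to: "DateTime", label: "createdAt" },
  { from: "Post", to: "String",   label: "tags" },
];

console.log(`${vertices.length} vertices, ${edges.length} edges`);
```

Note that two edges (text and tags) share the same endpoints and differ only in their labels, which is why the edges carry names rather than being a bare adjacency relation.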

Exercise

SQL has a PRIMARY KEY concept. Does any other language in the table above have an equivalent? What does this asymmetry mean for cross-protocol translation?

None of the other three has a direct equivalent. GraphQL has no identity concept in the SDL. Protobuf’s field numbers are wire-format metadata, not primary keys. ATProto’s key: "tid" is the closest analog—it specifies an ID generation strategy—but it is a record-level annotation rather than a structural constraint on a column. Cross-protocol translation must either drop the PRIMARY KEY information or carry it as an opaque annotation.

Comparing schemas, then, requires more than syntax translation. We need a representation that can track what each language expresses and what it leaves out.

2.3 Building schemas in panproto

Here is the same post schema built with panproto’s TypeScript SDK for two protocols. The point is not the API surface (which you can look up in the reference); the point is the shape of the construction.

2.3.1 ATProto

import { Panproto } from '@panproto/core';

// start snippet init
const panproto = await Panproto.init();
const atproto = panproto.protocol('atproto');
// end snippet init

// start snippet build-schema
const postSchema = atproto.schema()
  .vertex('post', 'record', { nsid: 'app.bsky.feed.post' })
  .vertex('post:body', 'object')
  .vertex('post:body.text', 'string')
  .vertex('post:body.createdAt', 'string')
  .edge('post', 'post:body', 'record-schema')
  .edge('post:body', 'post:body.text', 'prop', { name: 'text' })
  .edge('post:body', 'post:body.createdAt', 'prop', { name: 'createdAt' })
  .constraint('post:body.text', 'maxLength', '3000')
  .constraint('post:body.text', 'maxGraphemes', '300')
  .constraint('post:body.createdAt', 'format', 'datetime')
  .build();
// end snippet build-schema

Schema types become vertices. The fields connecting them become edges. Constraints are annotations on vertices. The builder API mirrors Lexicon’s own structure: vertices for the record, the inner object, and each property’s type; edges connecting them; constraints attached where they belong.

2.3.2 SQL

import { Panproto } from '@panproto/core';

const panproto = await Panproto.init();

// start snippet build-schema
const sql = panproto.protocol('sql');

const postTable = sql.schema()
  .vertex('posts', 'table')
  .vertex('posts.id', 'integer')
  .vertex('posts.text', 'string')
  .vertex('posts.created_at', 'string')
  .edge('posts', 'posts.id', 'column', { name: 'id' })
  .edge('posts', 'posts.text', 'column', { name: 'text' })
  .edge('posts', 'posts.created_at', 'column', { name: 'created_at' })
  .constraint('posts.text', 'maxLength', '3000')
  .build();
// end snippet build-schema

Same shape. Different vocabulary. The word “table” replaces “record,” “column” replaces “property,” but the construction logic—declare vertices, connect them with edges, attach constraints—is identical. What varies is the vocabulary of vertex kinds, edge kinds, and constraint types.

That vocabulary is exactly what we will formalize next. The next chapter asks: what, precisely, do these graph structures have in common, and how can we describe the variation between schema languages formally enough for an engine to compute with it?