11 Cross-Protocol Translation

Every protocol in panproto has two functions: $\mathrm{parse}_P$ reads a native schema document into the universal Schema graph $G$, and $\mathrm{emit}_P$ writes $G$ back out to native format. The translation pipeline is $\mathrm{emit}_B \circ\; m \circ\; \mathrm{parse}_A$, where $m$ is an optional migration. This is pandoc for schemas: a shared intermediate representation connecting every pair of formats.

11.1 The parse / schema / migrate / emit pipeline

Every translation has four stages:

graph LR
    A["Source document<br/>(format A)"]
    B["Schema graph<br/>(universal IR)"]
    C["Schema graph<br/>(optionally transformed)"]
    D["Target document<br/>(format B)"]

    A -->|"parse_*()"| B
    B -->|"migrate / transform"| C
    C -->|"emit_*()"| D

Figure 11.1: The translation pipeline. Every protocol has both a parse and an emit function. The Schema graph is the universal intermediate representation.

Parse. The function $\mathrm{parse}_A$ reads native schema notation and constructs a Schema graph $G$. Constructs with no graph-level equivalent are dropped or converted to best-effort annotations.

Schema graph. The universal intermediate representation is the same Schema graph used for diffs, migrations, lens derivation, and breaking-change detection. It isn’t a separate format; it’s the standard representation.

Migrate / transform. This stage is optional. Apply a panproto migration to $G$ between parse and emit to rename a field, flatten a nested type, or restructure a hierarchy. Without this stage, the pipeline is direct format translation; with it, you translate and transform simultaneously.

Emit. The function $\mathrm{emit}_B$ walks the Schema graph and produces native schema notation: it assigns field numbers where the target format requires them, chooses appropriate type names, and generates the correct syntax.

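Every one of panproto’s 76 built-in protocols has both a parse and an emit function, so 152 functions cover all 76 × 75 = 5,700 ordered pairs of distinct protocols. Every pair is connected by a potential translation path, although the fidelity of that path varies (Section 11.4).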
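The composition $\mathrm{emit}_B \circ m \circ \mathrm{parse}_A$ can be sketched in a few lines. The following is a minimal illustration, not panproto's implementation: the `SchemaGraph` shape, the line-based "format A", the brace-based "format B", and the `renameField` migration are all hypothetical stand-ins chosen to keep the example self-contained.

```typescript
// Toy stand-in for the Schema graph; NOT panproto's real data structure.
interface FieldDef { name: string; type: string }
interface SchemaGraph { types: Record<string, FieldDef[]> }

type Parse = (source: string) => SchemaGraph;
type Migration = (g: SchemaGraph) => SchemaGraph;
type Emit = (g: SchemaGraph) => string;

// emit_B ∘ m ∘ parse_A, with m defaulting to the identity migration.
function translate(
  parse: Parse,
  emit: Emit,
  migrate: Migration = (g) => g,
): (source: string) => string {
  return (source) => emit(migrate(parse(source)));
}

// Hypothetical format A: one type per line, "User: id string, full_name string".
const parseA: Parse = (source) => {
  const types: Record<string, FieldDef[]> = {};
  for (const line of source.split("\n")) {
    const colon = line.indexOf(":");
    if (colon < 0) continue; // skip blank or malformed lines
    const typeName = line.slice(0, colon).trim();
    types[typeName] = line
      .slice(colon + 1)
      .split(",")
      .map((f) => {
        const parts = f.trim().split(/\s+/);
        return { name: parts[0] ?? "", type: parts[1] ?? "" };
      });
  }
  return { types };
};

// Hypothetical format B: "type User { id: string, full_name: string }".
const emitB: Emit = (g) =>
  Object.keys(g.types)
    .map((typeName) => {
      const fields = (g.types[typeName] ?? []).map((f) => `${f.name}: ${f.type}`);
      return `type ${typeName} { ${fields.join(", ")} }`;
    })
    .join("\n");

// A migration that renames one field of one type and leaves the rest untouched.
const renameField = (typeName: string, from: string, to: string): Migration => (g) => {
  const types: Record<string, FieldDef[]> = {};
  for (const key of Object.keys(g.types)) {
    types[key] = (g.types[key] ?? []).map((f) =>
      key === typeName && f.name === from ? { ...f, name: to } : f);
  }
  return { types };
};

const source = "User: id string, full_name string";
const direct = translate(parseA, emitB)(source);
const renamed = translate(parseA, emitB, renameField("User", "full_name", "name"))(source);
```

Here `direct` is a plain format translation, while `renamed` translates and transforms in one pass. Because the migration operates on the graph, the same `renameField` would work unchanged with any other parse and emit pair.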

11.2 A real example: Protobuf to GraphQL

You have gRPC services defined in .proto files and want to expose them through a GraphQL API. Tools like grpc-gateway solve the adjacent problem of exposing gRPC over REST/JSON, using code generation driven by hand-maintained annotations. With panproto, the schema translation is structural.

Here is the Protobuf service definition:

syntax = "proto3";

package social.v1;

message UserProfile {
  string user_id = 1;
  string display_name = 2;
  repeated string interests = 3;
  ProfileStatus status = 4;
}

enum ProfileStatus {
  PROFILE_STATUS_UNSPECIFIED = 0;
  ACTIVE = 1;
  SUSPENDED = 2;
}

message GetUserRequest {
  string user_id = 1;
}

message GetUserResponse {
  UserProfile profile = 1;
}

service UserService {
  rpc GetUser(GetUserRequest) returns (GetUserResponse);
  rpc ListUsers(ListUsersRequest) returns (ListUsersResponse);
}

panproto emits this as GraphQL SDL:

type UserProfile {
  userId: String!
  displayName: String!
  interests: [String!]!
  status: ProfileStatus!
}

enum ProfileStatus {
  ACTIVE
  SUSPENDED
}

type Query {
  getUser(userId: String!): UserProfile
  listUsers(limit: Int, cursor: String): UserProfileConnection
}

type UserProfileConnection {
  edges: [UserProfileEdge!]!
  pageInfo: PageInfo!
}

type UserProfileEdge {
  node: UserProfile!
  cursor: String!
}

The translation pipeline looks like this:

import { parse_proto, emit_graphql, PROTOBUF_SPEC, GRAPHQL_SPEC } from "@panproto/core";

const schema = parse_proto(protoSource, PROTOBUF_SPEC);
const graphqlSdl = emit_graphql(schema, GRAPHQL_SPEC);

What happens during translation:

Message types become GraphQL type declarations. Field names shift from snake_case to camelCase.
Enums carry over directly, minus the UNSPECIFIED sentinel (no GraphQL equivalent).
Field numbers are dropped. GraphQL has no wire-format metadata.
repeated fields become GraphQL list types ([String!]!).
RPC methods become Query fields. The request message is flattened into arguments; the response becomes the return type.
Service definitions drive generation of connection/edge types for list endpoints (Relay pagination pattern).

The structural content (type names, field names, field types, enum variants, nesting relationships) survives. What is lost is Protobuf-specific wire metadata. What is gained is GraphQL-specific idiom.
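Two of the rules above are simple enough to sketch directly: the `snake_case` to `camelCase` renaming and the removal of the `UNSPECIFIED` sentinel. The helpers below are illustrative only. Their names are hypothetical, and the sentinel test (value 0 with a name ending in `_UNSPECIFIED`) is an assumption about how such a rule could be written, not a description of panproto's emitter.

```typescript
// snake_case -> camelCase, as applied to Protobuf field names in the example.
function snakeToCamel(fieldName: string): string {
  return fieldName.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

interface EnumValue { name: string; number: number }

// Assumed rule: drop the zero-valued *_UNSPECIFIED sentinel, keep everything else.
function graphqlEnumValues(values: EnumValue[]): string[] {
  return values
    .filter((v) => !(v.number === 0 && v.name.endsWith("_UNSPECIFIED")))
    .map((v) => v.name);
}

const camelFields = ["user_id", "display_name", "interests", "status"].map(snakeToCamel);

const enumNames = graphqlEnumValues([
  { name: "PROFILE_STATUS_UNSPECIFIED", number: 0 },
  { name: "ACTIVE", number: 1 },
  { name: "SUSPENDED", number: 2 },
]);
```

Applied to the `UserProfile` message and `ProfileStatus` enum, these produce the field and value names seen in the emitted SDL.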

Exercise: Translation composition

If you translate Protobuf to GraphQL and then GraphQL to OpenAPI, do you get the same result as translating Protobuf directly to OpenAPI? Under what conditions does $\mathrm{emit}_C \circ \mathrm{parse}_B \circ \mathrm{emit}_B \circ \mathrm{parse}_A = \mathrm{emit}_C \circ \mathrm{parse}_A$?

Answer

Not in general. The intermediate pass through format B can lose information. If B’s theory is a strict subset of A’s, the first translation drops what B cannot represent, and the second translation cannot recover it. The equation holds only when B can represent everything that the Schema graph carries from A. Within a theory group (where protocols share theories), the equation holds at the level of graph structure.
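The answer can be checked on a deliberately crude model in which a format is nothing more than the set of structural features it can represent, and passing a graph through a format keeps only the constructs it supports. Everything here (the `Construct` type, the feature names, the three formats) is a hypothetical simplification for illustration.

```typescript
// A construct in the graph, tagged with the structural feature it relies on.
type Construct = { name: string; feature: string };

// Passing a graph through a format keeps only what that format can represent.
const through = (supported: Set<string>) => (g: Construct[]): Construct[] =>
  g.filter((c) => supported.has(c.feature));

const names = (g: Construct[]): string => g.map((c) => c.name).join(",");

const formatA = new Set(["vertex", "edge", "interface"]);
const formatB = new Set(["vertex", "edge"]);               // strict subset of A
const formatBSameGroup = new Set(["vertex", "edge", "interface"]);
const formatC = new Set(["vertex", "edge", "interface"]);

// What parse_A produces from a source document that format A fully supports.
const graph: Construct[] = through(formatA)([
  { name: "User", feature: "vertex" },
  { name: "User.id", feature: "edge" },
  { name: "Node", feature: "interface" },
]);

const direct = names(through(formatC)(graph));
const viaB = names(through(formatC)(through(formatB)(graph)));
const viaSameGroup = names(through(formatC)(through(formatBSameGroup)(graph)));
```

In this model `viaB` has lost `Node`, because the narrower format B dropped it and format C has nothing to recover it from, while `viaSameGroup` matches `direct`.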

11.3 Another example: ATProto Lexicon to ActivityPub

You need to bridge Bluesky and Mastodon by translating between ATProto’s Lexicon schema format and ActivityPub’s JSON-LD vocabulary.

ATProto Lexicon (Bluesky post schema):

{
  "lexicon": 1,
  "id": "app.bsky.feed.post",
  "defs": {
    "main": {
      "type": "record",
      "key": "tid",
      "record": {
        "type": "object",
        "required": ["text", "createdAt"],
        "properties": {
          "text": {
            "type": "string",
            "maxLength": 3000,
            "maxGraphemes": 300
          },
          "createdAt": { "type": "string", "format": "datetime" },
          "reply": { "type": "ref", "ref": "#replyRef" },
          "embed": { "type": "union", "refs": [
            "app.bsky.embed.images",
            "app.bsky.embed.video",
            "app.bsky.embed.external",
            "app.bsky.embed.record",
            "app.bsky.embed.recordWithMedia"
          ]},
          "langs": {
            "type": "array",
            "items": { "type": "string" },
            "maxLength": 3
          }
        }
      }
    }
  }
}

Translated to an ActivityPub Note (JSON-LD):

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Note",
  "attributedTo": null,
  "content": { "@type": "xsd:string", "maxLength": 300 },
  "published": { "@type": "xsd:dateTime" },
  "inReplyTo": { "@type": "@id" },
  "attachment": {
    "@type": "@set",
    "items": { "oneOf": ["Image", "Link"] }
  },
  "contentMap": {
    "@type": "@language",
    "items": { "@type": "xsd:string" }
  }
}

Two function calls handle it:

import { parse_lexicon, emit_activitypub, ATPROTO_SPEC, ACTIVITYPUB_SPEC } from "@panproto/core";

const schema = parse_lexicon(lexiconSource, ATPROTO_SPEC);
const apSchema = emit_activitypub(schema, ACTIVITYPUB_SPEC);

The translation maps:

text (with maxLength: 3000 bytes and maxGraphemes: 300 grapheme clusters) to content (with maxLength: 300): constraint approximation, since ATProto tracks both byte and grapheme length while ActivityPub has a single length concept
createdAt to published: direct datetime mapping
reply.ref to inReplyTo: reference semantics preserved
embed union (images, video, external links, record embeds) to attachment set with oneOf: union-to-set mapping
langs array to contentMap language map: structural reinterpretation

The Bluesky-specific dual constraints are approximated to a single maxLength. The ATProto tid key has no ActivityPub equivalent and is dropped. The core content structure (text, timestamps, replies, embeds, language tags) translates faithfully.
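The constraint approximation for `text` can be sketched as a small function that collapses the two Lexicon limits into one and reports what it discarded. This is an assumption-laden illustration: the type names and the function are hypothetical, and the choice to prefer the grapheme limit is inferred from the example output (3000 bytes and 300 graphemes becoming `maxLength: 300`), not taken from panproto's source.

```typescript
interface LexiconStringConstraints { maxLength?: number; maxGraphemes?: number }
interface SingleLengthConstraint { maxLength?: number }

interface Approximation {
  result: SingleLengthConstraint;
  dropped: string[]; // human-readable notes about discarded constraints
}

// Assumed policy: keep the grapheme limit when present, since it is the
// user-visible length; otherwise carry the byte limit over as-is.
function approximateLength(c: LexiconStringConstraints): Approximation {
  if (c.maxGraphemes !== undefined) {
    const dropped = c.maxLength !== undefined ? [`maxLength (bytes): ${c.maxLength}`] : [];
    return { result: { maxLength: c.maxGraphemes }, dropped };
  }
  if (c.maxLength !== undefined) {
    return { result: { maxLength: c.maxLength }, dropped: [] };
  }
  return { result: {}, dropped: [] };
}

const textConstraint = approximateLength({ maxLength: 3000, maxGraphemes: 300 });
```

Returning the dropped constraints alongside the result makes the approximation visible to the caller instead of silent.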

11.4 Round-trip fidelity

Translation is not lossless in general. What survives a round trip ($\mathrm{parse}_A \to \mathrm{emit}_B \to \mathrm{parse}_B \to \mathrm{emit}_A$) depends on how much structural territory the two formats share.

Two sources of information loss exist.

Parse loss. When $\mathrm{parse}_A$ reads a source document, constructs with no graph-level equivalent are dropped. SQL CHECK constraints containing arbitrary expressions may be captured only as opaque annotations.

Emit loss. When $\mathrm{emit}_B$ writes a target document, constructs in the Schema graph with no target-format equivalent are dropped or approximated. Hyperedges are dropped when emitting to a format with no hyperedge concept.

Translation	What survives	What is typically lost
JSON Schema to TypeScript	Field names, types, optionality	Constraint keywords (`maxLength`, `pattern`)
SQL DDL to Parquet	Column names, types, nullability	Foreign keys, `CHECK` constraints, defaults
GraphQL SDL to OpenAPI	Type names, field names, types, interfaces	Directives, resolver-level semantics
Protobuf to FlatBuffers	Message names, field names, types	Field numbers, service definitions
Protobuf to GraphQL	Type names, field names, types, enums	Field numbers, wire encoding, streaming RPCs
ATProto Lexicon to ActivityPub	Content structure, timestamps, references	Grapheme constraints, record keys, NSIDs

The middle column corresponds to the structural skeleton: the graph of types and relationships. The right column corresponds to format-specific decorations outside the Schema graph’s scope.
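Emit loss can be made explicit by having the emit step return what it could not express. The sketch below assumes a hypothetical flat list of graph elements and a target format described only by the element kinds it supports. It mirrors the SQL DDL to Parquet row of the table rather than any real panproto API.

```typescript
type ElementKind = "vertex" | "edge" | "hyperedge" | "constraint";

interface GraphElement { kind: ElementKind; label: string }

interface EmitResult {
  emitted: GraphElement[]; // elements the target format can express
  dropped: GraphElement[]; // elements with no target-format equivalent
}

// Partition the graph by what the target supports instead of silently discarding.
function emitWithReport(graph: GraphElement[], supports: ReadonlySet<ElementKind>): EmitResult {
  const result: EmitResult = { emitted: [], dropped: [] };
  for (const el of graph) {
    (supports.has(el.kind) ? result.emitted : result.dropped).push(el);
  }
  return result;
}

// A small SQL-like schema graph.
const sqlGraph: GraphElement[] = [
  { kind: "vertex", label: "orders" },
  { kind: "edge", label: "orders.total" },
  { kind: "hyperedge", label: "FOREIGN KEY orders.customer_id -> customers.id" },
  { kind: "constraint", label: "CHECK (total >= 0)" },
];

// Hypothetical columnar target: columns and types only.
const columnarTarget: ReadonlySet<ElementKind> = new Set<ElementKind>(["vertex", "edge"]);

const report = emitWithReport(sqlGraph, columnarTarget);
```

The `dropped` list corresponds to the right-hand column of the table: the foreign key and the `CHECK` constraint do not reach the target document.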

11.5 The theory group advantage

The theory architecture explains why some translations are lossless and others are not.

Every protocol has a pair of theories: a schema theory and an instance theory (Chapter 9). Protocols with closely related theories produce higher-fidelity translations. This gives “structural comparability” a precise meaning.

panproto’s built-in protocols cluster into theory groups based on their schema and instance theories:

Table 11.1: Theory groups for selected protocols. Protocols in the same group share the same schema and instance theories.

Group	Schema theory family	Instance theory	Representative protocols
A	$\text{ThCategory}$	$\text{ThFunctor}$	CQL
B	$\text{ThHypergraph} + \text{ThConstraint}$	$\text{ThFunctor}$	SQL DDL, Cassandra, DynamoDB
C	$\text{ThGraph} + \text{ThConstraint} + \text{ThMulti}$	$\text{ThWType}$	JSON Schema, ATProto, Avro, Protobuf, Thrift, FlatBuffers
D	$\text{ThGraph} + \text{ThConstraint} + \text{ThMulti} + \text{ThInterface}$	$\text{ThWType}$	GraphQL SDL, OpenAPI, AsyncAPI
E	$\text{ThGraph} + \text{ThConstraint} + \text{ThMulti}$	$\text{ThFunctor}$	Parquet, Arrow, DataFrame

Within-group translation is structurally lossless. Any two Group C protocols, such as Avro and Protobuf, share the same schema and instance theories. A Schema graph that is a valid Avro schema is, structurally, also a valid Protobuf schema. The translation reduces to remapping surface syntax: Avro’s "type": "record" becomes Protobuf’s message. The graph structure itself doesn’t change.

Cross-group translation involves structural mismatch. Converting from Group B (SQL, hypergraph) to Group C (Protobuf, multigraph) loses hyperedge structure: foreign keys, composite unique constraints, and multi-column primary keys can’t be expressed in Group C’s multigraph theory. Converting in the other direction loses tree-shaped instance semantics: SQL rows are flat, and nested document structure doesn’t survive without denormalization decisions the engine can’t make.

graph LR
    A["Group A<br/>CQL"]
    B["Group B<br/>SQL, Cassandra, DynamoDB"]
    C["Group C<br/>JSON Schema, Avro,<br/>Protobuf, FlatBuffers"]
    D["Group D<br/>GraphQL, OpenAPI,<br/>AsyncAPI"]
    E["Group E<br/>Parquet, Arrow"]

    C <-->|lossless| C
    D <-->|lossless| D
    B <-->|lossless| B

    C <-.->|"structural loss<br/>(hyperedges dropped)"| B
    C <-.->|"structural loss<br/>(no interfaces)"| D
    D <-.->|"structural loss<br/>(hyperedges dropped)"| B
    B <-.->|"structural loss<br/>(instance theory mismatch)"| E
    C <-.->|"structural loss<br/>(instance theory mismatch)"| E

Figure 11.2: Theory groups and translation fidelity. Solid arrows: structurally lossless (within-group). Dashed arrows: structural loss (cross-group).

Group membership is determined by which building-block theories a protocol composes (Chapter 9). When two protocols share the same building blocks, their Schema graphs have the same shape, and translation is relabeling. When the building blocks differ, translation must bridge a structural gap, and information is lost there.
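The grouping rule can be expressed as a comparison of building-block sets. The sketch below copies the theory names from Table 11.1. The `ProtocolTheories` shape and the `compareTheories` function are hypothetical and only illustrate the rule that identical building blocks mean the same group.

```typescript
interface ProtocolTheories { schema: string[]; instance: string }

interface Comparison {
  sameGroup: boolean;
  onlyInA: string[];        // schema building blocks the target lacks
  onlyInB: string[];        // schema building blocks the source lacks
  instanceMismatch: boolean;
}

function compareTheories(a: ProtocolTheories, b: ProtocolTheories): Comparison {
  const onlyInA = a.schema.filter((t) => !b.schema.includes(t));
  const onlyInB = b.schema.filter((t) => !a.schema.includes(t));
  const instanceMismatch = a.instance !== b.instance;
  return {
    sameGroup: onlyInA.length === 0 && onlyInB.length === 0 && !instanceMismatch,
    onlyInA,
    onlyInB,
    instanceMismatch,
  };
}

// Entries taken from Table 11.1.
const sqlDdl: ProtocolTheories = { schema: ["ThHypergraph", "ThConstraint"], instance: "ThFunctor" };
const avro: ProtocolTheories = { schema: ["ThGraph", "ThConstraint", "ThMulti"], instance: "ThWType" };
const protobuf: ProtocolTheories = { schema: ["ThGraph", "ThConstraint", "ThMulti"], instance: "ThWType" };
const graphqlSdl: ProtocolTheories = {
  schema: ["ThGraph", "ThConstraint", "ThMulti", "ThInterface"],
  instance: "ThWType",
};
const parquet: ProtocolTheories = { schema: ["ThGraph", "ThConstraint", "ThMulti"], instance: "ThFunctor" };

const avroToProtobuf = compareTheories(avro, protobuf);      // both Group C
const graphqlToProtobuf = compareTheories(graphqlSdl, protobuf); // D to C
const sqlToProtobuf = compareTheories(sqlDdl, protobuf);     // B to C
const protobufToParquet = compareTheories(protobuf, parquet); // C to E
```

The non-empty `onlyInA` list names the structure that has nowhere to go in the target: `ThInterface` for GraphQL to Protobuf, `ThHypergraph` for SQL to Protobuf. Protobuf and Parquet share every schema block and differ only in the instance theory.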

Exercise: Theory group boundaries

Group C and Group D differ only by $\text{ThInterface}$. Does a Group D protocol always embed losslessly into a “Group D without interfaces” representation (that is, into Group C)? Or does the presence of $\text{ThInterface}$ in the colimit change the structure of sorts that existed before?

Answer

Not always. The presence of $\text{ThInterface}$ in the colimit can introduce new equations and sort identifications. If a Group D schema uses interface types (e.g., a GraphQL Node interface implemented by User and Post), removing $\text{ThInterface}$ loses the subtyping relationships. The User and Post types would appear as unrelated vertices, erasing the information that they share a common interface.

11.5.1 A note on “lossless”

“Structurally lossless at the schema level” means the graph structure of the schema survives the round trip. It does not mean the round trip produces identical source bytes, or that all protocol-specific metadata is preserved. Protobuf field numbers aren’t part of the multigraph structure; they’re a surface feature of the wire format. A Protobuf-to-Avro-to-Protobuf round trip won’t preserve field numbers; the emit step assigns fresh ones. The schema graph (message names, field names, types, relationships) will be intact.

If your translation requirements include preserving wire-format metadata, carry it as an opaque annotation through the Schema graph. The Schema graph supports arbitrary key-value metadata on vertices and edges; parse functions use this to preserve information meaningful to the source format. A custom emit wrapper can then read and apply these annotations selectively.
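A custom emit wrapper of this kind might look like the following sketch. The annotation key `protobuf.fieldNumber`, the `AnnotatedField` shape, and `emitProtoMessage` are all hypothetical, since the text does not specify panproto's actual metadata keys or emit hooks. The wrapper reuses a field number when an annotation carries one and assigns the lowest unused number otherwise.

```typescript
interface AnnotatedField {
  name: string;
  type: string;
  meta: Record<string, string>; // opaque key-value annotations carried through the graph
}

// Hypothetical annotation key; not a documented panproto name.
const FIELD_NUMBER_KEY = "protobuf.fieldNumber";

// Read a positive integer field number from the annotations, if one is present.
function annotatedNumber(f: AnnotatedField): number | undefined {
  const n = Number(f.meta[FIELD_NUMBER_KEY]);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

function emitProtoMessage(messageName: string, fields: AnnotatedField[]): string {
  // Reserve every annotated number first so fresh numbers never collide with them.
  const used = new Set<number>();
  for (const f of fields) {
    const n = annotatedNumber(f);
    if (n !== undefined) used.add(n);
  }
  let next = 1;
  const lines = fields.map((f) => {
    let n = annotatedNumber(f);
    if (n === undefined) {
      while (used.has(next)) next++;
      n = next;
      used.add(n);
    }
    return `  ${f.type} ${f.name} = ${n};`;
  });
  return `message ${messageName} {\n${lines.join("\n")}\n}`;
}

const emittedMessage = emitProtoMessage("UserProfile", [
  { name: "user_id", type: "string", meta: { [FIELD_NUMBER_KEY]: "1" } },
  { name: "display_name", type: "string", meta: {} },
  { name: "status", type: "ProfileStatus", meta: { [FIELD_NUMBER_KEY]: "4" } },
]);
```

In this example `user_id` and `status` keep their original numbers, and `display_name`, which arrived without an annotation, receives 2. The sketch does not validate conflicting annotations; a production wrapper would need to reject two fields that claim the same number.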

11.6 Comparison with pandoc

The pandoc analogy is worth examining in detail, because the structural parallel is exact in some places and breaks down in others.

What’s the same. Both pandoc and panproto translate between multiple formats using a shared intermediate representation. pandoc’s IR is a document abstract syntax tree. panproto’s IR is the Schema graph $G$. In both systems, all translation goes through the IR; neither system converts directly between pairs of formats.

Both face the same fundamental limitation: information that can’t be represented in the IR is lost. pandoc can’t round-trip Word documents with custom styles and tracked changes. panproto can’t round-trip SQL schemas with arbitrary CHECK expressions.

What’s different. pandoc’s IR is a fixed, hand-designed document AST. Adding a new element type requires changing pandoc’s source code. panproto’s IR isn’t fixed; its structure is determined by the protocol theories. Adding a new structural concept means extending a theory via composition (Chapter 9), not patching the IR definition.

The second difference is in the use of the IR. In pandoc, the IR exists solely for conversion. In panproto, the Schema graph is the primary representation for all operations: diff, migration, existence checking, lens derivation, breaking-change detection, and translation all operate on the same object.

Table 11.2: Structural comparison between pandoc and panproto.

	pandoc	panproto
Translates between	Document formats	Schema formats
Intermediate representation	Pandoc document AST	Schema graph $G$
IR defined by	Fixed Haskell data types	Protocol theories (extensible)
IR used for	Conversion only	Conversion, diff, migration, lenses, …
Loss mechanism	AST has no element for $X$	Schema graph has no vertex/edge for $X$
Format-specific extensions	Custom AST metadata	Theory-specific constraint annotations

Exercise: IR extensibility tradeoff

pandoc’s fixed IR means every format author targets the same stable structure. panproto’s extensible IR means theories can grow. If protocol A extends the IR with a new sort via composition, does protocol B (written before that extension) need to be updated? What guarantees backward compatibility of the shared IR?

Answer

No. Protocol B targets the building-block theories it was composed from. A new sort added by protocol A via composition lives in A’s theory, not in the shared building blocks. The shared IR (the building-block theories) is stable; extensions happen in individual protocol theories. Backward compatibility of the shared IR is guaranteed by the colimit construction: adding new sorts to a component theory does not alter the existing sorts or operations in the shared base.

11.7 Further reading

The functorial data migration framework underlying cross-protocol translation is due to Spivak (2012). The theory group structure is a consequence of the compositional protocol definitions described in Lynch et al. (2024). The connection between theory morphisms and information-preserving translations is developed in Cartmell (1986).

Cross-protocol translation works because panproto gives every schema element a structural identity. That identity is protocol-relative: two elements in different protocols may refer to “the same thing” under different names. The next chapter formalizes how naming correspondences are established, tracked, and composed.

Cartmell, John. 1986. “Generalised Algebraic Theories and Contextual Categories.” Annals of Pure and Applied Logic 32: 209–43. https://doi.org/10.1016/0168-0072(86)90053-9.

Lynch, Owen, Kris Brown, James Fairbanks, and Evan Patterson. 2024. “GATlab: Modeling and Programming with Generalized Algebraic Theories.” arXiv preprint arXiv:2404.04837. https://doi.org/10.48550/arXiv.2404.04837.

Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217: 31–51. https://arxiv.org/abs/1009.1166.


| Group | Schema theory family | Instance theory | Representative protocols |
|---|---|---|---|
| A | $\text{ThCategory}$ | $\text{ThFunctor}$ | CQL |
| B | $\text{ThHypergraph} + \text{ThConstraint}$ | $\text{ThFunctor}$ | SQL DDL, Cassandra, DynamoDB |
| C | $\text{ThGraph} + \text{ThConstraint} + \text{ThMulti}$ | $\text{ThWType}$ | JSON Schema, ATProto, Avro, Protobuf, Thrift, FlatBuffers |
| D | $\text{ThGraph} + \text{ThConstraint} + \text{ThMulti} + \text{ThInterface}$ | $\text{ThWType}$ | GraphQL SDL, OpenAPI, AsyncAPI |
| E | $\text{ThGraph} + \text{ThConstraint} + \text{ThMulti}$ | $\text{ThFunctor}$ | Parquet, Arrow, DataFrame |
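
The group table can be read as a lookup: two protocols translate without structural loss when they compose the same schema building blocks and share an instance theory, and otherwise the differing theories indicate where loss may occur. The sketch below encodes the table rows as plain Python data to illustrate that reading. It is not panproto's API; `TheoryGroup`, `GROUPS`, `PROTOCOL_GROUP`, and `loss_reasons` are hypothetical names introduced here, and comparing theory names is only a coarse approximation of the structural analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TheoryGroup:
    """One row of the theory-group table (hypothetical helper type)."""
    schema: frozenset  # names of the schema building-block theories
    instance: str      # name of the instance theory


# Transcribed from the table above; theory names are used as plain strings.
GROUPS = {
    "A": TheoryGroup(frozenset({"ThCategory"}), "ThFunctor"),
    "B": TheoryGroup(frozenset({"ThHypergraph", "ThConstraint"}), "ThFunctor"),
    "C": TheoryGroup(frozenset({"ThGraph", "ThConstraint", "ThMulti"}), "ThWType"),
    "D": TheoryGroup(
        frozenset({"ThGraph", "ThConstraint", "ThMulti", "ThInterface"}), "ThWType"
    ),
    "E": TheoryGroup(frozenset({"ThGraph", "ThConstraint", "ThMulti"}), "ThFunctor"),
}

# Representative protocols from the table, mapped to their group letter.
PROTOCOL_GROUP = {
    "CQL": "A",
    "SQL DDL": "B", "Cassandra": "B", "DynamoDB": "B",
    "JSON Schema": "C", "ATProto": "C", "Avro": "C",
    "Protobuf": "C", "Thrift": "C", "FlatBuffers": "C",
    "GraphQL SDL": "D", "OpenAPI": "D", "AsyncAPI": "D",
    "Parquet": "E", "Arrow": "E", "DataFrame": "E",
}


def loss_reasons(source: str, target: str) -> list[str]:
    """Return why a translation may lose structure; an empty list means
    the two protocols share building blocks and an instance theory."""
    src = GROUPS[PROTOCOL_GROUP[source]]
    dst = GROUPS[PROTOCOL_GROUP[target]]
    reasons = []
    # Building blocks present on one side only mark a structural gap.
    differing = sorted(src.schema ^ dst.schema)
    if differing:
        reasons.append("schema theories differ: " + ", ".join(differing))
    if src.instance != dst.instance:
        reasons.append(
            f"instance theory mismatch: {src.instance} vs {dst.instance}"
        )
    return reasons


if __name__ == "__main__":
    for pair in [("Avro", "Protobuf"), ("GraphQL SDL", "JSON Schema"), ("Avro", "Parquet")]:
        print(pair, loss_reasons(*pair) or "structurally lossless")
```

An empty result corresponds to the within-group case, where translation is relabeling. A non-empty result only names the theories that differ; it does not say which schema elements are affected.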