11  Cross-Protocol Translation

Every protocol in panproto has two functions: \(\mathrm{parse}_P\) reads a native schema document into the universal Schema graph \(G\), and \(\mathrm{emit}_P\) writes \(G\) back out to native format. The translation pipeline is \(\mathrm{emit}_B \circ\; m \circ\; \mathrm{parse}_A\), where \(m\) is an optional migration. This is pandoc for schemas: a shared intermediate representation connecting every pair of formats.

11.1 The parse / schema / migrate / emit pipeline

Every translation has four stages:

graph LR
    A["Source document<br/>(format A)"]
    B["Schema graph<br/>(universal IR)"]
    C["Schema graph<br/>(optionally transformed)"]
    D["Target document<br/>(format B)"]

    A -->|"parse_*()"| B
    B -->|"migrate / transform"| C
    C -->|"emit_*()"| D

Figure 11.1: The translation pipeline. Every protocol has both a parse and an emit function. The Schema graph is the universal intermediate representation.

Parse. The function \(\mathrm{parse}_A\) reads native schema notation and constructs a Schema graph \(G\). Constructs with no graph-level equivalent are dropped or converted to best-effort annotations.

Schema graph. The universal intermediate representation is the same Schema graph used for diffs, migrations, lens derivation, and breaking-change detection. It isn’t a separate format; it’s the standard representation.

Migrate / transform. This stage is optional. Apply a panproto migration to \(G\) between parse and emit to rename a field, flatten a nested type, or restructure a hierarchy. Without this stage, the pipeline is direct format translation; with it, you translate and transform simultaneously.

Emit. The function \(\mathrm{emit}_B\) walks the Schema graph and produces native schema notation, assigning field numbers, choosing appropriate type names, and generating the correct syntax.

Every one of panproto’s 76 built-in protocols has both a parse and an emit function. Every pair of protocols is connected by a potential translation path.
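The stages compose as ordinary functions. The TypeScript sketch below is a toy model rather than panproto’s API: the `SchemaGraph`, `Protocol`, `translate`, and `renameField` names are hypothetical, and the two "formats" are trivial line-based notations chosen only to show the shape of \(\mathrm{emit}_B \circ m \circ \mathrm{parse}_A\).

```typescript
// Toy model of the pipeline. Every name here is an illustrative assumption,
// not part of panproto's actual API.
interface Vertex { id: string; kind: string }
interface Edge { src: string; tgt: string; name: string }
interface SchemaGraph { vertices: Vertex[]; edges: Edge[] }
type Migration = (g: SchemaGraph) => SchemaGraph;

interface Protocol {
  parse(source: string): SchemaGraph;
  emit(g: SchemaGraph): string;
}

// emit_B . m . parse_A, with the migration defaulting to the identity.
function translate(a: Protocol, b: Protocol, source: string, m?: Migration): string {
  const parsed = a.parse(source);
  const migrated = m ? m(parsed) : parsed;
  return b.emit(migrated);
}

function addVertex(g: SchemaGraph, id: string, kind: string): void {
  for (const v of g.vertices) if (v.id === id) return;
  g.vertices.push({ id, kind });
}

// Toy format A: one "Owner.field:Type" entry per line.
const dotted: Protocol = {
  parse(source: string): SchemaGraph {
    const g: SchemaGraph = { vertices: [], edges: [] };
    for (const line of source.split("\n")) {
      if (line.trim() === "") continue;
      const [path, target] = line.trim().split(":");
      const [owner, field] = path.split(".");
      addVertex(g, owner, "object");
      addVertex(g, target, "scalar");
      g.edges.push({ src: owner, tgt: target, name: field });
    }
    return g;
  },
  emit(g: SchemaGraph): string {
    return g.edges.map(e => `${e.src}.${e.name}:${e.tgt}`).join("\n");
  },
};

// Toy format B: one "Owner -field-> Type" entry per line.
const arrow: Protocol = {
  parse(source: string): SchemaGraph {
    const g: SchemaGraph = { vertices: [], edges: [] };
    for (const line of source.split("\n")) {
      const m = /^(\w+) -(\w+)-> (\w+)$/.exec(line.trim());
      if (!m) continue;
      addVertex(g, m[1], "object");
      addVertex(g, m[3], "scalar");
      g.edges.push({ src: m[1], tgt: m[3], name: m[2] });
    }
    return g;
  },
  emit(g: SchemaGraph): string {
    return g.edges.map(e => `${e.src} -${e.name}-> ${e.tgt}`).join("\n");
  },
};

// A migration that renames one field (edge) and leaves everything else alone.
const renameField = (from: string, to: string): Migration => g => ({
  vertices: g.vertices,
  edges: g.edges.map(e => (e.name === from ? { ...e, name: to } : e)),
});

const direct = translate(dotted, arrow, "User.name:string");
const renamed = translate(dotted, arrow, "User.name:string", renameField("name", "displayName"));
```

Because both toy protocols implement `parse` and `emit`, the same `translate` function also runs in the opposite direction by swapping its first two arguments.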

11.2 A real example: Protobuf to GraphQL

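You have gRPC services defined in .proto files and want to expose them through a GraphQL API. Existing bridges tend to rely on hand-written glue: grpc-gateway, for instance, targets REST rather than GraphQL and is typically configured through HTTP annotations added by hand. With panproto, the schema translation is structural.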

Here is the Protobuf service definition:

syntax = "proto3";

package social.v1;

message UserProfile {
  string user_id = 1;
  string display_name = 2;
  repeated string interests = 3;
  ProfileStatus status = 4;
}

enum ProfileStatus {
  PROFILE_STATUS_UNSPECIFIED = 0;
  ACTIVE = 1;
  SUSPENDED = 2;
}

message GetUserRequest {
  string user_id = 1;
}

message GetUserResponse {
  UserProfile profile = 1;
}

service UserService {
  rpc GetUser(GetUserRequest) returns (GetUserResponse);
  // ListUsersRequest and ListUsersResponse are not shown in this excerpt.
  rpc ListUsers(ListUsersRequest) returns (ListUsersResponse);
}

panproto emits this as GraphQL SDL:

type UserProfile {
  userId: String!
  displayName: String!
  interests: [String!]!
  status: ProfileStatus!
}

enum ProfileStatus {
  ACTIVE
  SUSPENDED
}

type Query {
  getUser(userId: String!): UserProfile
  listUsers(limit: Int, cursor: String): UserProfileConnection
}

type UserProfileConnection {
  edges: [UserProfileEdge!]!
  pageInfo: PageInfo!
}

type UserProfileEdge {
  node: UserProfile!
  cursor: String!
}

The translation pipeline looks like this:

import { parse_proto, emit_graphql, PROTOBUF_SPEC, GRAPHQL_SPEC } from "@panproto/core";

const schema = parse_proto(protoSource, PROTOBUF_SPEC);
const graphqlSdl = emit_graphql(schema, GRAPHQL_SPEC);

What happens during translation:

  • Message types become GraphQL type declarations. Field names shift from snake_case to camelCase.
  • Enums carry over directly, minus the UNSPECIFIED sentinel (no GraphQL equivalent).
  • Field numbers are dropped. GraphQL has no wire-format metadata.
  • repeated fields become GraphQL list types ([String!]!).
  • RPC methods become Query fields. The request message is flattened into arguments, and a single-field response wrapper is unwrapped to its payload type (GetUserResponse becomes UserProfile).
  • Service definitions drive generation of connection/edge types for list endpoints (Relay pagination pattern).

The structural content (type names, field names, field types, enum variants, nesting relationships) survives. What was lost is Protobuf-specific wire metadata. What was gained is GraphQL-specific idiom.
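Several of these rules are mechanical enough to sketch directly. The helpers below are hypothetical (`snakeToCamel`, `emitGraphqlType`, and `emitGraphqlEnum` are not panproto functions) and handle only a small set of scalar types. The sentinel rule in particular is an assumption: the sketch treats the zero-numbered value whose name ends in `_UNSPECIFIED` as the sentinel.

```typescript
// Illustrative only: these helpers are assumptions, not panproto's emitter.
interface ProtoField { name: string; type: string; repeated: boolean; number: number }
interface ProtoEnumValue { name: string; number: number }

// Protobuf scalar names mapped to GraphQL scalars. Any other name passes
// through unchanged as a reference to a message or enum type.
const SCALARS: { [proto: string]: string } = {
  string: "String", bool: "Boolean", int32: "Int", float: "Float", double: "Float",
};

function snakeToCamel(name: string): string {
  return name.replace(/_([a-z0-9])/g, (_match: string, c: string) => c.toUpperCase());
}

function graphqlType(f: ProtoField): string {
  const known = Object.prototype.hasOwnProperty.call(SCALARS, f.type);
  const base = (known ? SCALARS[f.type] : f.type) + "!";
  return f.repeated ? `[${base}]!` : base;
}

// Field numbers are never read: GraphQL has nowhere to put them.
function emitGraphqlType(name: string, fields: ProtoField[]): string {
  const body = fields.map(f => `  ${snakeToCamel(f.name)}: ${graphqlType(f)}`);
  return [`type ${name} {`].concat(body, ["}"]).join("\n");
}

// Assumption: the sentinel is the zero value whose name ends in _UNSPECIFIED.
function emitGraphqlEnum(name: string, values: ProtoEnumValue[]): string {
  const kept = values.filter(v => !(v.number === 0 && /_UNSPECIFIED$/.test(v.name)));
  return [`enum ${name} {`].concat(kept.map(v => `  ${v.name}`), ["}"]).join("\n");
}

const userProfile = emitGraphqlType("UserProfile", [
  { name: "user_id", type: "string", repeated: false, number: 1 },
  { name: "display_name", type: "string", repeated: false, number: 2 },
  { name: "interests", type: "string", repeated: true, number: 3 },
  { name: "status", type: "ProfileStatus", repeated: false, number: 4 },
]);

const profileStatus = emitGraphqlEnum("ProfileStatus", [
  { name: "PROFILE_STATUS_UNSPECIFIED", number: 0 },
  { name: "ACTIVE", number: 1 },
  { name: "SUSPENDED", number: 2 },
]);
```

Applied to the message and enum from the example, these helpers reproduce the `UserProfile` type and `ProfileStatus` enum shown in the SDL above. The RPC-to-Query and Relay connection rules need more context and are left out of the sketch.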

Exercise: Translation composition

If you translate Protobuf to GraphQL and then GraphQL to OpenAPI, do you get the same result as translating Protobuf directly to OpenAPI? Under what conditions does \(\mathrm{emit}_C \circ \mathrm{parse}_B \circ \mathrm{emit}_B \circ \mathrm{parse}_A = \mathrm{emit}_C \circ \mathrm{parse}_A\)?

Not in general. The intermediate round trip through format B can lose information. If B’s theory is a strict subset of A’s, the first translation drops what B cannot represent, and the second translation cannot recover it. The equation holds only when B can represent everything that the Schema graph carries from A. Within a theory group, where protocols share the same schema and instance theories, this condition is met and the equation holds at the level of graph structure.
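A toy model makes the condition concrete. Suppose a schema is summarized by the set of structural features it uses, and passing through a format keeps only the features that format supports. The feature names and the `via` helper are hypothetical; the sketch only shows that routing through a narrower format B removes something that the direct route keeps.

```typescript
// Toy model: a schema is reduced to the set of structural features it uses,
// and a format to the set of features it can represent. Names are illustrative.
type Feature = string;

// Emitting into a format and parsing back keeps only what the format supports.
function via(format: Feature[], schema: Feature[]): Feature[] {
  return schema.filter(f => format.indexOf(f) !== -1);
}

const schemaFromA: Feature[] = ["types", "fields", "constraints"];
const formatB: Feature[] = ["types", "fields"];
const formatC: Feature[] = ["types", "fields", "constraints"];

const direct = via(formatC, schemaFromA);               // A -> C
const routed = via(formatC, via(formatB, schemaFromA)); // A -> B -> C

// When B supports everything the schema carries, routing through B is harmless.
const widerB: Feature[] = ["types", "fields", "constraints", "interfaces"];
const routedViaWiderB = via(formatC, via(widerB, schemaFromA));
```

Here `direct` still contains `"constraints"` while `routed` does not, because format B could not represent it. `routedViaWiderB` agrees with `direct`.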

11.3 Another example: ATProto Lexicon to ActivityPub

You need to bridge Bluesky and Mastodon by translating between ATProto’s Lexicon schema format and ActivityPub’s JSON-LD vocabulary.

ATProto Lexicon (Bluesky post schema):

{
  "lexicon": 1,
  "id": "app.bsky.feed.post",
  "defs": {
    "main": {
      "type": "record",
      "key": "tid",
      "record": {
        "type": "object",
        "required": ["text", "createdAt"],
        "properties": {
          "text": {
            "type": "string",
            "maxLength": 3000,
            "maxGraphemes": 300
          },
          "createdAt": { "type": "string", "format": "datetime" },
          "reply": { "type": "ref", "ref": "#replyRef" },
          "embed": { "type": "union", "refs": [
            "app.bsky.embed.images",
            "app.bsky.embed.video",
            "app.bsky.embed.external",
            "app.bsky.embed.record",
            "app.bsky.embed.recordWithMedia"
          ]},
          "langs": {
            "type": "array",
            "items": { "type": "string" },
            "maxLength": 3
          }
        }
      }
    }
  }
}

Translated to an ActivityPub Note (JSON-LD):

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Note",
  "attributedTo": null,
  "content": { "@type": "xsd:string", "maxLength": 300 },
  "published": { "@type": "xsd:dateTime" },
  "inReplyTo": { "@type": "@id" },
  "attachment": {
    "@type": "@set",
    "items": { "oneOf": ["Image", "Link"] }
  },
  "contentMap": {
    "@type": "@language",
    "items": { "@type": "xsd:string" }
  }
}

Two function calls handle it:

import { parse_lexicon, emit_activitypub, ATPROTO_SPEC, ACTIVITYPUB_SPEC } from "@panproto/core";

const schema = parse_lexicon(lexiconSource, ATPROTO_SPEC);
const apSchema = emit_activitypub(schema, ACTIVITYPUB_SPEC);

The translation maps:

  • text (with maxLength: 3000 bytes and maxGraphemes: 300 grapheme clusters) to content (with maxLength: 300): constraint approximation, since ATProto tracks both byte and grapheme length while ActivityPub has a single length concept
  • createdAt to published: direct datetime mapping
  • reply.ref to inReplyTo: reference semantics preserved
  • embed union (images, video, external links, record embeds) to attachment set with oneOf: union-to-set mapping
  • langs array to contentMap language map: structural reinterpretation

The Bluesky-specific dual constraints are approximated to a single maxLength. The ATProto tid key has no ActivityPub equivalent and is dropped. The core content structure (text, timestamps, replies, embeds, language tags) translates faithfully.
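The constraint approximation can be pictured as a small function. The sketch below assumes the emitter prefers the grapheme limit when both limits are present, which matches the example output, and records what it discarded. The `approximateLength` name and the shape of its result are hypothetical.

```typescript
// Hypothetical sketch of collapsing ATProto's two length limits into one.
interface LexiconStringConstraints { maxLength?: number; maxGraphemes?: number }
interface SingleLength { maxLength?: number; dropped: string[] }

function approximateLength(c: LexiconStringConstraints): SingleLength {
  const dropped: string[] = [];
  if (c.maxGraphemes !== undefined) {
    // Assumption: the grapheme limit wins and the byte limit is discarded.
    if (c.maxLength !== undefined) dropped.push(`maxLength=${c.maxLength} (bytes)`);
    return { maxLength: c.maxGraphemes, dropped };
  }
  // Only a byte limit (or nothing) is present. Reusing it as a single length
  // is itself an approximation, since bytes and characters differ.
  return { maxLength: c.maxLength, dropped };
}

const noteContent = approximateLength({ maxLength: 3000, maxGraphemes: 300 });
```

For the `text` property above, `noteContent.maxLength` is 300 and `noteContent.dropped` records the discarded 3000-byte limit, so the approximation is at least visible to whoever inspects the result.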

11.4 Round-trip fidelity

Translation is not lossless in general. What survives a round trip (\(\mathrm{parse}_A \to \mathrm{emit}_B \to \mathrm{parse}_B \to \mathrm{emit}_A\)) depends on how much structural territory the two formats share.

Two sources of information loss exist.

Parse loss. When \(\mathrm{parse}_A\) reads a source document, constructs with no graph-level equivalent are dropped or reduced to annotations. For example, SQL CHECK constraints containing arbitrary expressions may be captured only as opaque annotations.

Emit loss. When \(\mathrm{emit}_B\) writes a target document, constructs in the Schema graph with no target-format equivalent are dropped or approximated. Hyperedges are dropped when emitting to a format with no hyperedge concept.
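Emit loss can be made explicit by partitioning the graph before writing it out. The sketch below is an illustration under simplifying assumptions: the `Construct` type, the `partitionForTarget` helper, and the single `supportsHyperedges` flag are hypothetical, and a foreign key is modeled as a hyperedge.

```typescript
// Hypothetical model of the constructs an emitter walks over.
type Construct =
  | { kind: "edge"; name: string; src: string; tgt: string }
  | { kind: "hyperedge"; name: string; members: string[] };

interface EmitResult { kept: Construct[]; dropped: Construct[] }

// Split constructs into those the target can express and those it cannot.
function partitionForTarget(constructs: Construct[], supportsHyperedges: boolean): EmitResult {
  const result: EmitResult = { kept: [], dropped: [] };
  for (const c of constructs) {
    const ok = c.kind === "edge" || supportsHyperedges;
    (ok ? result.kept : result.dropped).push(c);
  }
  return result;
}

// An orders table: two columns plus a foreign key, modeled here as a hyperedge.
const orders: Construct[] = [
  { kind: "edge", name: "id", src: "orders", tgt: "int" },
  { kind: "edge", name: "customer_id", src: "orders", tgt: "int" },
  { kind: "hyperedge", name: "fk_customer", members: ["orders.customer_id", "customers.id"] },
];

// A target with no hyperedge concept keeps the columns and drops the key.
const toColumnar = partitionForTarget(orders, false);
```

Returning the dropped constructs alongside the kept ones lets a caller log or reject a lossy translation instead of discovering the loss later.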

| Translation | What survives | What is typically lost |
| --- | --- | --- |
| JSON Schema to TypeScript | Field names, types, optionality | Constraint keywords (maxLength, pattern) |
| SQL DDL to Parquet | Column names, types, nullability | Foreign keys, CHECK constraints, defaults |
| GraphQL SDL to OpenAPI | Type names, field names, types, interfaces | Directives, resolver-level semantics |
| Protobuf to FlatBuffers | Message names, field names, types | Field numbers, service definitions |
| Protobuf to GraphQL | Type names, field names, types, enums | Field numbers, wire encoding, streaming RPCs |
| ATProto Lexicon to ActivityPub | Content structure, timestamps, references | Grapheme constraints, record keys, NSIDs |

The middle column corresponds to the structural skeleton: the graph of types and relationships. The right column corresponds to format-specific decorations outside the Schema graph’s scope.

11.5 The theory group advantage

The theory architecture explains why some translations are lossless and others are not.

Every protocol has a pair of theories: a schema theory and an instance theory (Chapter 9). Protocols with closely related theories produce higher-fidelity translations. This gives “structural comparability” a precise meaning.

panproto’s built-in protocols cluster into theory groups based on their schema and instance theories:

Table 11.1: Theory groups for selected protocols. Protocols in the same group share the same schema and instance theories.
| Group | Schema theory family | Instance theory | Representative protocols |
| --- | --- | --- | --- |
| A | \(\text{ThCategory}\) | \(\text{ThFunctor}\) | CQL |
| B | \(\text{ThHypergraph} + \text{ThConstraint}\) | \(\text{ThFunctor}\) | SQL DDL, Cassandra, DynamoDB |
| C | \(\text{ThGraph} + \text{ThConstraint} + \text{ThMulti}\) | \(\text{ThWType}\) | JSON Schema, ATProto, Avro, Protobuf, Thrift, FlatBuffers |
| D | \(\text{ThGraph} + \text{ThConstraint} + \text{ThMulti} + \text{ThInterface}\) | \(\text{ThWType}\) | GraphQL SDL, OpenAPI, AsyncAPI |
| E | \(\text{ThGraph} + \text{ThConstraint} + \text{ThMulti}\) | \(\text{ThFunctor}\) | Parquet, Arrow, DataFrame |

Within-group translation is structurally lossless. Two Group C protocols (Avro and Protobuf) share the same schema theory. A Schema graph that is a valid Avro schema is, structurally, also a valid Protobuf schema. The translation reduces to remapping surface syntax: Avro’s "type": "record" becomes Protobuf’s message. The graph structure itself doesn’t change.

Cross-group translation involves structural mismatch. Converting from Group B (SQL, hypergraph) to Group C (Protobuf, multigraph) loses hyperedge structure: foreign keys, composite unique constraints, and multi-column primary keys can’t be expressed in Group C’s multigraph theory. Converting in the other direction loses tree-shaped instance semantics: SQL rows are flat, and nested document structure doesn’t survive without denormalization decisions the engine can’t make.

graph LR
    A["Group A<br/>CQL"]
    B["Group B<br/>SQL, Cassandra, DynamoDB"]
    C["Group C<br/>JSON Schema, Avro,<br/>Protobuf, FlatBuffers"]
    D["Group D<br/>GraphQL, OpenAPI,<br/>AsyncAPI"]
    E["Group E<br/>Parquet, Arrow"]

    C <-->|lossless| C
    D <-->|lossless| D
    B <-->|lossless| B

    C <-.->|"structural loss<br/>(hyperedges dropped)"| B
    C <-.->|"structural loss<br/>(no interfaces)"| D
    D <-.->|"structural loss<br/>(hyperedges dropped)"| B
    B <-.->|"structural loss<br/>(instance theory mismatch)"| E
    C <-.->|"structural loss<br/>(instance theory mismatch)"| E

Figure 11.2: Theory groups and translation fidelity. Solid arrows: structurally lossless (within-group). Dashed arrows: structural loss (cross-group).

Group membership is determined by which building-block theories a protocol composes (Chapter 9). When two protocols share the same building blocks, their Schema graphs have the same shape, and translation is relabeling. When the building blocks differ, translation must bridge a structural gap, and information is lost there.
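The group test amounts to comparing building blocks. The sketch below transcribes a few rows of Table 11.1 into data and reports where two protocols differ. The `THEORIES` table layout and the `compare` function are hypothetical, written only to illustrate the comparison.

```typescript
interface TheoryPair { schema: string[]; instance: string }

// A few rows transcribed from Table 11.1.
const THEORIES: { [protocol: string]: TheoryPair } = {
  "SQL DDL": { schema: ["ThHypergraph", "ThConstraint"], instance: "ThFunctor" },
  "Avro": { schema: ["ThGraph", "ThConstraint", "ThMulti"], instance: "ThWType" },
  "Protobuf": { schema: ["ThGraph", "ThConstraint", "ThMulti"], instance: "ThWType" },
  "GraphQL SDL": {
    schema: ["ThGraph", "ThConstraint", "ThMulti", "ThInterface"],
    instance: "ThWType",
  },
  "Parquet": { schema: ["ThGraph", "ThConstraint", "ThMulti"], instance: "ThFunctor" },
};

// Building blocks present in `from` but absent from `to`.
function missing(from: string[], to: string[]): string[] {
  return from.filter(t => to.indexOf(t) === -1);
}

interface Comparison { sameGroup: boolean; differences: string[] }

// Two protocols are in the same group when both theories coincide.
function compare(source: string, target: string): Comparison {
  const a = THEORIES[source];
  const b = THEORIES[target];
  const differences = missing(a.schema, b.schema)
    .map(t => `${t} only in ${source}`)
    .concat(missing(b.schema, a.schema).map(t => `${t} only in ${target}`));
  if (a.instance !== b.instance) {
    differences.push(`instance theory ${a.instance} vs ${b.instance}`);
  }
  return { sameGroup: differences.length === 0, differences };
}

const avroToProtobuf = compare("Avro", "Protobuf");
const protobufToGraphql = compare("Protobuf", "GraphQL SDL");
const sqlToProtobuf = compare("SQL DDL", "Protobuf");
```

Avro and Protobuf report no differences, Protobuf and GraphQL SDL differ only by `ThInterface`, and SQL DDL and Protobuf differ in both the schema building blocks and the instance theory, mirroring the dashed arrows in Figure 11.2.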

Exercise: Theory group boundaries

Group C and Group D differ only by \(\text{ThInterface}\). Does a Group D protocol always embed losslessly into a “Group D without interfaces” representation? Or does the presence of \(\text{ThInterface}\) in the colimit change the structure of sorts that existed before?

Not always. The presence of \(\text{ThInterface}\) in the colimit can introduce new equations and sort identifications. If a Group D schema uses interface types (e.g., a GraphQL Node interface implemented by User and Post), removing \(\text{ThInterface}\) loses the subtyping relationships. The User and Post types would appear as unrelated vertices, erasing the information that they share a common interface.
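The loss is easy to see on a small example. In the sketch below, interface membership is stored as explicit "implements" edges, and projecting onto a representation without interfaces removes them. The `Schema` shape and both helper functions are hypothetical.

```typescript
// Hypothetical, minimal representation of interface membership.
interface ImplementsEdge { type: string; iface: string }
interface Schema { types: string[]; interfaces: string[]; implementsEdges: ImplementsEdge[] }

// Interfaces that both types implement.
function commonInterfaces(s: Schema, a: string, b: string): string[] {
  const ifacesOf = (t: string): string[] =>
    s.implementsEdges.filter(e => e.type === t).map(e => e.iface);
  const ofB = ifacesOf(b);
  return ifacesOf(a).filter(i => ofB.indexOf(i) !== -1);
}

// Projecting onto a representation without interfaces: interface vertices
// and implements edges have nowhere to go, so they are removed.
function dropInterfaces(s: Schema): Schema {
  return { types: s.types, interfaces: [], implementsEdges: [] };
}

const withInterfaces: Schema = {
  types: ["User", "Post"],
  interfaces: ["Node"],
  implementsEdges: [
    { type: "User", iface: "Node" },
    { type: "Post", iface: "Node" },
  ],
};

const before = commonInterfaces(withInterfaces, "User", "Post");
const after = commonInterfaces(dropInterfaces(withInterfaces), "User", "Post");
```

`before` contains `"Node"` and `after` is empty: both types survive the projection, but nothing in the result says they were ever related.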

11.5.1 A note on “lossless”

“Structurally lossless at the schema level” means the graph structure of the schema survives the round trip. It does not mean the round trip produces identical source bytes, or that all protocol-specific metadata is preserved. Protobuf field numbers aren’t part of the multigraph structure; they’re a surface feature of the wire format. A Protobuf-to-Avro-to-Protobuf round trip won’t preserve field numbers; the emit step assigns fresh ones. The schema graph (message names, field names, types, relationships) will be intact.

If your translation requirements include preserving wire-format metadata, carry it as an opaque annotation through the Schema graph. The Schema graph supports arbitrary key-value metadata on vertices and edges; parse functions use this to preserve information meaningful to the source format. A custom emit wrapper can then read and apply these annotations selectively.
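As a sketch of that approach, assume the parse step stored each Protobuf field number under an annotation key. The key name `protobuf.field_number`, the `Field` shape, and both emit functions below are hypothetical. The wrapper reuses stored numbers and assigns the lowest unused number to fields that have none.

```typescript
interface Field { name: string; type: string; annotations: { [key: string]: string } }

const NUMBER_KEY = "protobuf.field_number"; // hypothetical annotation key

// Default behavior: assign fresh field numbers in declaration order.
function emitFields(fields: Field[]): string[] {
  return fields.map((f, i) => `${f.type} ${f.name} = ${i + 1};`);
}

// Wrapper: reuse annotated numbers, fill the gaps for unannotated fields.
function emitFieldsPreservingNumbers(fields: Field[]): string[] {
  const used: { [n: number]: boolean } = {};
  for (const f of fields) {
    const stored = f.annotations[NUMBER_KEY];
    if (stored !== undefined) used[Number(stored)] = true;
  }
  let next = 1;
  return fields.map(f => {
    const stored = f.annotations[NUMBER_KEY];
    let n: number;
    if (stored !== undefined) {
      n = Number(stored);
    } else {
      while (used[next]) next++; // skip numbers already claimed
      n = next;
      used[n] = true;
    }
    return `${f.type} ${f.name} = ${n};`;
  });
}

const fields: Field[] = [
  { name: "user_id", type: "string", annotations: { "protobuf.field_number": "1" } },
  { name: "display_name", type: "string", annotations: { "protobuf.field_number": "5" } },
  { name: "nickname", type: "string", annotations: {} },
];

const fresh = emitFields(fields);
const preserved = emitFieldsPreservingNumbers(fields);
```

With these inputs the default emitter renumbers `display_name` to 2, while the wrapper keeps it at 5 and gives the unannotated `nickname` the first free number, 2.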

11.6 Comparison with pandoc

The pandoc analogy is worth examining in detail, because the structural parallel is exact in some places and breaks down in others.

What’s the same. Both pandoc and panproto translate between multiple formats using a shared intermediate representation. pandoc’s IR is a document abstract syntax tree. panproto’s IR is the Schema graph \(G\). In both systems, all translation goes through the IR; neither system converts directly between pairs of formats.

Both face the same fundamental limitation: information that can’t be represented in the IR is lost. pandoc can’t round-trip Word documents with custom styles and tracked changes. panproto can’t round-trip SQL schemas with arbitrary CHECK expressions.

What’s different. pandoc’s IR is a fixed, hand-designed document AST. Adding a new element type requires changing pandoc’s source code. panproto’s IR isn’t fixed; its structure is determined by the protocol theories. Adding a new structural concept means extending a theory via composition (Chapter 9), not patching the IR definition.

The second difference is in the use of the IR. In pandoc, the IR exists solely for conversion. In panproto, the Schema graph is the primary representation for all operations: diff, migration, existence checking, lens derivation, breaking-change detection, and translation all operate on the same object.

Table 11.2: Structural comparison between pandoc and panproto.
| | pandoc | panproto |
| --- | --- | --- |
| Translates between | Document formats | Schema formats |
| Intermediate representation | Pandoc document AST | Schema graph \(G\) |
| IR defined by | Fixed Haskell data types | Protocol theories (extensible) |
| IR used for | Conversion only | Conversion, diff, migration, lenses, … |
| Loss mechanism | AST has no element for \(X\) | Schema graph has no vertex/edge for \(X\) |
| Format-specific extensions | Custom AST metadata | Theory-specific constraint annotations |

Exercise: IR extensibility tradeoff

pandoc’s fixed IR means every format author targets the same stable structure. panproto’s extensible IR means theories can grow. If protocol A extends the IR with a new sort via composition, does protocol B (written before that extension) need to be updated? What guarantees backward compatibility of the shared IR?

No. Protocol B targets the building-block theories it was composed from. A new sort added by protocol A via composition lives in A’s theory, not in the shared building blocks. The shared IR (the building-block theories) is stable; extensions happen in individual protocol theories. Backward compatibility of the shared IR is guaranteed by the colimit construction: adding new sorts to a component theory does not alter the existing sorts or operations in the shared base.

11.7 Further reading

The functorial data migration framework underlying cross-protocol translation is due to Spivak (2012). The theory group structure is a consequence of the compositional protocol definitions described in Lynch et al. (2024). The connection between theory morphisms and information-preserving translations is developed in Cartmell (1986).

Cross-protocol translation works because panproto gives every schema element a structural identity. That identity is protocol-relative: two elements in different protocols may refer to “the same thing” under different names. The next chapter formalizes how naming correspondences are established, tracked, and composed.

Cartmell, John. 1986. “Generalised Algebraic Theories and Contextual Categories.” Annals of Pure and Applied Logic 32: 209–43. https://doi.org/10.1016/0168-0072(86)90053-9.
Lynch, Owen, Kris Brown, James Fairbanks, and Evan Patterson. 2024. “GATlab: Modeling and Programming with Generalized Algebraic Theories.” arXiv preprint arXiv:2404.04837. https://doi.org/10.48550/arXiv.2404.04837.
Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217: 31–51. https://arxiv.org/abs/1009.1166.