arxiv: 2604.09591 · v1 · submitted 2026-03-04 · 💻 cs.DC · cs.PF · cs.PL

Simplicity Scales

Pith reviewed 2026-05-15 17:14 UTC · model grok-4.3

classification 💻 cs.DC · cs.PF · cs.PL
keywords serialization format · fixed-size encoding · decoding speed · RPC protocol · Protocol Buffers comparison · memory bandwidth · CPU pipeline stalls · batch pipelining

The pith

Bebop's fixed-size encoding turns every decode into a single unconditional memory read, decoding 9 to 213 times faster than Protocol Buffers across the paper's 19 workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bebop, a data serialization format that assigns every type a fixed byte width instead of using variable-length encodings. A 32-bit integer is always exactly four bytes, and decoding reduces to reading that block from memory without inspecting continuation bits or parsing strings. This design eliminates data-dependent branches that stall CPU pipelines in formats like Protocol Buffers and JSON. Measurements across 19 workloads show speedups from 9 to 213 times, with a 1536-dimensional vector decoding in 2.8 nanoseconds compared to 111 nanoseconds for Protocol Buffers. The same wire format also supports an RPC protocol that pipelines dependent calls across services in one round trip over standard transports.
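
To make the contrast concrete, here is a minimal sketch of the two decode paths for a single integer. The varint routine follows the standard base-128 scheme used by Protocol Buffers; the fixed-width routine assumes a little-endian 4-byte layout, which is an assumption of this sketch rather than something stated above. The function names are hypothetical. Python cannot show pipeline stalls, so the example only illustrates the difference in control flow.

```python
import struct


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a base-128 varint; returns (value, next_pos).

    The loop exits only when a byte with a clear continuation bit is
    seen, so the number of iterations depends on the data itself.
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # continuation bit clear: last byte
            return result, pos
        shift += 7


def decode_fixed_u32(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Read a 4-byte unsigned integer (little-endian is assumed here).

    One read at a known offset; the next offset is always pos + 4.
    """
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4


varint_bytes = bytes([0xAC, 0x02])    # 300 encoded as a varint
fixed_bytes = struct.pack("<I", 300)  # 300 encoded in four bytes

varint_result = decode_varint(varint_bytes)
fixed_result = decode_fixed_u32(fixed_bytes)
```

Both calls recover the value 300. The varint path needs two loop iterations and a data-dependent exit, while the fixed-width path performs a single read whose size is known before any byte is inspected.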

Core claim

Bebop encodes every data type using a fixed number of bytes so that decoding requires only a direct memory load with no conditionals. Across 19 decode workloads this produces speedups of 9 to 213 times over Protocol Buffers, while a 1536-dimension embedding vector decodes in 2.8 nanoseconds versus 111 nanoseconds for Protocol Buffers. On records larger than 64 KB the decoder reaches 86 percent of peak memory bandwidth, showing that the CPU is no longer the limiting factor.
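
The embedding-vector case can be sketched in the same spirit. The layout below, a 4-byte count followed by tightly packed little-endian 32-bit floats, is an assumption borrowed from the simulated rebuttal's description of repeated fields; the paper's actual layout may differ, and the function names are hypothetical.

```python
import struct


def encode_f32_vector(values: list[float]) -> bytes:
    """Assumed layout: 4-byte count, then count * 4 bytes of float32."""
    return struct.pack("<I", len(values)) + struct.pack(f"<{len(values)}f", *values)


def decode_f32_vector(buf: bytes, pos: int = 0) -> tuple[list[float], int]:
    """Read the count, then decode the whole payload in one bulk call.

    No per-element inspection is needed because every element has the
    same width, so the payload size is simply 4 * count.
    """
    (count,) = struct.unpack_from("<I", buf, pos)
    values = struct.unpack_from(f"<{count}f", buf, pos + 4)
    return list(values), pos + 4 + 4 * count


# Multiples of 0.5 below 768 are exactly representable as float32,
# so the round trip is lossless for this test vector.
vec = [i * 0.5 for i in range(1536)]
wire = encode_f32_vector(vec)
decoded, end = decode_f32_vector(wire)
```

Under this assumed layout a 1536-dimension vector occupies 4 + 1536 × 4 = 6148 bytes on the wire, and the decoder can hand the payload to a single bulk read.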

What carries the argument

The fixed-byte-width encoding scheme, in which each primitive type and field occupies a predetermined byte count without variable prefixes or tags that require inspection.

Load-bearing premise

Fixed-size encoding stays practical for the range of data types and values encountered in real applications without causing excessive message bloat.

What would settle it

Run the same benchmarks on workloads dominated by small integers or sparse data and check whether Bebop still shows large speedups or if padding costs erase the advantage.
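
The size side of that experiment can be estimated before running any benchmark. The sketch below counts integer payload bytes only, comparing base-128 varints against a fixed 4-byte width; it ignores Protocol Buffers field tags and any framing either format adds, and the helper names are hypothetical.

```python
FIXED_U32_SIZE = 4  # bytes per integer under a fixed-width encoding


def varint_size(value: int) -> int:
    """Number of bytes a non-negative integer needs as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def size_comparison(values: list[int]) -> tuple[int, int]:
    """Return (varint_total_bytes, fixed_total_bytes) for the payload."""
    varint_total = sum(varint_size(v) for v in values)
    fixed_total = FIXED_U32_SIZE * len(values)
    return varint_total, fixed_total


small_values = list(range(100))  # every value fits in one varint byte
large_values = [2**31] * 10      # every value needs five varint bytes

small_sizes = size_comparison(small_values)
large_sizes = size_comparison(large_values)
```

For the small-integer workload the fixed-width payload is four times larger (400 bytes against 100), while for values near the top of the 32-bit range it is smaller (40 bytes against 50). Whether the extra bytes offset the decode advantage is the open question the benchmark would need to answer.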

Figures

Figures reproduced from arXiv: 2604.09591 by Andrew Sampson (6OVER3 Institute), Ronny Chan (6OVER3 Institute), Yuta Saito (GoodNotes).

Figure 1: Wire size (solid lines, left axis) vs decode latency (dashed lines, right axis). Varint …
Figure 2: Wire encoding of a small embedding. Bebop’s native UUID saves 20 bytes versus …
Figure 3: Bandwidth utilization vs record size. Larger records amortize per-record overhead …
Figure 4: Encode vs decode latency across binary formats. Bebop’s decode advantage (dark …
Original abstract

The dominant data interchange formats encode integers using a variable number of bytes or represent floating-point numbers as variable-length UTF-8 strings. The decoder must inspect each byte for a continuation bit or parse each character individually, producing data-dependent branches that stall modern CPU pipelines. Protocol Buffers pays this cost on every integer, field tag, and length prefix. JSON pays it on every value. We present Bebop, a serialization format where every data type uses a fixed number of bytes. A 32-bit integer is always four bytes. Decoding becomes a single memory read with no conditionals. Across 19 decode workloads, Bebop decodes 9-213× faster than Protocol Buffers. On a 1536-dimension embedding vector, Bebop decodes in 2.8 nanoseconds versus 111 nanoseconds for Protocol Buffers and 4.69 microseconds for simdjson, a 1,675× gap. On records above 64 KB, the decoder achieves 86% of peak memory bandwidth. The CPU is no longer the bottleneck. We also present a transport-agnostic RPC protocol built on the same wire format. The protocol introduces batch pipelining, where dependent cross-service calls execute in a single round trip with server-side dependency resolution. It deploys over HTTP/1.1, HTTP/2, and binary transports without proxies, removing the HTTP/2 requirement that limits gRPC on serverless platforms and in browsers.
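
The abstract does not spell out how batch pipelining is expressed on the wire, so the following is only a sketch of the general idea: a client sends several calls in one request, a later call may name an earlier call as the source of its argument, and the server resolves those dependencies locally. Every name here (`Call`, `input_from`, `execute_batch`, the handler names) is hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Call:
    method: str
    arg: Any = None
    # Index of an earlier call in the same batch whose result becomes
    # this call's argument (hypothetical dependency notation).
    input_from: Optional[int] = None


def execute_batch(handlers: dict[str, Callable[[Any], Any]], calls: list[Call]) -> list[Any]:
    """Run every call in order on the server, feeding results forward."""
    results: list[Any] = []
    for index, call in enumerate(calls):
        if call.input_from is not None:
            if not 0 <= call.input_from < index:
                raise ValueError("a call may only depend on an earlier call")
            arg = results[call.input_from]
        else:
            arg = call.arg
        results.append(handlers[call.method](arg))
    return results


# Toy handlers standing in for two separate services.
handlers = {
    "get_user_id": lambda name: {"alice": 7}[name],
    "get_orders": lambda user_id: [f"order-{user_id}-1", f"order-{user_id}-2"],
}

batch = [
    Call("get_user_id", arg="alice"),
    Call("get_orders", input_from=0),  # depends on the first call's result
]
results = execute_batch(handlers, batch)
```

Without batching, the client would need one round trip to learn the user ID and a second to fetch the orders. Here the dependent call travels in the same request, and only the server ever sees the intermediate value.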

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Bebop, a serialization format in which every data type is encoded using a fixed number of bytes, enabling branch-free decoding via single memory reads. It reports substantial speedups over Protocol Buffers (9-213× on 19 decode workloads) and simdjson (1675× on a 1536-dimension embedding vector), with the decoder reaching 86% of peak memory bandwidth on large records. The paper also describes a batch-pipelining RPC protocol built on this format that supports dependent cross-service calls in one round trip over HTTP/1.1, HTTP/2, and binary transports.
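
The headline ratios for the embedding-vector workload can be re-derived from the quoted latencies alone. The short check below uses only the three numbers reported in the abstract; the variable names are illustrative.

```python
# Latencies quoted for the 1536-dimension embedding vector.
bebop_ns = 2.8
protobuf_ns = 111.0
simdjson_ns = 4.69 * 1000  # 4.69 microseconds expressed in nanoseconds

protobuf_ratio = protobuf_ns / bebop_ns  # roughly 39.6
simdjson_ratio = simdjson_ns / bebop_ns  # roughly 1675

print(f"Protocol Buffers / Bebop: {protobuf_ratio:.1f}x")
print(f"simdjson / Bebop: {simdjson_ratio:.0f}x")
```

The simdjson ratio matches the stated 1,675× gap, and the Protocol Buffers ratio of about 40× falls inside the reported 9-213× range, so the quoted figures are at least internally consistent.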

Significance. If the empirical results are robust, Bebop could eliminate serialization as a performance bottleneck in data-intensive distributed systems, particularly for embedding vectors and high-volume RPCs. The fixed-size approach is a clean departure from variable-length encodings like varints in Protocol Buffers. The RPC extension addresses practical deployment constraints in serverless and browser environments. Strengths include the direct timing measurements and the bandwidth utilization result.

major comments (2)
  1. [Abstract] Abstract, paragraph 2: The reported speedups (9-213× vs Protocol Buffers across 19 workloads, 2.8 ns vs 111 ns for the 1536-dimension embedding vector) are presented without any description of the workloads, measurement methodology, hardware platform, error bars, or raw data. This makes it impossible to evaluate reproducibility or rule out post-hoc selection and measurement artifacts.
  2. [Abstract] Abstract, paragraph 1: The claim that 'every data type uses a fixed number of bytes' and that decoding reduces to 'a single memory read with no conditionals' is load-bearing for the performance results. The manuscript must explicitly show how variable-length types (strings, bytes, repeated fields) are handled without reintroducing data-dependent branches or unacceptable padding; otherwise the no-branch property does not survive for typical production messages.
minor comments (2)
  1. [Abstract] Abstract: Clarify whether the simdjson comparison (4.69 microseconds) uses the identical embedding-vector workload and optimal configuration.
  2. The manuscript should define all acronyms on first use (e.g., RPC) and provide a brief overview of the 19 workloads in the evaluation section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important issues of reproducibility and clarity around the core claims. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] Abstract, paragraph 2: The reported speedups (9-213× vs Protocol Buffers across 19 workloads, 2.8 ns vs 111 ns for the 1536-dimension embedding vector) are presented without any description of the workloads, measurement methodology, hardware platform, error bars, or raw data. This makes it impossible to evaluate reproducibility or rule out post-hoc selection and measurement artifacts.

    Authors: We agree that the abstract would benefit from additional context to support reproducibility. In the revised version we will add a concise clause describing the 19 workloads at a high level (integer arrays, embedding vectors, and nested messages) and explicitly direct readers to the evaluation section for the full methodology, hardware platform, timing approach, error bars on all figures, and availability of raw data. This keeps the abstract within length limits while addressing the concern directly.
    revision: yes

  2. Referee: [Abstract] Abstract, paragraph 1: The claim that 'every data type uses a fixed number of bytes' and that decoding reduces to 'a single memory read with no conditionals' is load-bearing for the performance results. The manuscript must explicitly show how variable-length types (strings, bytes, repeated fields) are handled without reintroducing data-dependent branches or unacceptable padding; otherwise the no-branch property does not survive for typical production messages.

    Authors: The manuscript already describes the encoding in Section 3, but we accept that the abstract claim requires an explicit supporting explanation for variable-length types. Bebop encodes strings, bytes, and repeated fields using a fixed-size (4-byte) length or count prefix followed immediately by the payload; the decoder issues an unconditional 32-bit read for the prefix and then computes the payload address from that value. Because the prefix is read as a single integer rather than inspected byte-by-byte, no data-dependent branches appear in the hot path. We will add a dedicated paragraph plus a small diagram in the revised Section 3 that walks through the string, bytes, and repeated-field cases, confirming that padding is limited to natural alignment and does not affect the reported speedups or bandwidth utilization.
    revision: yes
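
The prefix scheme described in this response can be sketched as follows. The 4-byte length and count prefixes come from the rebuttal text; the little-endian byte order, the UTF-8 string encoding, and all function names are assumptions of this sketch, and alignment padding is left out.

```python
import struct


def encode_string(text: str) -> bytes:
    """4-byte byte-length prefix followed by the UTF-8 payload."""
    data = text.encode("utf-8")
    return struct.pack("<I", len(data)) + data


def decode_string(buf: bytes, pos: int) -> tuple[str, int]:
    """Read the prefix as one integer, then slice the payload."""
    (length,) = struct.unpack_from("<I", buf, pos)
    start = pos + 4
    end = start + length  # payload address computed from the prefix
    return buf[start:end].decode("utf-8"), end


def encode_u32_list(items: list[int]) -> bytes:
    """4-byte element count followed by fixed-width elements."""
    return struct.pack("<I", len(items)) + struct.pack(f"<{len(items)}I", *items)


def decode_u32_list(buf: bytes, pos: int) -> tuple[list[int], int]:
    (count,) = struct.unpack_from("<I", buf, pos)
    items = struct.unpack_from(f"<{count}I", buf, pos + 4)
    return list(items), pos + 4 + 4 * count


# A toy record: one string field followed by one repeated u32 field.
record = encode_string("héllo") + encode_u32_list([1, 2, 3])
name, offset = decode_string(record, 0)
numbers, offset = decode_u32_list(record, offset)
```

The prefix is consumed with a single integer read rather than a byte-by-byte scan, which is the property the response relies on. This sketch omits validation of the length against the buffer size, which a production decoder would also need.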

Circularity Check

0 steps flagged

No circularity: performance claims are direct empirical measurements, not derived quantities

full rationale

The paper introduces a fixed-size serialization format (Bebop) and reports measured decode latencies across 19 workloads plus an embedding-vector microbenchmark. No equations, fitted parameters, or derivations appear in the abstract or described claims. The central results (9–213× speedups, 2.8 ns vs. 111 ns) are presented as observed timings rather than quantities defined in terms of themselves or obtained by fitting to the same data. No self-citations are invoked as load-bearing uniqueness theorems, and the design choice of fixed byte widths is stated directly rather than smuggled via prior work. The skeptic concern about variable-length fields is a question of engineering practicality and workload representativeness, not a circularity in any derivation chain. The paper is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the introduction of the Bebop wire format and the empirical timing results; no free parameters are mentioned. The only explicit background assumption is the well-known behavior of CPU pipelines on data-dependent branches.

axioms (1)
  • [domain assumption] Modern CPU pipelines stall on data-dependent branches during decoding.
    Invoked to explain why variable-length formats are slow.
invented entities (2)
  • Bebop serialization format (no independent evidence)
    purpose: Fixed-byte encoding that removes all continuation bits and conditionals
    New format introduced by the paper.
  • batch pipelining RPC protocol (no independent evidence)
    purpose: Server-side resolution of dependent cross-service calls in one round trip
    New protocol feature built on the Bebop wire format.

pith-pipeline@v0.9.0 · 5575 in / 1316 out tokens · 67648 ms · 2026-05-15T17:14:44.678275+00:00 · methodology


