
arxiv: 2605.17076 · v2 · pith:2E2PDC3Inew · submitted 2026-05-16 · 💻 cs.LG · cs.AI · cs.DC · cs.MA

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

Pith reviewed 2026-05-25 05:57 UTC · model grok-4.3

classification: cs.LG, cs.AI, cs.DC, cs.MA
keywords: concurrency control, LLM agents, read set reconstruction, observable read isolation, multi-agent coordination, HTTP middleware, formal verification, state sharing

The pith

S-Bus uses a server-side DeliveryLog to reconstruct each agent's read set from HTTP GET traffic, enabling Observable-Read Isolation that prevents structural race conditions in dedicated-shard multi-agent LLM setups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces S-Bus, an HTTP middleware for LLM agents that share mutable state but cannot be altered to declare their read sets in advance. Its central DeliveryLog mechanism rebuilds each agent's read set at commit time solely from observed GET requests. This supports Observable-Read Isolation, a partial causal consistency over the HTTP-visible read projection, which blocks structural races in dedicated-shard topologies. The work supplies three-tier formal evidence plus empirical runs showing safety parity with PostgreSQL SERIALIZABLE and Redis WATCH/MULTI, while noting that the property is neutral in dedicated-shard workloads but can propagate contradictions in single-shard collaborative writing.

Core claim

S-Bus's central claim is that the DeliveryLog achieves ReadSetSoundness and ORICommitSafety. The evidence offered is threefold: TLAPS proofs (modulo one typing axiom), exhaustive TLC exploration of 20,763,484 states at N=3 with zero violations, and Dafny discharge of nine inductive lemmas. Empirically, 884,110 commit attempts (427,308 of them under active contention) produced zero Type-I corruptions, matching the safety of PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI.

What carries the argument

The DeliveryLog, a server-side log that reconstructs each agent's read set from observed HTTP GET traffic to enforce Observable-Read Isolation (ORI) at commit time.
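The idea of rebuilding a read set from observed GETs and validating it at commit time can be illustrated with a toy in-memory model. This is a minimal sketch under stated assumptions, not the S-Bus implementation: the class name `ShardStore`, the per-shard integer versions, and the rule "reject the commit if any shard the agent read has since changed" are all hypothetical choices made for illustration, and the paper's actual ORI check may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ShardStore:
    """Hypothetical in-memory stand-in for a server with a delivery log."""

    # shards[key] = (version, value); a missing key counts as version 0
    shards: dict = field(default_factory=dict)
    # delivery_log[agent][key] = version last delivered to that agent by a GET
    delivery_log: dict = field(default_factory=dict)

    def get(self, agent: str, key: str):
        """Serve a read and record which version this agent observed."""
        version, value = self.shards.get(key, (0, None))
        self.delivery_log.setdefault(agent, {})[key] = version
        return version, value

    def commit(self, agent: str, key: str, value) -> bool:
        """Accept the write only if every shard the agent read is unchanged."""
        read_set = self.delivery_log.get(agent, {})
        for read_key, seen_version in read_set.items():
            current_version, _ = self.shards.get(read_key, (0, None))
            if current_version != seen_version:
                return False  # stale read: the agent must re-read and retry
        current_version, _ = self.shards.get(key, (0, None))
        self.shards[key] = (current_version + 1, value)
        self.delivery_log[agent] = {}  # assumption: a commit consumes the read set
        return True


store = ShardStore()
store.commit("agent_a", "plan", "v1")        # no prior reads, so it is accepted
store.get("agent_a", "plan")
store.get("agent_b", "plan")
first = store.commit("agent_a", "plan", "v2")   # agent_a's read is still current
second = store.commit("agent_b", "plan", "v3")  # agent_b read a now-stale version
store.get("agent_b", "plan")                    # re-read refreshes the logged version
retry = store.commit("agent_b", "plan", "v3")
print(first, second, retry)
```

The agents never declare what they read; the server infers it from the GETs it served, which is the property the paper's premise depends on.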

If this is right

  • S-Bus matches the safety of established database isolation levels without requiring changes to agent code.
  • ORI blocks Type-I corruptions in dedicated-shard multi-agent LLM state sharing.
  • ORI stays semantically neutral in dedicated-shard workloads but propagates concurrent contradictions in single-shard collaborative writing.
  • LLM-judge validation against a human annotator reaches a strict kappa of 0.93 (n=93, 96.8% raw agreement) on (step, shard) usage pairs.
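For readers unfamiliar with the kappa statistic quoted above, the following sketch computes Cohen's kappa for two annotators from scratch. It assumes the paper's "strict kappa" is a Cohen-style chance-corrected agreement; the labels and data below are invented toy values, not the paper's annotation data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently at random
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): did the agent actually use the shard at this step?
judge = ["used", "used", "unused", "unused"]
human = ["used", "used", "unused", "used"]
print(cohen_kappa(judge, human))  # 0.5: raw agreement 0.75, chance agreement 0.5
```

Raw agreement and kappa can diverge sharply when one label dominates, which is why the page reports both figures.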

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If read-set reconstruction works from HTTP observables alone, analogous mechanisms could be tested on other protocols such as WebSockets without altering agents.
  • The performance difference between dedicated-shard and single-shard cases suggests experiments that dynamically switch topologies mid-workload to measure when ORI becomes harmful.
  • That agent self-reports over-claim shard usage by 32-49 percent suggests objective traffic-based monitoring may be needed even when agents attempt to self-declare their behavior.

Load-bearing premise

Read sets can be accurately and completely reconstructed from observed HTTP GET traffic without missing or misinterpreting requests, and the dedicated-shard topology assumption holds for the workloads where ORI is beneficial.

What would settle it

Observing even one Type-I corruption across the 884,110 commit attempts or discovering a state violation when TLC is rerun at N greater than 3.

Original abstract

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection -- prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces S-Bus, an HTTP middleware for concurrency control among unmodified LLM agents sharing mutable state. Its core mechanism is a server-side DeliveryLog that reconstructs each agent's read set at commit time solely from observed HTTP GET traffic, thereby providing Observable-Read Isolation (ORI)—a partial causal consistency over the HTTP-observable read projection—that prevents Structural Race Conditions in dedicated-shard topologies. Contributions include three-tier mechanized evidence (TLAPS proofs of ReadSetSoundness and ORICommitSafety modulo one typing axiom, exhaustive TLC model checking at N=3 with 20M+ states and zero violations, and Dafny lemmas), empirical parity with PostgreSQL SERIALIZABLE and Redis WATCH/MULTI (zero Type-I corruptions in 884k commit attempts), and analysis showing ORI is neutral in dedicated-shard workloads but harmful in single-shard collaborative writing. v2 adds human-validated LLM-judge metrics.

Significance. If the read-set reconstruction is reliable, the work provides a practical, agent-transparent solution for multi-agent LLM state coordination with unusually strong formal support (machine-checked proofs plus model checking) and large-scale empirical validation. The combination of TLAPS/TLC/Dafny evidence and zero-corruption results against established baselines is a notable strength for a systems-oriented claim in this area.

major comments (3)
  1. [C1 / DeliveryLog mechanism] The central claim that the DeliveryLog reconstructs read sets completely and unambiguously from observed HTTP GET traffic (to establish ORICommitSafety) is load-bearing, yet the manuscript provides no analysis or evidence addressing realistic HTTP behaviors such as caching, conditional requests (If-None-Match), redirects, authentication headers, or dynamic query parameters. If any read is missed or mis-mapped, the reconstruction premise fails even if the idealized TLAPS/TLC/Dafny models hold.
  2. [C1 / TLAPS proofs] TLAPS proofs of ReadSetSoundness and ORICommitSafety are stated to hold only modulo one typing axiom; this axiom is not discharged or justified in the provided evidence, leaving a gap in the mechanized guarantee for the core safety property.
  3. [C2 / Empirical safety parity] The empirical evaluation reports zero Type-I corruptions across 884,110 commit attempts (including 427k under contention), but the test harness description does not indicate coverage of the HTTP edge cases (caching, redirects, non-standard clients) that could violate the reconstruction assumption in practice.
minor comments (2)
  1. [v2 update] The v2 update reports inter-LLM-judge agreement at kappa=0.46 and agent self-report over-claiming at 32-49%; clarifying how these metrics affect the reliability of the PH-3 judge for the main experiments would improve transparency.
  2. [Abstract / Introduction] Notation for 'Structural Race Conditions' and the precise definition of the 'HTTP-observable read projection' could be introduced earlier with a small example to aid readers unfamiliar with the LLM-agent setting.
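The first major comment can be made concrete with a toy simulation of one of the HTTP behaviors it lists: a client-side cache. This is a hypothetical sketch, not taken from the paper or its harness; `LoggingServer` and `CachingClient` are invented names, and real HTTP caches follow more elaborate freshness and validation rules. It shows only that a read answered from a local cache never reaches the server, so a server-side log of GETs cannot see it.

```python
class LoggingServer:
    """Hypothetical server that records every GET it actually receives."""

    def __init__(self):
        self.data = {"plan": "draft-1"}
        self.get_log = []  # (agent, key) pairs observed on the wire

    def handle_get(self, agent, key):
        self.get_log.append((agent, key))
        return self.data[key]


class CachingClient:
    """Hypothetical agent-side cache that answers repeat reads locally."""

    def __init__(self, agent, server):
        self.agent = agent
        self.server = server
        self.cache = {}

    def read(self, key):
        if key in self.cache:
            return self.cache[key]  # no request leaves the client
        value = self.server.handle_get(self.agent, key)
        self.cache[key] = value
        return value


server = LoggingServer()
client = CachingClient("agent_a", server)

first_read = client.read("plan")      # reaches the server and is logged
server.data["plan"] = "draft-2"       # another agent updates the shard
second_read = client.read("plan")     # served from cache, invisible to the log

print(first_read, second_read, len(server.get_log))
```

In this toy setup the agent has consumed state twice but the server has evidence of only one read, and the second read returned a stale value. Whether and how S-Bus rules this out (for example via cache-control directives, as the simulated rebuttal suggests) is not established by the material on this page.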

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on the robustness of the read-set reconstruction. We address each major comment point by point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [C1 / DeliveryLog mechanism] The central claim that the DeliveryLog reconstructs read sets completely and unambiguously from observed HTTP GET traffic (to establish ORICommitSafety) is load-bearing, yet the manuscript provides no analysis or evidence addressing realistic HTTP behaviors such as caching, conditional requests (If-None-Match), redirects, authentication headers, or dynamic query parameters. If any read is missed or mis-mapped, the reconstruction premise fails even if the idealized TLAPS/TLC/Dafny models hold.

    Authors: We agree that the manuscript lacks explicit analysis of these HTTP behaviors. The formal models and DeliveryLog focus on direct, observable GET requests in controlled dedicated-shard environments; caching, conditional requests, and redirects are outside the threat model. We will add a dedicated limitations subsection in the revision discussing these assumptions, failure modes, and practical mitigations such as cache-control directives and direct shard connections. revision: yes

  2. Referee: [C1 / TLAPS proofs] TLAPS proofs of ReadSetSoundness and ORICommitSafety are stated to hold only modulo one typing axiom; this axiom is not discharged or justified in the provided evidence, leaving a gap in the mechanized guarantee for the core safety property.

    Authors: The typing axiom restricts the model to well-formed messages per the protocol specification, which is a standard TLA+ modeling choice to focus proofs on invariants. It is justified by the middleware implementation enforcing formats, but we acknowledge it is not discharged. We will revise the formal-methods section to provide a fuller justification of the axiom and note the remaining modeling gap. revision: partial

  3. Referee: [C2 / Empirical safety parity] The empirical evaluation reports zero Type-I corruptions across 884,110 commit attempts (including 427k under contention), but the test harness description does not indicate coverage of the HTTP edge cases (caching, redirects, non-standard clients) that could violate the reconstruction assumption in practice.

    Authors: The harness generates standard HTTP GET traffic matching the LLM-agent patterns in the evaluated workloads. We agree that the description should explicitly note the absence of caching and similar edge cases. We will update the evaluation section in the revision to clarify these harness assumptions and their alignment with the dedicated-shard deployment model. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on mechanized proofs and external system comparisons

Full rationale

The paper's core contributions (DeliveryLog reconstruction, ORI property, ReadSetSoundness and ORICommitSafety) are supported by independent mechanized verification in TLAPS (modulo one typing axiom), exhaustive TLC model checking, and Dafny lemmas, plus direct empirical parity testing against PostgreSQL SERIALIZABLE and Redis WATCH/MULTI. No load-bearing self-citations appear, no parameters are fitted then relabeled as predictions, and no derivation reduces by construction to its own inputs or prior author work. The reconstruction mechanism is presented as a novel engineering artifact whose safety is externally verified rather than tautologically assumed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces DeliveryLog and ORI as new constructs; relies on one typing axiom; no free parameters mentioned in abstract.

axioms (1)
  • One typing axiom in TLAPS proof (ad hoc to paper)
    ReadSetSoundness and ORICommitSafety are stated to hold modulo this axiom.
invented entities (2)
  • DeliveryLog (no independent evidence)
    purpose: Server-side log to reconstruct read sets from HTTP GET traffic
    Central mechanism introduced in the paper.
  • Observable-Read Isolation (ORI) (no independent evidence)
    purpose: Consistency property providing partial causal consistency over HTTP-observable read projection
    New consistency model defined for this system.

pith-pipeline@v0.9.0 · 5873 in / 1395 out tokens · 42528 ms · 2026-05-25T05:57:26.289148+00:00 · methodology


Appendix

The rubric used by both LLM judges in the Exp. PH-3 validation study (§7.9) is reproduced verbatim below. The prompt was frozen before observing inter-judge agreement and was not revised.

You are a strict cod...