arxiv: 2606.10662 · v1 · pith:ZZWPZT3Xnew · submitted 2026-06-09 · 💻 cs.MA · cs.AI

Decentralized Multi-Agent Systems with Shared Context

Pith reviewed 2026-06-27 11:09 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords decentralized multi-agent systems · shared verified context · large language models · test-time scaling · SWE-bench · LongBench · asynchronous coordination

The pith

Decentralized agents using a shared verified context outperform centralized multi-agent systems on software engineering and long-context benchmarks while halving costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that centralized orchestration becomes a bottleneck in multi-agent systems as the number of subtasks grows. DeLM addresses this by letting parallel agents asynchronously claim subtasks from a queue, read accumulated progress from a shared verified context, perform local reasoning, and write compact updates back. The shared context serves as the common substrate so agents can build directly on one another's verified work without routing everything through a central controller. On SWE-bench Verified this produces gains of up to 10.5 percentage points across Avg.@1, Pass@2, and Pass@4 while cutting cost per task by roughly 50 percent. On LongBench-v2 Multi-Doc QA the same setup yields the highest average accuracy across four model families, up to 5.7 points above the strongest baseline.

Core claim

DeLM is a multi-agent framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read the accumulated verified progress, reason locally, and write back compact verified updates. The shared context acts as the common communication substrate, enabling agents to build on one another's progress without routing every update through a central controller.
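
The claim, read, reason, write loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the `verify` placeholder, the gist string format, and the function names `agent` and `run` are all assumptions introduced here, and the `asyncio.sleep(0)` call merely stands in for local reasoning such as a model call.

```python
import asyncio


async def verify(update: str) -> bool:
    # Placeholder verifier (assumption): accept any non-empty update.
    return bool(update.strip())


async def agent(name, queue, context, lock):
    while True:
        try:
            task = queue.get_nowait()      # claim a subtask from the shared queue
        except asyncio.QueueEmpty:
            return                         # nothing left to claim
        snapshot = list(context)           # read the accumulated verified progress
        await asyncio.sleep(0)             # stand-in for local reasoning
        gist = f"{task}: done by {name} after reading {len(snapshot)} gists"
        if await verify(gist):             # only verified updates are admitted
            async with lock:
                context.append(gist)       # write back a compact update
        queue.task_done()


async def run(subtasks, n_agents=3):
    queue = asyncio.Queue()
    for t in subtasks:
        queue.put_nowait(t)
    context = []                           # shared verified context (hypothetical shape)
    lock = asyncio.Lock()
    workers = (agent(f"agent-{i}", queue, context, lock) for i in range(n_agents))
    await asyncio.gather(*workers)
    return context


if __name__ == "__main__":
    for gist in asyncio.run(run(["T1", "T2", "T3", "T4", "T5"])):
        print(gist)
```

No agent waits on a controller here: each one pulls work when it is free and sees whatever gists have been admitted by the time it reads.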

What carries the argument

The shared verified context that functions as an asynchronous common communication substrate for agent updates.

If this is right

  • DeLM records the best scores on SWE-bench Verified for Avg.@1, Pass@2, and Pass@4.
  • Cost per task drops by roughly 50 percent on the same benchmark.
  • Accuracy rises by up to 5.7 percentage points on LongBench-v2 Multi-Doc QA across four frontier model families.
  • Coordination overhead no longer concentrates in a single controller as the number of subtasks increases.
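
For readers unfamiliar with the metrics, Pass@k is commonly computed with the unbiased estimator popularized by code-generation evaluations: given n sampled attempts of which c succeed, Pass@k = 1 - C(n-c, k) / C(n, k). This summary does not state how the paper computes Pass@2 and Pass@4, so the sketch below is an assumption about the metric, not a description of the authors' evaluation code. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds,
    given c successes observed among n sampled attempts."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One success in four attempts.
    print(pass_at_k(4, 1, 1))  # 0.25
    print(pass_at_k(4, 1, 2))  # 0.5
```

Note also that the reported gains are in percentage points (absolute differences in a rate), whereas the cost reduction is a relative change of roughly 50 percent.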

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shared-context pattern could be tested on agent teams that must maintain a single evolving plan over many steps.
  • Verification rules inside the context might be relaxed or strengthened to measure the accuracy-cost trade-off directly.
  • The framework could be run with heterogeneous model sizes where cheaper agents handle simple subtasks and stronger ones handle verification.
  • Task queues with priority or dependency edges could be added to see whether ordering constraints improve or hurt the observed gains.
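
As an illustration of the last extension, a task queue with dependency edges could release subtasks in waves, so that a subtask becomes claimable only after its prerequisites are done. The sketch below uses Python's standard `graphlib.TopologicalSorter`. The task names and the `claim_order` helper are hypothetical, and nothing here is taken from the paper.

```python
from graphlib import TopologicalSorter


def claim_order(deps):
    """Group subtasks into waves that could be claimed in parallel.

    `deps` maps each subtask to the set of subtasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()                       # raises CycleError on cyclic dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready()) # subtasks whose prerequisites are all done
        waves.append(ready)
        sorter.done(*ready)                # mark the wave finished to unlock successors
    return waves


if __name__ == "__main__":
    # T3 needs T1 and T2; T4 needs T3 (hypothetical subtasks).
    print(claim_order({"T3": {"T1", "T2"}, "T4": {"T3"}}))
    # [['T1', 'T2'], ['T3'], ['T4']]
```

Each wave bounds how many agents can work at once, which is one way ordering constraints might reduce the parallelism behind the reported gains.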

Load-bearing premise

The shared verified context can be read and written asynchronously by multiple agents without introducing unresolvable conflicts or unverifiable errors.

What would settle it

A run on SWE-bench Verified in which agents produce conflicting writes to the shared context that cannot be verified or merged, causing DeLM to fall below the strongest centralized baseline on Pass@2 or Pass@4.

Figures

Figures reproduced from arXiv: 2606.10662 by Azalia Mirhoseini, Yuzhen Mao.

Figure 1: Comparison on SWE-bench Verified and LongBench-v2 Multi-Doc QA. Our method, DeLM, achieves the best average performance across both agentic and long-context benchmarks. Multi-agent systems (MAS) offer a natural way to scale large language model reasoning at test time. Instead of relying on a single model invocation to solve a complex task end-to-end, MAS decompose the problem into subtasks, dispatch multip…
Figure 2: Centralized vs. decentralized multi-agent systems. Centralized MAS relies on a main agent to assign subcontexts, spawn sub-agents, and integrate returned results through a synchronous scatter–gather loop, creating a bottleneck where progress is gated by both the central merge step and the slowest worker. In contrast, DeLM decentralizes coordination through parallel agents, a shared context, and a task queu…
Figure 3: Overview of DeLM. A one-time initialization step decomposes the input into initial subtasks and places them in a shared task queue. Parallel agents asynchronously claim tasks (Ti), read the verified shared context, and perform local reasoning. Completed updates are compressed, verified, and admitted as compact gists (Gi), making reusable progress visible to all agents. When the task queue becomes empty, th…
Figure 4: Ablation and robustness analysis on LongBench-v2 Multi-Doc QA. All bars report accuracy averaged over the five domains with GPT-5.4. (a) Modular ablation: removing either the verification step or the hierarchical summary lowers accuracy. (b, c) DeLM is largely insensitive to its gist configuration: accuracy is stable once the gist is long enough (b) and across summarizers of varying cost (c). In every pane…
Figure 5: (a) Long source units are compressed into reference-grounded summaries
Abstract

Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Decentralized Language Models (DeLM), a multi-agent system framework that decentralizes coordination via parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write compact verified updates. The central claim is that this avoids central orchestration bottlenecks and yields empirical gains: on SWE-bench Verified, best performance across Avg.@1, Pass@2, and Pass@4 with up to 10.5 pp improvement and ~50% cost reduction; on LongBench-v2 Multi-Doc QA, highest average accuracy across four model families with up to 5.7 pp gains.

Significance. If the shared-context mechanism reliably supports cumulative verified progress under asynchrony, the work would be significant for scaling test-time multi-agent reasoning without central bottlenecks. The reported benchmark gains and cost savings, together with public code release, would provide a concrete, reproducible baseline for decentralized MAS.

major comments (3)
  1. §3.2 (Shared Verified Context): The description states that agents 'asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates,' but supplies no protocol for versioning, locking, conflict detection, or resolution of overlapping writes. This mechanism is load-bearing for the decentralization claim and for attributing the 10.5 pp and 5.7 pp gains to the architecture rather than to centralized verification.
  2. §4.1 (Experimental Setup): The comparisons on SWE-bench Verified and LongBench-v2 report performance numbers but do not describe whether baselines received equivalent local verification steps, the same task-queue discipline, or identical conflict-handling assumptions. Without these controls it is impossible to isolate the contribution of the decentralized shared context.
  3. §4.3 (Ablation Studies): No ablation isolates the effect of asynchronous shared-context writes versus sequential or centrally mediated updates; the reported gains could therefore be driven by verification volume rather than by the decentralized coordination substrate.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 use 'verified' without an operational definition; a short paragraph clarifying what constitutes a 'verified update' would improve clarity.
  2. Figure 2 (system diagram) would benefit from explicit arrows or labels showing the read/write paths to the shared context and the task queue.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and commit to revisions that strengthen the description of the shared context and the experimental analysis.

Point-by-point responses
  1. Referee: §3.2 (Shared Verified Context): The description states that agents 'asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates,' but supplies no protocol for versioning, locking, conflict detection, or resolution of overlapping writes. This mechanism is load-bearing for the decentralization claim and for attributing the 10.5 pp and 5.7 pp gains to the architecture rather than to centralized verification.

    Authors: We agree that the concurrency protocol requires explicit specification. In the revised manuscript we will expand §3.2 with a formal description of the shared verified context, including version vectors for update tracking, optimistic locking keyed to subtask claims, conflict detection via version mismatch, and resolution by priority merge of verified facts. This addition will clarify how the architecture supports asynchrony and will help attribute the reported gains to the decentralized design rather than to verification alone.
    Revision: yes
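
    To make the proposed protocol concrete, conflict detection via version mismatch can be sketched as a compare-and-append on the shared context. This is a hedged illustration only: the simulated rebuttal names version vectors, whereas the sketch uses a single integer version counter as a simplification, and the class `SharedContext` and helper `write_with_retry` are hypothetical names, not part of DeLM.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class SharedContext:
    version: int = 0
    gists: list = field(default_factory=list)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def read(self):
        # Return the current version together with an immutable snapshot.
        with self._lock:
            return self.version, tuple(self.gists)

    def try_write(self, base_version: int, gist: str) -> bool:
        # Optimistic write: admit the gist only if nothing changed since the read.
        with self._lock:
            if base_version != self.version:
                return False               # version mismatch signals a conflict
            self.gists.append(gist)
            self.version += 1
            return True


def write_with_retry(ctx, make_gist, max_attempts=5) -> bool:
    # On conflict, re-read the newer context and rebuild the gist against it.
    for _ in range(max_attempts):
        version, snapshot = ctx.read()
        if ctx.try_write(version, make_gist(snapshot)):
            return True
    return False


if __name__ == "__main__":
    ctx = SharedContext()
    v, _ = ctx.read()
    print(ctx.try_write(v, "a"))   # True: context unchanged since the read
    print(ctx.try_write(v, "b"))   # False: stale base version is rejected
```

    A rejected write forces the agent to reconsider its update against the newer context, which is the behavior the referee asks the authors to specify.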

  2. Referee: §4.1 (Experimental Setup): The comparisons on SWE-bench Verified and LongBench-v2 report performance numbers but do not describe whether baselines received equivalent local verification steps, the same task-queue discipline, or identical conflict-handling assumptions. Without these controls it is impossible to isolate the contribution of the decentralized shared context.

    Authors: We acknowledge that the experimental setup must document baseline conditions more precisely. The revision will augment §4.1 with explicit statements of the verification steps, task-queue discipline, and conflict-handling rules applied to each baseline. Where original baselines differed, we will note the differences and, where computationally feasible, supply matched re-runs to isolate the contribution of the decentralized shared context.
    Revision: yes

  3. Referee: §4.3 (Ablation Studies): No ablation isolates the effect of asynchronous shared-context writes versus sequential or centrally mediated updates; the reported gains could therefore be driven by verification volume rather than by the decentralized coordination substrate.

    Authors: We agree that an ablation separating asynchrony from verification volume is needed. The revised §4.3 will include a new ablation comparing the full asynchronous DeLM against sequential and centrally mediated variants while controlling for total verification steps. This will provide direct evidence that performance differences arise from the decentralized coordination substrate.
    Revision: yes

Circularity Check

0 steps flagged

No derivation chain present; empirical claims only

The manuscript describes a decentralized MAS framework evaluated via benchmark experiments on SWE-bench Verified and LongBench-v2. No equations, fitted parameters, ansatzes, uniqueness theorems, or derivation steps are referenced in the abstract or described structure. Performance numbers are reported outcomes, not quantities defined in terms of themselves or reduced via self-citation. The central claims therefore rest on external falsifiable measurements rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that asynchronous updates to a shared verified context remain consistent and useful; no free parameters or new physical entities are mentioned.

axioms (1)
  • Domain assumption: agents can reliably verify and incorporate updates from the shared context without introducing errors that break cumulative progress.
    This premise is required for the decentralized coordination to produce the claimed performance gains.
invented entities (1)
  • Shared verified context (no independent evidence)
    Purpose: serves as the common communication substrate enabling agents to build on one another's verified progress without a central controller.
    Core new component of the DeLM framework, introduced to replace centralized orchestration.


