pith. machine review for the scientific record.

arxiv: 2605.02801 · v1 · submitted 2026-05-04 · 💻 cs.CL

Recognition: 3 theorem links · Lean Theorem

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords: reinforcement learning · multi-agent systems · large language models · orchestration traces · credit assignment · reward design · stopping decision · agent coordination

The pith

Reinforcement learning for teams of LLM agents must optimize coordination decisions such as when to stop, yet no method in the paper's surveyed pool trains the stopping decision explicitly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames reinforcement learning for multi-agent systems of large language models around orchestration traces, which are temporal graphs of events including spawning agents, delegating tasks, communicating, using tools, aggregating results, and stopping. It organizes the technical challenges into three axes: reward design across eight families that can include orchestration-specific signals, attachment of credit signals to units from tokens to teams, and decomposition of orchestration into five sub-decisions. The review finds that while the first four sub-decisions have some RL attention, the stopping decision has none in the surveyed work. This gap matters because without it, agent teams may continue computing after goals are met or fail to know when to conclude, limiting reliable scaling of coordinated LLM systems.

Core claim

Orchestration traces are temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Reward design spans eight families including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Reward and credit signals attach to eight units from token to team, with explicit counterfactual message-level credit sparse. Orchestration learning decomposes into five sub-decisions, and no explicit RL training method for the stopping decision appears in the 84-entry curated pool as of May 2026.

What carries the argument

Orchestration traces: temporal interaction graphs that record events of spawning, delegating, communicating, aggregating, and stopping in multi-agent LLM systems.
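
A minimal sketch of what one replayable trace event could look like under this definition, written in Python. The field names are illustrative assumptions, not the JSON schema the authors actually release with their artifact.

```python
# Hypothetical event record for a replayable orchestration trace.
# Field names are editorial guesses, not the paper's released schema.
from dataclasses import dataclass, field
from typing import Literal, Optional

EventKind = Literal["spawn", "delegate", "communicate", "tool_use",
                    "return", "aggregate", "stop"]

@dataclass
class TraceEvent:
    event_id: str
    kind: EventKind
    timestamp: float                      # wall-clock or logical time
    actor: str                            # agent that emitted the event
    target: Optional[str] = None          # receiving agent, if any
    payload: dict = field(default_factory=dict)        # message text, tool args
    parents: list[str] = field(default_factory=list)   # incoming graph edges

@dataclass
class OrchestrationTrace:
    task_id: str
    events: list[TraceEvent] = field(default_factory=list)

    def stopped_explicitly(self) -> bool:
        """True if some agent emitted a stop event, rather than the trace
        ending on a step cap or an external ground-truth signal."""
        return any(e.kind == "stop" for e in self.events)
```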

If this is right

  • Reward design can incorporate eight families that reward parallelism speedup, task split correctness, and aggregation quality.
  • Credit signals attach to eight different units ranging from single tokens to full teams, though message-level counterfactual credit remains rare.
  • Orchestration decomposes into five sub-decisions of spawn, delegate, communicate, aggregate, and stop, with the last one lacking RL methods; a minimal sketch of what explicit training could look like follows this list.
  • Industrial deployments such as agent swarms operate at larger scales than open academic evaluations, revealing a verification gap.
  • A released JSON schema for replayable traces and the tagged paper pool enable systematic study of coordination learning.
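
Because the paper reports that no surveyed method trains the stopping decision, the following is strictly an editorial sketch of what such training could look like: a toy REINFORCE loop over a discrete orchestrator action space that includes stop, where stopping pays off only once the goal is met and every extra step costs. The network shape, reward scale, and step cost are all invented for illustration.

```python
# Editorial sketch: explicit RL on the stop decision (sub-decision O5).
# Not a method from the surveyed pool; all hyperparameters are placeholders.
import torch
import torch.nn as nn

ACTIONS = ["spawn", "delegate", "communicate", "aggregate", "stop"]

class OrchestratorHead(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, len(ACTIONS)))

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce_update(head, optimizer, episode, step_cost=0.01, gamma=0.99):
    """episode: list of (state, action_index, goal_met) triples.
    Stopping earns +1 only once the goal is met (-1 if premature), and each
    step costs, so the return pressures the policy to learn *when* to stop."""
    rewards = []
    for _, action, goal_met in episode:
        r = -step_cost
        if action == ACTIONS.index("stop"):
            r += 1.0 if goal_met else -1.0
        rewards.append(r)
    returns, g = [], 0.0
    for r in reversed(rewards):           # discounted returns-to-go
        g = r + gamma * g
        returns.insert(0, g)
    loss = torch.tensor(0.0)
    for (state, action, _), g in zip(episode, returns):
        loss = loss - head(state).log_prob(torch.tensor(action)) * g
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```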

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training the stopping decision would let agent teams end work once goals are reached, cutting unnecessary computation.
  • The same orchestration-trace lens could apply to coordination in non-LLM multi-agent settings such as robotic or simulation teams.
  • Closing the industrial-academic scale gap would require sharing traces so academic RL methods can match deployment sizes.
  • Adding explicit message-level counterfactual credit could make learning more efficient when communication is the main signal; a sketch follows this list.
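
The message-credit bullet can be made concrete. Below is a hedged sketch in the spirit of COMA's counterfactual baseline [16], lifted from joint actions to individual messages: a message's credit is the outcome score of the real trace minus the mean score when that one message is swapped for alternatives. The scoring function and the source of alternative messages are assumed inputs, not anything the paper specifies.

```python
# Sketch of message-level counterfactual credit (COMA-style baseline [16]
# applied to messages). `score_fn` and `alternatives` are assumed inputs.
from statistics import mean
from typing import Callable, Sequence

def message_credit(trace: list[str], idx: int,
                   alternatives: Sequence[str],
                   score_fn: Callable[[list[str]], float]) -> float:
    """Counterfactual advantage of trace[idx] against swapped-in messages."""
    actual = score_fn(trace)
    counterfactual_scores = []
    for alt in alternatives:
        swapped = list(trace)
        swapped[idx] = alt                 # replace only this one message
        counterfactual_scores.append(score_fn(swapped))
    return actual - mean(counterfactual_scores)
```

A positive value means the message helped relative to plausible substitutes; a value near zero means the outcome did not depend on it.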

Load-bearing premise

The 84-entry paper pool together with its exclusion log fully represents the space of relevant work on RL for LLM-based multi-agent systems.

What would settle it

Publication or discovery of even one explicit reinforcement learning method that trains the stopping decision on orchestration traces would disprove the claimed absence.

Figures

Figures reproduced from arXiv: 2605.02801 by Chenchen Zhang.

Figure 1. Paper map. Reading: the survey takes three input traditions (single-agent LLM RL, classical MARL, and industrial agent systems), foregrounds the orchestration trace as the shared object, and then organizes the literature into reward design, credit assignment, and orchestration learning. Benchmarks, safety, and open problems are downstream because they inherit the same trace structure. view at source ↗

Figure 2. Timeline of selected representative LLM-MAS entries from Q4 2024 to Q2 2026, plotted by arXiv submission date and grouped vertically by the credit-bearing unit they target (§7.1). Nearly the entire corpus sits in an 18-month window, motivating the timing claim in §1.1. The orchestrator and message rows remain sparsely populated throughout; agent- and role-level credit has received the most attention. view at source ↗

Figure 3. Compact coverage map for representative retained entries. • means the entry directly studies the dimension; ◦ means it supplies indirect evidence, a benchmark substrate, or a system constraint. The sparsity is intentional: it shows why the survey treats reward, credit, orchestration, evaluation, and safety as coupled but unevenly supported dimensions. view at source ↗

Figure 4. Corpus construction flow. Counts are internal audit counts after the journal-revision coverage audit, not a claim of exhaustive coverage or independently reproducible screening. view at source ↗

Figure 5. Visual schematics of the six recurring LLM-MAS topologies catalogued in the survey. view at source ↗

Figure 6. Industry–academia scale gap. Reading: blue points summarize the typical public evaluation regime of academic LLM-MAS RL methods, while red filled points mark Kimi reports that disclose both team size and long trace length. Hollow red points indicate industrial deployment-shape evidence where the public material is useful for harness and workflow analysis but does not disclose a comparable RL training scale… view at source ↗

Figure 7. Rollout cost across representative operating regimes, shown as a schematic relative-cost proxy rather than a calibrated dollar or latency estimate. The bars combine representative team size and total trace length / tool-call counts under the cost form in (5); exact ratios depend on token lengths, tool latencies, and harness overhead. The group-of-G annotation shows the additional rollout-collection multiplier… view at source ↗

Figure 8. The harness as a training-frozen interface. The harness (dashed box) wraps the trainable LLM πθ with a prompt template, tool registry, and execution runtime; only θ receives gradients during RL. The harness defines both the input distribution harness(o) that the policy sees and the output grammar A_harness it may emit. A policy fine-tuned through a different harness is a different policy in the deployment s… view at source ↗

Figure 9. Schematic per-step signal-to-noise under three credit schemes as trace length T grows. The blue curve is not a proven rate; it visualizes the qualitative warning in (8): uniform terminal credit can become low-SNR on long shared-reward traces. Role- or message-level decomposition (dashed green) partitions the trace into shorter sub-problems; a learned orchestrator critic (dotted red) targets a smaller set of… view at source ↗

Figure 10. Reward family composition. The seven primitive families R1–R7 (§6.1) group into four semantic tiers (outcome, structured, process, system) and are composed through an R8 hybrid weighting to produce method-specific reward shapes. Three representative compositions from the pool are shown on the right. The less-studied axis is schedule semantics: which terms are transient scaffolds, which terms define the pri… view at source ↗

Figure 11. Schematic of Kimi PARL's three-term reward r_orch = r_perf + λ1·r_parallel + λ2·r_finish across training (§6.2). The task-outcome term r_perf is the primary objective; both auxiliary orchestration-shaping terms are shown as transient scaffolds because the public Kimi K2.5 report states that their hyperparameters are annealed to zero over training. Curves are schematic; exact schedules are not disclosed in the public report. (A placeholder implementation sketch follows this figure list.) view at source ↗

Figure 12. The eight credit-bearing units in LLM-MAS RL (§7.1), stacked from coarsest (team) to finest (token). Red-outlined levels (orchestrator, role, message) have no clean counterpart in classical MARL or single-agent LLM RL. Right-column labels list representative entries that assign credit at each level; the sparse levels (orchestrator, message) mark the most under-populated research territory. view at source ↗

Figure 13. A decision-tree heuristic for selecting a credit-assignment mechanism from the pool, organized by four system-level questions: (i) whether the agent set is dynamic at inference, (ii) whether the orchestrator is the identified bottleneck, (iii) whether traces are long enough to suffer diffusion, and (iv) whether roles are heterogeneous or the structure is debate-shaped. Leaves name the method whose design… view at source ↗

Figure 14. Optimization objects for single-agent LLM RL vs. LLM-MAS RL. (a) A trajectory τ is a linearly ordered sequence of (s_t, a_t) pairs. (b) An orchestration trace G = (V, E, ℓ) is a temporal interaction graph: orchestrator decisions (red) spawn sub-agents (blue), which issue tool calls (orange) and return summaries that are aggregated (green diamond) before the next orchestrator decision. Credit-bearing units (… view at source ↗

Figure 15. The five orchestration sub-decisions O1–O5 (§8.2). An orchestrator policy makes some or all of these decisions per task; surveyed entries cover O1–O4 but not O5. The red dashed box marks "when to stop" as a named open problem (§11): in the surveyed entries, termination is either externally signaled (ground-truth answer found) or triggered by a fixed step-count cap rather than explicitly trained as a st… view at source ↗

Figure 16. Three orchestrator training regimes (§8.3). (A) Frozen sub-agents: gradient flows only into the orchestrator; cheapest and most common. (B) Joint training with shared baseline and per-agent advantage: all policies update together; requires stabilization (Dr. MAS's agent-wise normalization). (C) Fully decoupled per-policy training against a central critic V_ϕ: most expressive but most engineering-heavy. Sol… view at source ↗

Figure 17. Attack-surface map for an LLM-MAS orchestration trace. The substrate (orchestrator → sub-agents → tools → shared memory) is the same as… view at source ↗

Figure 18. Evidence-level matrix used when interpreting the corpus. "Alg." denotes support for algorithmic mechanism claims; "Train" denotes public training-objective or post-training evidence; "Deploy" denotes deployment-shape evidence; "Scale" denotes public scale or horizon evidence. This matrix prevents product documentation from being treated as equivalent to reproducible algorithmic evidence. view at source ↗
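
The composite reward in Figure 11 is simple enough to state in code. The sketch below assumes a linear annealing schedule and placeholder initial weights; the public Kimi K2.5 report discloses neither, so treat every number here as illustrative.

```python
# Sketch of the Figure 11 reward shape:
#   r_orch = r_perf + lam1 * r_parallel + lam2 * r_finish,
# with the auxiliary weights annealed to zero. Schedule and weights assumed.
def annealed_weight(step: int, total_steps: int, lam0: float) -> float:
    """Linearly anneal a scaffold weight from lam0 down to 0 (assumed schedule)."""
    return lam0 * max(0.0, 1.0 - step / total_steps)

def orchestration_reward(r_perf: float, r_parallel: float, r_finish: float,
                         step: int, total_steps: int,
                         lam1_init: float = 0.1, lam2_init: float = 0.1) -> float:
    lam1 = annealed_weight(step, total_steps, lam1_init)
    lam2 = annealed_weight(step, total_steps, lam2_init)
    return r_perf + lam1 * r_parallel + lam2 * r_finish
```
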
read the original abstract

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript surveys reinforcement learning (RL) methods for large language model (LLM)-based multi-agent systems by analyzing orchestration traces—temporal interaction graphs capturing sub-agent spawning, delegation, communication, tool use, aggregation, and stopping. It organizes the literature along three axes: eight families of reward designs (including orchestration rewards for parallelism, split correctness, and aggregation quality), eight credit- or signal-bearing units (from token to team), and five sub-decisions in orchestration learning. A central observation is that, in the authors' curated pool of 84 papers as of May 4, 2026, no explicit RL training method for the stopping decision appears. The work links academic methods to industrial examples from Kimi, OpenAI, and Anthropic and releases an artifact containing the tagged paper pool, 32-record exclusion log, corpus statistics, and a JSON schema for replayable traces.

Significance. If the 84-paper pool and its exclusion criteria are representative, the survey usefully identifies an underexplored area—explicit RL for stopping decisions—in LLM-based multi-agent orchestration, which could direct future work on termination policies. The public release of the tagged pool, exclusion log, scripted statistics, and minimal JSON schema constitutes a concrete contribution to reproducibility, allowing the community to inspect, extend, or challenge the categorization.

major comments (1)
  1. [Abstract] Abstract and curation description: The gap claim ('no explicit RL training method for the stopping decision') is explicitly scoped to the curated pool, yet the manuscript provides no reproducible account of the search protocol (databases queried, exact Boolean strings, date ranges, or inclusion/exclusion criteria) within the text itself; these details reside only in the GitHub artifact. Because the central finding rests on the completeness of this pool, independent verification of whether relevant RL-MAS papers using alternate terminology were missed is not possible from the manuscript alone.
minor comments (1)
  1. [Orchestration learning decomposition] The five sub-decisions (spawn, delegate, communicate, aggregate, stop) are clearly enumerated, but the text could add one or two concrete examples of how an implicit stopping mechanism (e.g., value-based termination inside a single policy) would be tagged versus an explicit stopping decision under the proposed schema.
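
One way the requested distinction could be operationalized is a small tagging rule over pool entries. The entry fields and tag names below are hypothetical, not the survey's actual schema; they only illustrate how value-based termination inside a single policy would be kept separate from an explicitly trained stop decision.

```python
# Hypothetical tagging rule; field and tag names are illustrative only.
def stop_mechanism_tag(entry: dict) -> str:
    if entry.get("stop_action_in_policy") and entry.get("stop_action_trained_by_rl"):
        return "explicit-trained-stop"     # the kind of method that would close the O5 gap
    if entry.get("terminates_on_value_threshold"):
        return "implicit-value-based"      # e.g., halting when a learned value drops
    if entry.get("fixed_step_cap"):
        return "implicit-step-cap"
    return "external-signal"               # e.g., ground-truth answer found

# An entry that halts on a learned value estimate is tagged implicit,
# so it does not count as explicit O5 training under this rule.
assert stop_mechanism_tag({"terminates_on_value_threshold": True}) == "implicit-value-based"
```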

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address the single major comment below and will incorporate the requested changes to improve reproducibility.

read point-by-point responses
  1. Referee: [Abstract] Abstract and curation description: The gap claim ('no explicit RL training method for the stopping decision') is explicitly scoped to the curated pool, yet the manuscript provides no reproducible account of the search protocol (databases queried, exact Boolean strings, date ranges, or inclusion/exclusion criteria) within the text itself; these details reside only in the GitHub artifact. Because the central finding rests on the completeness of this pool, independent verification of whether relevant RL-MAS papers using alternate terminology were missed is not possible from the manuscript alone.

    Authors: We agree that the manuscript should contain a self-contained description of the search protocol so that the central gap claim can be evaluated without external resources. In the revised version we will add a new subsection (placed in the curation or methods section) that explicitly lists the databases and repositories queried, the exact Boolean search strings, the date range (up to May 4, 2026), and the complete inclusion/exclusion criteria. The 32-record exclusion log and the tagged 84-paper pool will remain available in the GitHub artifact as supplementary material to support further inspection and extension by the community. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive literature survey with externally verifiable curation

full rationale

This is a literature survey paper whose central claims consist of categorizations of an external 84-paper pool along three axes and five sub-decisions, plus the negative observation that no explicit RL method for the stopping decision appears in that pool. No mathematical derivations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work are present. The decomposition into spawn/delegate/communicate/aggregate/stop is offered as an analytical lens rather than a self-defining or tautological construction. The released artifact (tagged pool, exclusion log, schema) makes the curation externally inspectable, satisfying the rule that a self-contained survey against external benchmarks receives score 0-2.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on the assumption that the chosen search and exclusion criteria produce a representative sample of the field and that the proposed axes are natural and complete partitions of the design space. No new mathematical axioms or invented physical entities are introduced.

axioms (1)
  • domain assumption: The curated pool of 84 papers plus exclusion log adequately represents the relevant literature on RL for LLM-based multi-agent systems as of May 2026.
    Stated in the abstract when the authors report findings from the pool.

pith-pipeline@v0.9.0 · 8303 in / 1240 out tokens · 69476 ms · 2026-05-08T18:40:35.808270+00:00 · methodology


Reference graph

Works this paper leans on

89 extracted references · 41 canonical work pages · 13 internal anchors

  [1] Anonymous. SAGE: Multi-agent self-evolution for LLM reasoning, 2026. ACL Rolling Review January 2026 submission; challenger, planner, solver, and critic co-evolve from a shared LLM backbone; under review; accessed 2026-05-04.

  [2] Anthropic. Creating custom sub-agents (Claude Code docs). https://docs.anthropic.com/en/docs/claude-code/sub-agents, 2025. Documentation; accessed 2026-04-27.

  [3] Anthropic. Our framework for developing safe and trustworthy agents. https://www.anthropic.com/news/our-framework-for-developing-safe-and-trustworthy-agents, 2025. 2025-08-04; accessed 2026-04-27.

  [4] Anthropic Engineering. Building a C compiler with a team of parallel Claudes. https://www.anthropic.com/engineering/building-c-compiler, 2026. 2026-02-05; 16 parallel Claudes; accessed 2026-04-27.

  [5] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002. Original Dec-POMDP formalism; proves NEXP-completeness.

  [6] Shuaihang Chen, Weinan Zhang, Ting Liu, et al. A survey on LLM-based multi-agent system: Recent advances and new frontiers in application. arXiv preprint arXiv:2412.17481, 2024.

  [7] Yanjun Chen et al. Contextual counterfactual credit assignment for multi-agent reinforcement learning in LLM collaboration. arXiv preprint arXiv:2603.06859, 2026. Counterfactual causal credit assignment at the message level.

  [8] Yixing Chen et al. Multi-agent evolve: LLM self-improve through co-evolution. arXiv preprint arXiv:2510.23595, 2025. Proposer-Solver-Judge co-evolution; UIUC ulab.

  [9] Igor Costa. AgentSpawn: Adaptive multi-agent collaboration through dynamic spawning for long-horizon code generation. arXiv preprint arXiv:2602.07072, 2026. Runtime dynamic spawn + memory transfer; sole author.

  [10] Yufan Dang et al. Multi-agent collaboration via evolving orchestration. In Advances in Neural Information Processing Systems (NeurIPS), 2025. Puppeteer central orchestrator; Tsinghua/OpenBMB ChatDev team.

  [11] Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip H. S. Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020. IPPO; independent PPO competitive on SMAC.

  [12] Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. 97 realistic tasks, 629 security test cases; ETH SPY Lab.

  [13] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 645:633–638, 2025. Rule-based RL unlocks long-CoT reasoning; R1 & R1-Zero.

  [14] Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaudhuri. WASP: Benchmarking web agent security against prompt injection attacks. arXiv preprint arXiv:2504.18575, 2025. Meta FAIR; end-to-end web-agent prompt-injection benchmark.

  [15] Lang Feng et al. Dr. MAS: Stable reinforcement learning for multi-agent LLM systems. arXiv preprint arXiv:2602.08847, 2026. Diagnoses GRPO instability in MAS; agent-wise normalization.

  [16] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. COMA; counterfactual baseline for per-agent credit.

  [17] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), 2023. Foundational paper on indirect prompt injection.

  [18] Zhitao He, Zijun Liu, Peng Li, Yi R. Fung, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu. Advancing language multi-agent learning with credit re-assignment for interactive environment generalization. In Conference on Language Modeling (COLM), 2025. Introduces CollabUIAgents and multi-agent credit re-assignment for interactive UI/web environments; accessed …

  [19] Haoyang Hong et al. Multi-agent deep research: Training multi-agent systems with M-GRPO. arXiv preprint arXiv:2511.13288, 2025. Hierarchical GRPO; Ant Group.

  [20] Zhipeng Hou et al. HALO: Hierarchical autonomous logic-oriented orchestration for multi-agent LLM systems. arXiv preprint arXiv:2505.13516, 2025. MCTS-based three-layer hierarchical MAS.

  [21] Hao-Lun Hsu, Jing Xu, Nikhil Vichare, Francesco Carbone, Miroslav Pajic, and Giuseppe Carenini. DEPART: Hierarchical multi-agent system for multi-turn interaction, 2026. OpenReview ICLR 2026 submission; introduces HIMPO for alternating planner/executor post-training with role-specific rewards; accessed 2026-05-04.

  [22] Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Ziyu Ye, Bowei Xia, Tao Sun, Zhaoxuan Jin, Yingru Li, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, and Guohao Li. OWL: Optimized workforce learning for general multi-agent assistance in real-world task automation. In Advances in Neural Information Processing Systems (NeurIPS), 2025. Neur…

  [23] Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, and Ying Nian Wu. Agent Q-Mix: Selecting the right action for LLM multi-agent systems through reinforcement learning. arXiv preprint arXiv:2604.00344, 2026. QMIX-style CTDE for decentralized communication and topology dec…

  [24] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024. 2294 real GitHub issues from 12 Python repos.

  [25] Ishan Kavathekar et al. TAMAS: Benchmarking adversarial risks in multi-agent LLM systems. In ICML 2025 Multi-Agent Systems Workshop, 2025. First adversarial robustness benchmark for MAS.

  [26] Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, and Shafiq Joty. MAS-Zero: Designing multi-agent systems with zero supervision. arXiv preprint arXiv:2505.14996, 2025. Inference-time self-evolved MAS design through meta-level design feedback and self-verification; accessed 2026-04-27.

  [27] Rana Muhammad Shahroz Khan, Zhen Tan, Sukwon Chen, Patrick Foulds, Sean Yong, Huan Liu, and Tianlong Chen. Agents under siege: Breaking pragmatic multi-agent LLM systems with optimized prompt attacks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. Permutation-invariant adversarial attack on multi-agent…

  [28] Kimi Team. Kimi K2.5: Visual agentic intelligence. https://www.kimi.com/blog/kimi-k2-5.html, 2026. Technical report by Moonshot AI; arXiv 2602.02276; introduces Agent Swarm + PARL; accessed 2026-04-27.

  [29] Kimi Team. Kimi K2.6 tech blog. https://www.kimi.com/blog/kimi-k2-6, 2026. 2026-04-20; 300-agent coordination, Claw Groups; accessed 2026-04-27.

  [30] Sha Li et al. Experience as a compass: Multi-agent RAG with evolving orchestration and agent prompts. arXiv preprint arXiv:2604.00901, 2026. HERA; evolving orchestration policy for MAS-RAG.

  [31] Yanming Li et al. Who deserves the reward? SHARP: Shapley credit-based optimization for multi-agent system. arXiv preprint arXiv:2602.08335, 2026. Shapley-value-based hierarchical credit assignment.

  [32] Junwei Liao et al. MARFT: Multi-agent reinforcement fine-tuning. arXiv preprint arXiv:2504.16129, 2025. v4 dated 2025-11-03; submitted to ICLR 2026.

  [33] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 157–163, 1994. Foundational stochastic/Markov-game formulation for MARL.

  [34] Bo Liu, Simon Yu, Zichen Liu, Leon Guertler, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, and Natasha Jaques. SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. In International Conference on Learning Representations (ICLR), 2026. ICLR 2026 poster; role-c…

  [35] Keliang Liu et al. Reinforcement learning meets large language models: A survey of advancements and applications across the LLM lifecycle. arXiv preprint arXiv:2509.16679, 2025. Fudan/Tongji/CUHK MMLab.

  [37] Shulin Liu et al. MarsRL: Advancing multi-agent reasoning system via reinforcement learning with agentic pipeline parallelism. arXiv preprint arXiv:2511.11373, 2025. Agentic pipeline-parallel RL.

  [38] Shuo Liu, Christopher Amato, et al. LLM collaboration with multi-agent reinforcement learning. arXiv preprint arXiv:2508.04652, 2025. v7 dated 2025-12-09; introduces MA-GRPO.

  [39] Shuo Liu, Tianle Chen, Ryan Amiri, and Christopher Amato. Learning decentralized LLM collaboration with multi-agent actor critic. arXiv preprint arXiv:2601.21972, 2026. Introduces CoLLM-CC and CoLLM-DC actor-critic variants for decentralized LLM collaboration; accessed 2026-05-04.

  [40] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NeurIPS), 2017. MADDPG; centralized critic, decentralized actor.

  [41] Xufang Luo et al. Agent Lightning: Train any AI agents with reinforcement learning. arXiv preprint arXiv:2508.03680, 2025. Microsoft Research; decouples agent execution from RL training.

  [42] Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, and Jun Zhao. A troublemaker with contagious jailbreak makes chaos in honest towns. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. Contagious jailbreak that propagates through agent memory across non-complete-graph topologies.

  [43] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. arXiv preprint arXiv:2311.12983, 2023. 466 tool-use-heavy real-world questions.

  [44] Zhanfeng Mo et al. Multi-agent tool-integrated policy optimization. arXiv preprint arXiv:2510.04678, 2025. Single-LLM dual-role planner+worker; +18.38% over single-agent.

  [45] Sumeet Ramesh Motwani et al. MALT: Improving reasoning with multi-agent LLM training. In Conference on Language Modeling (COLM), 2025. Generator-verifier-refiner training with role-PRM (+14.14%).

  [46] OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/; https://openai.com/index/introducing-the-codex-app/, 2025. 2025-05-16 launch post plus Codex app materials; cloud-native parallel software-engineering agent; accessed 2026-04-27.

  [47] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  [48] Chanwoo Park et al. MAPoRL: Multi-agent post-co-training for collaborative large language models with reinforcement learning. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025. MIT; first explicit post-training RL for collaboration.

  [49] Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, et al. CriticLean: Critic-guided reinforcement learning for mathematical formalization. arXiv preprint arXiv:2507.06181, 2025. Trains a critic via SFT+RL to score Lean 4 formalizations; concrete instance of verifier-as-reward (R6).

  [50] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representations (ICLR), 2024.

  [51] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2018. QMIX; monotonic mixing network.

  [52] Moein Salimi et al. Debate as reward: A multi-agent reward system for scientific ideation via RL post-training. arXiv preprint arXiv:2604.16723, 2026. Multi-agent debate as reward signal.

  [53] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. PPO; clipped surrogate objective.

  [54] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. Original source of GRPO.

  [55] Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, and Igor Mordatch. Multiagent finetuning: Self improvement with diverse reasoning chains. In International Conference on Learning Representations (ICLR), 2025. Finetunes a society of language models using diverse reasoning chains generated through multiagent interaction; ac…

  [56] Weiwei Sun et al. Scaling long-horizon LLM agent via context-folding. arXiv preprint arXiv:2510.11967, 2025. ByteDance Seed/CMU; submitted to ICLR 2026.

  [57] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech M. Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2018. …

  [58] Khanh-Tung Tran, Barry O'Sullivan, et al. Multi-agent collaboration mechanisms: A survey of LLMs. arXiv preprint arXiv:2501.06322, 2025. UCC Ireland.

  [59] Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, and Ying Wen. ReMA: Learning to meta-think for LLMs with multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2025. NeurIPS 2025 poster; multi-agent RL framework for meta-thinking with high-level…

  [60] Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley Q-value: A local reward approach to solve global reward games. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. Shapley-value credit assignment for cooperative MARL.

  [61] Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, et al. MTU-Bench: A multi-granularity tool-use benchmark for large language models. In International Conference on Learning Representations (ICLR), 2025. Five-granularity tool-use benchmark covering single/multi-turn and single/multi-tool scenarios.

  [62] Shijie Wang, Pengfei Li, Yikun Fu, Kaifeng Liu, Fangyuan Li, Yang Liu, Xiaowei Sun, Zonglin Li, Siyao Zhao, Jian Zhao, Kai Tian, Dong Li, Junqi Gao, Yutong Zhang, Yiqun Chen, Yuqiang Li, Zoe Li, Weinan Zhang, Peng Ye, Shuyue Hu, Lei Bai, Bowen Zhou, Kaiyan Zhang, and Biqing Qi. MARTI-MARS 2: Scaling multi-agent self-search via reinforcement learning for code generation. arXiv preprint arXiv:2602.07848, 2026.

  [63] Ziwei Wang, Junjie Zheng, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Zhouhua Fang, Zhiwei Liu, Dajun Chen, Yong Li, and Jiajun Bu. Towards scalable lightweight GUI agents via multi-role orchestration. arXiv preprint arXiv:2604.13488, 2026. Findings of ACL 2026; multi-role orchestration and RL for role-oriented cooperative exploration; accessed 2026-05-04.

  [64] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025. OpenAI 2025-04-17; 1266 hard browsing questions.

  [65] Tianxin Wei, Heng Ji, et al. Agentic reasoning for large language models. arXiv preprint arXiv:2601.12538, 2026. UIUC; 29-author team.

  [66] Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, and Qiaosheng Zhang. MAGIC: A co-evolving attacker-defender adversarial game for robust LLM safety. arXiv preprint arXiv:2602.01539, 2026. Multi-turn attacker-defender multi-agent RL for safety alignment; accessed 2026-05-04.

  [67] David H. Wolpert and Kagan Tumer. Optimal payoff functions for members of collectives. Advances in Complex Systems, 4(2–3):265–279, 2001. Difference rewards / Wonderful Life Utility; foundational credit assignment.

  [68] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  [69] Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu, Weilin Liu, Quanlu Zhang, Wenbo Ding, Chao Yu, and Yu Wang. WideSeek-R1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning. arXiv preprint arXiv:2602.04634, 2026. Lead-agent/subagent MARL for broad information seeking and width scaling; accessed 2026-05-04.

  [71] Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, and Lei Bai. CoMAS: Co-evolving multi-agent systems via interaction rewards. In International Conference on Learning Representations (ICLR), 2026. ICLR 2026 poster; self-evolution through interaction-derived rewards and LLM-as-judge reward construction; accessed 2026-04-27.

  [73] Wei Yang and Jesse Thomason. Learning to deliberate: Meta-policy collaboration for agentic LLMs with multi-agent reinforcement learning. arXiv preprint arXiv:2509.03817, 2025. Introduces MPDF and SoftRankPO for decentralized meta-cognitive actions Persist, Refine, and Concede; accessed 2026-04-27.

  [75] Huaiyuan Yao, Longchao Da, Xiaoou Liu, Charles Fleming, Tianlong Chen, and Hua Wei. LangMARL: Natural language multi-agent reinforcement learning. arXiv preprint arXiv:2604.00722, 2026. Agent-level language credit assignment and policy-gradient evolution in language space; accessed 2026-05-04.

  [76] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024. Sierra Research; retail/airline domains with policy adherence.

  [77] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. Interleaved reasoning+acting; agentic origin.

  [78] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2022. MAPPO; PPO with centralized value for cooperative MARL.

  [80] Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, and Yu Wang. MARSHAL: Incentivizing multi-agent reasoning via self-play with strategic LLMs. In International Conference on Learning Representations (ICLR), 2026. ICLR 2026 poster; turn-level advantage estimation a…

Showing first 80 references.