pith. machine review for the scientific record.

arxiv: 2605.02801 · v1 · submitted 2026-05-04 · 💻 cs.CL

Recognition: 3 theorem links · Lean Theorem

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords: reinforcement learning · multi-agent systems · large language models · orchestration traces · credit assignment · reward design · stopping decision · agent coordination

The pith

Reinforcement learning for teams of LLM agents must optimize coordination decisions such as when to stop, yet no method in the paper's surveyed pool trains the stopping decision explicitly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames reinforcement learning for multi-agent systems of large language models around orchestration traces, which are temporal graphs of events including spawning agents, delegating tasks, communicating, using tools, aggregating results, and stopping. It organizes the technical challenges into three axes: reward design across eight families that can include orchestration-specific signals, attachment of credit signals to units from tokens to teams, and decomposition of orchestration into five sub-decisions. The review finds that while the first four sub-decisions have some RL attention, the stopping decision has none in the surveyed work. This gap matters because without it, agent teams may continue computing after goals are met or fail to know when to conclude, limiting reliable scaling of coordinated LLM systems.

Core claim

Orchestration traces are temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Reward design spans eight families including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Reward and credit signals attach to eight units from token to team, with explicit counterfactual message-level credit sparse. Orchestration learning decomposes into five sub-decisions, and no explicit RL training method for the stopping decision appears in the 84-entry curated pool as of May 2026.

What carries the argument

Orchestration traces: temporal interaction graphs that record events of spawning, delegating, communicating, aggregating, and stopping in multi-agent LLM systems.
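
A minimal sketch of what one replayable trace event could look like under this definition, written in Python. The field names are illustrative assumptions, not the JSON schema the authors actually release with their artifact.

```python
# Hypothetical event record for a replayable orchestration trace.
# Field names are editorial guesses, not the paper's released schema.
from dataclasses import dataclass, field
from typing import Literal, Optional

EventKind = Literal["spawn", "delegate", "communicate", "tool_use",
                    "return", "aggregate", "stop"]

@dataclass
class TraceEvent:
    event_id: str
    kind: EventKind
    timestamp: float                      # wall-clock or logical time
    actor: str                            # agent that emitted the event
    target: Optional[str] = None          # receiving agent, if any
    payload: dict = field(default_factory=dict)        # message text, tool args
    parents: list[str] = field(default_factory=list)   # incoming graph edges

@dataclass
class OrchestrationTrace:
    task_id: str
    events: list[TraceEvent] = field(default_factory=list)

    def stopped_explicitly(self) -> bool:
        """True if some agent emitted a stop event, rather than the trace
        ending on a step cap or an external ground-truth signal."""
        return any(e.kind == "stop" for e in self.events)
```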

If this is right

  • Reward design can incorporate eight families that reward parallelism speedup, task split correctness, and aggregation quality.
  • Credit signals attach to eight different units ranging from single tokens to full teams, though message-level counterfactual credit remains rare.
  • Orchestration decomposes into five sub-decisions of spawn, delegate, communicate, aggregate, and stop, with the last one lacking RL methods; a minimal sketch of what explicit training could look like follows this list.
  • Industrial deployments such as agent swarms operate at larger scales than open academic evaluations, revealing a verification gap.
  • A released JSON schema for replayable traces and the tagged paper pool enable systematic study of coordination learning.
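
Because the paper reports that no surveyed method trains the stopping decision, the following is strictly an editorial sketch of what such training could look like: a toy REINFORCE loop over a discrete orchestrator action space that includes stop, where stopping pays off only once the goal is met and every extra step costs. The network shape, reward scale, and step cost are all invented for illustration.

```python
# Editorial sketch: explicit RL on the stop decision (sub-decision O5).
# Not a method from the surveyed pool; all hyperparameters are placeholders.
import torch
import torch.nn as nn

ACTIONS = ["spawn", "delegate", "communicate", "aggregate", "stop"]

class OrchestratorHead(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, len(ACTIONS)))

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce_update(head, optimizer, episode, step_cost=0.01, gamma=0.99):
    """episode: list of (state, action_index, goal_met) triples.
    Stopping earns +1 only once the goal is met (-1 if premature), and each
    step costs, so the return pressures the policy to learn *when* to stop."""
    rewards = []
    for _, action, goal_met in episode:
        r = -step_cost
        if action == ACTIONS.index("stop"):
            r += 1.0 if goal_met else -1.0
        rewards.append(r)
    returns, g = [], 0.0
    for r in reversed(rewards):           # discounted returns-to-go
        g = r + gamma * g
        returns.insert(0, g)
    loss = torch.tensor(0.0)
    for (state, action, _), g in zip(episode, returns):
        loss = loss - head(state).log_prob(torch.tensor(action)) * g
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```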

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training the stopping decision would let agent teams end work once goals are reached, cutting unnecessary computation.
  • The same orchestration-trace lens could apply to coordination in non-LLM multi-agent settings such as robotic or simulation teams.
  • Closing the industrial-academic scale gap would require sharing traces so academic RL methods can match deployment sizes.
  • Adding explicit message-level counterfactual credit could make learning more efficient when communication is the main signal; a sketch follows this list.
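
The message-credit bullet can be made concrete. Below is a hedged sketch in the spirit of COMA's counterfactual baseline [16], lifted from joint actions to individual messages: a message's credit is the outcome score of the real trace minus the mean score when that one message is swapped for alternatives. The scoring function and the source of alternative messages are assumed inputs, not anything the paper specifies.

```python
# Sketch of message-level counterfactual credit (COMA-style baseline [16]
# applied to messages). `score_fn` and `alternatives` are assumed inputs.
from statistics import mean
from typing import Callable, Sequence

def message_credit(trace: list[str], idx: int,
                   alternatives: Sequence[str],
                   score_fn: Callable[[list[str]], float]) -> float:
    """Counterfactual advantage of trace[idx] against swapped-in messages."""
    actual = score_fn(trace)
    counterfactual_scores = []
    for alt in alternatives:
        swapped = list(trace)
        swapped[idx] = alt                 # replace only this one message
        counterfactual_scores.append(score_fn(swapped))
    return actual - mean(counterfactual_scores)
```

A positive value means the message helped relative to plausible substitutes; a value near zero means the outcome did not depend on it.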

Load-bearing premise

The 84-entry paper pool together with its exclusion log fully represents the space of relevant work on RL for LLM-based multi-agent systems.

What would settle it

Publication or discovery of even one explicit reinforcement learning method that trains the stopping decision on orchestration traces would disprove the claimed absence.

Figures

Figures reproduced from arXiv: 2605.02801 by Chenchen Zhang.

Figure 1. Paper map. Reading: the survey takes three input traditions (single-agent LLM RL, classical MARL, and industrial agent systems), foregrounds the orchestration trace as the shared object, and then organizes the literature into reward design, credit assignment, and orchestration learning. Benchmarks, safety, and open problems are downstream because they inherit the same trace structure. view at source ↗

Figure 2. Timeline of selected representative LLM-MAS entries from Q4 2024 to Q2 2026, plotted by arXiv submission date and grouped vertically by the credit-bearing unit they target (§7.1). Nearly the entire corpus sits in an 18-month window, motivating the timing claim in §1.1. The orchestrator and message rows remain sparsely populated throughout; agent- and role-level credit has received the most attention. view at source ↗

Figure 3. Compact coverage map for representative retained entries. • means the entry directly studies the dimension; ◦ means it supplies indirect evidence, a benchmark substrate, or a system constraint. The sparsity is intentional: it shows why the survey treats reward, credit, orchestration, evaluation, and safety as coupled but unevenly supported dimensions. view at source ↗

Figure 4. Corpus construction flow. Counts are internal audit counts after the journal-revision coverage audit, not a claim of exhaustive coverage or independently reproducible screening. view at source ↗

Figure 5. Visual schematics of the six recurring LLM-MAS topologies catalogued in the survey. view at source ↗

Figure 6. Industry–academia scale gap. Reading: blue points summarize the typical public evaluation regime of academic LLM-MAS RL methods, while red filled points mark Kimi reports that disclose both team size and long trace length. Hollow red points indicate industrial deployment-shape evidence where the public material is useful for harness and workflow analysis but does not disclose a comparable RL training scale… view at source ↗

Figure 7. Rollout cost across representative operating regimes, shown as a schematic relative-cost proxy rather than a calibrated dollar or latency estimate. The bars combine representative team size and total trace length / tool-call counts under the cost form in (5); exact ratios depend on token lengths, tool latencies, and harness overhead. The group-of-G annotation shows the additional rollout-collection multiplier… view at source ↗

Figure 8. The harness as a training-frozen interface. The harness (dashed box) wraps the trainable LLM πθ with a prompt template, tool registry, and execution runtime; only θ receives gradients during RL. The harness defines both the input distribution harness(o) that the policy sees and the output grammar A_harness it may emit. A policy fine-tuned through a different harness is a different policy in the deployment s… view at source ↗

Figure 9. Schematic per-step signal-to-noise under three credit schemes as trace length T grows. The blue curve is not a proven rate; it visualizes the qualitative warning in (8): uniform terminal credit can become low-SNR on long shared-reward traces. Role- or message-level decomposition (dashed green) partitions the trace into shorter sub-problems; a learned orchestrator critic (dotted red) targets a smaller set of… view at source ↗

Figure 10. Reward family composition. The seven primitive families R1–R7 (§6.1) group into four semantic tiers (outcome, structured, process, system) and are composed through an R8 hybrid weighting to produce method-specific reward shapes. Three representative compositions from the pool are shown on the right. The less-studied axis is schedule semantics: which terms are transient scaffolds, which terms define the pri… view at source ↗

Figure 11. Schematic of Kimi PARL's three-term reward r_orch = r_perf + λ1·r_parallel + λ2·r_finish across training (§6.2). The task-outcome term r_perf is the primary objective; both auxiliary orchestration-shaping terms are shown as transient scaffolds because the public Kimi K2.5 report states that their hyperparameters are annealed to zero over training. Curves are schematic; exact schedules are not disclosed in the public report. (A placeholder implementation sketch follows this figure list.) view at source ↗

Figure 12. The eight credit-bearing units in LLM-MAS RL (§7.1), stacked from coarsest (team) to finest (token). Red-outlined levels (orchestrator, role, message) have no clean counterpart in classical MARL or single-agent LLM RL. Right-column labels list representative entries that assign credit at each level; the sparse levels (orchestrator, message) mark the most under-populated research territory. view at source ↗

Figure 13. A decision-tree heuristic for selecting a credit-assignment mechanism from the pool, organized by four system-level questions: (i) whether the agent set is dynamic at inference, (ii) whether the orchestrator is the identified bottleneck, (iii) whether traces are long enough to suffer diffusion, and (iv) whether roles are heterogeneous or the structure is debate-shaped. Leaves name the method whose design… view at source ↗

Figure 14. Optimization objects for single-agent LLM RL vs. LLM-MAS RL. (a) A trajectory τ is a linearly ordered sequence of (s_t, a_t) pairs. (b) An orchestration trace G = (V, E, ℓ) is a temporal interaction graph: orchestrator decisions (red) spawn sub-agents (blue), which issue tool calls (orange) and return summaries that are aggregated (green diamond) before the next orchestrator decision. Credit-bearing units (… view at source ↗

Figure 15. The five orchestration sub-decisions O1–O5 (§8.2). An orchestrator policy makes some or all of these decisions per task; surveyed entries cover O1–O4 but not O5. The red dashed box marks "when to stop" as a named open problem (§11): in the surveyed entries, termination is either externally signaled (ground-truth answer found) or triggered by a fixed step-count cap rather than explicitly trained as a st… view at source ↗

Figure 16. Three orchestrator training regimes (§8.3). (A) Frozen sub-agents: gradient flows only into the orchestrator; cheapest and most common. (B) Joint training with shared baseline and per-agent advantage: all policies update together; requires stabilization (Dr. MAS's agent-wise normalization). (C) Fully decoupled per-policy training against a central critic V_ϕ: most expressive but most engineering-heavy. Sol… view at source ↗

Figure 17. Attack-surface map for an LLM-MAS orchestration trace. The substrate (orchestrator → sub-agents → tools → shared memory) is the same as… view at source ↗

Figure 18. Evidence-level matrix used when interpreting the corpus. "Alg." denotes support for algorithmic mechanism claims; "Train" denotes public training-objective or post-training evidence; "Deploy" denotes deployment-shape evidence; "Scale" denotes public scale or horizon evidence. This matrix prevents product documentation from being treated as equivalent to reproducible algorithmic evidence. view at source ↗
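
The composite reward in Figure 11 is simple enough to state in code. The sketch below assumes a linear annealing schedule and placeholder initial weights; the public Kimi K2.5 report discloses neither, so treat every number here as illustrative.

```python
# Sketch of the Figure 11 reward shape:
#   r_orch = r_perf + lam1 * r_parallel + lam2 * r_finish,
# with the auxiliary weights annealed to zero. Schedule and weights assumed.
def annealed_weight(step: int, total_steps: int, lam0: float) -> float:
    """Linearly anneal a scaffold weight from lam0 down to 0 (assumed schedule)."""
    return lam0 * max(0.0, 1.0 - step / total_steps)

def orchestration_reward(r_perf: float, r_parallel: float, r_finish: float,
                         step: int, total_steps: int,
                         lam1_init: float = 0.1, lam2_init: float = 0.1) -> float:
    lam1 = annealed_weight(step, total_steps, lam1_init)
    lam2 = annealed_weight(step, total_steps, lam2_init)
    return r_perf + lam1 * r_parallel + lam2 * r_finish
```
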
read the original abstract

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript surveys reinforcement learning (RL) methods for large language model (LLM)-based multi-agent systems by analyzing orchestration traces—temporal interaction graphs capturing sub-agent spawning, delegation, communication, tool use, aggregation, and stopping. It organizes the literature along three axes: eight families of reward designs (including orchestration rewards for parallelism, split correctness, and aggregation quality), eight credit- or signal-bearing units (from token to team), and five sub-decisions in orchestration learning. A central observation is that, in the authors' curated pool of 84 papers as of May 4, 2026, no explicit RL training method for the stopping decision appears. The work links academic methods to industrial examples from Kimi, OpenAI, and Anthropic and releases an artifact containing the tagged paper pool, 32-record exclusion log, corpus statistics, and a JSON schema for replayable traces.

Significance. If the 84-paper pool and its exclusion criteria are representative, the survey usefully identifies an underexplored area—explicit RL for stopping decisions—in LLM-based multi-agent orchestration, which could direct future work on termination policies. The public release of the tagged pool, exclusion log, scripted statistics, and minimal JSON schema constitutes a concrete contribution to reproducibility, allowing the community to inspect, extend, or challenge the categorization.

major comments (1)
  1. [Abstract] Abstract and curation description: The gap claim ('no explicit RL training method for the stopping decision') is explicitly scoped to the curated pool, yet the manuscript provides no reproducible account of the search protocol (databases queried, exact Boolean strings, date ranges, or inclusion/exclusion criteria) within the text itself; these details reside only in the GitHub artifact. Because the central finding rests on the completeness of this pool, independent verification of whether relevant RL-MAS papers using alternate terminology were missed is not possible from the manuscript alone.
minor comments (1)
  1. [Orchestration learning decomposition] The five sub-decisions (spawn, delegate, communicate, aggregate, stop) are clearly enumerated, but the text could add one or two concrete examples of how an implicit stopping mechanism (e.g., value-based termination inside a single policy) would be tagged versus an explicit stopping decision under the proposed schema.
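
One way the requested distinction could be operationalized is a small tagging rule over pool entries. The entry fields and tag names below are hypothetical, not the survey's actual schema; they only illustrate how value-based termination inside a single policy would be kept separate from an explicitly trained stop decision.

```python
# Hypothetical tagging rule; field and tag names are illustrative only.
def stop_mechanism_tag(entry: dict) -> str:
    if entry.get("stop_action_in_policy") and entry.get("stop_action_trained_by_rl"):
        return "explicit-trained-stop"     # the kind of method that would close the O5 gap
    if entry.get("terminates_on_value_threshold"):
        return "implicit-value-based"      # e.g., halting when a learned value drops
    if entry.get("fixed_step_cap"):
        return "implicit-step-cap"
    return "external-signal"               # e.g., ground-truth answer found

# An entry that halts on a learned value estimate is tagged implicit,
# so it does not count as explicit O5 training under this rule.
assert stop_mechanism_tag({"terminates_on_value_threshold": True}) == "implicit-value-based"
```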

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address the single major comment below and will incorporate the requested changes to improve reproducibility.

read point-by-point responses
  1. Referee: [Abstract] Abstract and curation description: The gap claim ('no explicit RL training method for the stopping decision') is explicitly scoped to the curated pool, yet the manuscript provides no reproducible account of the search protocol (databases queried, exact Boolean strings, date ranges, or inclusion/exclusion criteria) within the text itself; these details reside only in the GitHub artifact. Because the central finding rests on the completeness of this pool, independent verification of whether relevant RL-MAS papers using alternate terminology were missed is not possible from the manuscript alone.

    Authors: We agree that the manuscript should contain a self-contained description of the search protocol so that the central gap claim can be evaluated without external resources. In the revised version we will add a new subsection (placed in the curation or methods section) that explicitly lists the databases and repositories queried, the exact Boolean search strings, the date range (up to May 4, 2026), and the complete inclusion/exclusion criteria. The 32-record exclusion log and the tagged 84-paper pool will remain available in the GitHub artifact as supplementary material to support further inspection and extension by the community. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive literature survey with externally verifiable curation

full rationale

This is a literature survey paper whose central claims consist of categorizations of an external 84-paper pool along three axes and five sub-decisions, plus the negative observation that no explicit RL method for the stopping decision appears in that pool. No mathematical derivations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work are present. The decomposition into spawn/delegate/communicate/aggregate/stop is offered as an analytical lens rather than a self-defining or tautological construction. The released artifact (tagged pool, exclusion log, schema) makes the curation externally inspectable, satisfying the rule that a self-contained survey against external benchmarks receives score 0-2.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on the assumption that the chosen search and exclusion criteria produce a representative sample of the field and that the proposed axes are natural and complete partitions of the design space. No new mathematical axioms or invented physical entities are introduced.

axioms (1)
  • domain assumption: The curated pool of 84 papers plus exclusion log adequately represents the relevant literature on RL for LLM-based multi-agent systems as of May 2026.
    Stated in the abstract when the authors report findings from the pool.

pith-pipeline@v0.9.0 · 8303 in / 1240 out tokens · 69476 ms · 2026-05-08T18:40:35.808270+00:00 · methodology


Reference graph

Works this paper leans on

89 extracted references · 41 canonical work pages · 13 internal anchors

  [1] Anonymous. SAGE: Multi-agent self-evolution for LLM reasoning, 2026. ACL Rolling Review January 2026 submission; challenger, planner, solver, and critic co-evolve from a shared LLM backbone; under review; accessed 2026-05-04.

  [2] Anthropic. Creating custom sub-agents (Claude Code docs). https://docs.anthropic.com/en/docs/claude-code/sub-agents, 2025. Documentation; accessed 2026-04-27.

  [3] Anthropic. Our framework for developing safe and trustworthy agents. https://www.anthropic.com/news/our-framework-for-developing-safe-and-trustworthy-agents, 2025. 2025-08-04; accessed 2026-04-27.

  [4] Anthropic Engineering. Building a C compiler with a team of parallel Claudes. https://www.anthropic.com/engineering/building-c-compiler, 2026. 2026-02-05; 16 parallel Claudes; accessed 2026-04-27.

  [5] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002. Original Dec-POMDP formalism; proves NEXP-completeness.

  [6] Shuaihang Chen, Weinan Zhang, Ting Liu, et al. A survey on LLM-based multi-agent system: Recent advances and new frontiers in application. arXiv preprint arXiv:2412.17481, 2024.

  [7] Yanjun Chen et al. Contextual counterfactual credit assignment for multi-agent reinforcement learning in LLM collaboration. arXiv preprint arXiv:2603.06859, 2026. Counterfactual causal credit assignment at the message level.

  [8] Yixing Chen et al. Multi-agent evolve: LLM self-improve through co-evolution. arXiv preprint arXiv:2510.23595, 2025. Proposer-Solver-Judge co-evolution; UIUC ulab.

  [9] Igor Costa. AgentSpawn: Adaptive multi-agent collaboration through dynamic spawning for long-horizon code generation. arXiv preprint arXiv:2602.07072, 2026. Runtime dynamic spawn + memory transfer; sole author.

  [10] Yufan Dang et al. Multi-agent collaboration via evolving orchestration. In Advances in Neural Information Processing Systems (NeurIPS), 2025. Puppeteer central orchestrator; Tsinghua/OpenBMB ChatDev team.

  [11] Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip H. S. Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020. IPPO; independent PPO competitive on SMAC.

  [12] Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. 97 realistic tasks, 629 security test cases; ETH SPY Lab.

  [13] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 645:633–638, 2025. Rule-based RL unlocks long-CoT reasoning; R1 & R1-Zero.

  [14] Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaudhuri. WASP: Benchmarking web agent security against prompt injection attacks. arXiv preprint arXiv:2504.18575, 2025. Meta FAIR; end-to-end web-agent prompt-injection benchmark.

  [15] Lang Feng et al. Dr. MAS: Stable reinforcement learning for multi-agent LLM systems. arXiv preprint arXiv:2602.08847, 2026. Diagnoses GRPO instability in MAS; agent-wise normalization.

  [16] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. COMA; counterfactual baseline for per-agent credit.

  [17] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), 2023. Foundational paper on indirect prompt injection.

  [18] Zhitao He, Zijun Liu, Peng Li, Yi R. Fung, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu. Advancing language multi-agent learning with credit re-assignment for interactive environment generalization. In Conference on Language Modeling (COLM), 2025. Introduces CollabUIAgents and multi-agent credit re-assignment for interactive UI/web environments; accessed …

  [19] Haoyang Hong et al. Multi-agent deep research: Training multi-agent systems with M-GRPO. arXiv preprint arXiv:2511.13288, 2025. Hierarchical GRPO; Ant Group.

  [20] Zhipeng Hou et al. HALO: Hierarchical autonomous logic-oriented orchestration for multi-agent LLM systems. arXiv preprint arXiv:2505.13516, 2025. MCTS-based three-layer hierarchical MAS.

  [21] Hao-Lun Hsu, Jing Xu, Nikhil Vichare, Francesco Carbone, Miroslav Pajic, and Giuseppe Carenini. DEPART: Hierarchical multi-agent system for multi-turn interaction, 2026. OpenReview ICLR 2026 submission; introduces HIMPO for alternating planner/executor post-training with role-specific rewards; accessed 2026-05-04.

  [22] Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Ziyu Ye, Bowei Xia, Tao Sun, Zhaoxuan Jin, Yingru Li, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, and Guohao Li. OWL: Optimized workforce learning for general multi-agent assistance in real-world task automation. In Advances in Neural Information Processing Systems (NeurIPS), 2025. Neur…

  [23] Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, and Ying Nian Wu. Agent Q-Mix: Selecting the right action for LLM multi-agent systems through reinforcement learning. arXiv preprint arXiv:2604.00344, 2026. QMIX-style CTDE for decentralized communication and topology dec…

  [24] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024. 2294 real GitHub issues from 12 Python repos.

  [25] Ishan Kavathekar et al. TAMAS: Benchmarking adversarial risks in multi-agent LLM systems. In ICML 2025 Multi-Agent Systems Workshop, 2025. First adversarial robustness benchmark for MAS.

  [26] Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, and Shafiq Joty. MAS-Zero: Designing multi-agent systems with zero supervision. arXiv preprint arXiv:2505.14996, 2025. Inference-time self-evolved MAS design through meta-level design feedback and self-verification; accessed 2026-04-27.

  [27] Rana Muhammad Shahroz Khan, Zhen Tan, Sukwon Chen, Patrick Foulds, Sean Yong, Huan Liu, and Tianlong Chen. Agents under siege: Breaking pragmatic multi-agent LLM systems with optimized prompt attacks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. Permutation-invariant adversarial attack on multi-agent…

  [28] Kimi Team. Kimi K2.5: Visual agentic intelligence. https://www.kimi.com/blog/kimi-k2-5.html, 2026. Technical report by Moonshot AI; arXiv 2602.02276; introduces Agent Swarm + PARL; accessed 2026-04-27.

  [29] Kimi Team. Kimi K2.6 tech blog. https://www.kimi.com/blog/kimi-k2-6, 2026. 2026-04-20; 300-agent coordination, Claw Groups; accessed 2026-04-27.

  [30] Sha Li et al. Experience as a compass: Multi-agent RAG with evolving orchestration and agent prompts. arXiv preprint arXiv:2604.00901, 2026. HERA; evolving orchestration policy for MAS-RAG.

  [31] Yanming Li et al. Who deserves the reward? SHARP: Shapley credit-based optimization for multi-agent system. arXiv preprint arXiv:2602.08335, 2026. Shapley-value-based hierarchical credit assignment.

  [32] Junwei Liao et al. MARFT: Multi-agent reinforcement fine-tuning. arXiv preprint arXiv:2504.16129, 2025. v4 dated 2025-11-03; submitted to ICLR 2026.

  [33] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 157–163, 1994. Foundational stochastic/Markov-game formulation for MARL.

  [34] Bo Liu, Simon Yu, Zichen Liu, Leon Guertler, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, and Natasha Jaques. SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. In International Conference on Learning Representations (ICLR), 2026. ICLR 2026 poster; role-c…

  [35] Keliang Liu et al. Reinforcement learning meets large language models: A survey of advancements and applications across the LLM lifecycle. arXiv preprint arXiv:2509.16679, 2025. Fudan/Tongji/CUHK MMLab.

  [37] Shulin Liu et al. MarsRL: Advancing multi-agent reasoning system via reinforcement learning with agentic pipeline parallelism. arXiv preprint arXiv:2511.11373, 2025. Agentic pipeline-parallel RL.

  [38] Shuo Liu, Christopher Amato, et al. LLM collaboration with multi-agent reinforcement learning. arXiv preprint arXiv:2508.04652, 2025. v7 dated 2025-12-09; introduces MA-GRPO.

  [39] Shuo Liu, Tianle Chen, Ryan Amiri, and Christopher Amato. Learning decentralized LLM collaboration with multi-agent actor critic. arXiv preprint arXiv:2601.21972, 2026. Introduces CoLLM-CC and CoLLM-DC actor-critic variants for decentralized LLM collaboration; accessed 2026-05-04.

  [40] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NeurIPS), 2017. MADDPG; centralized critic, decentralized actor.

  [41] Xufang Luo et al. Agent Lightning: Train any AI agents with reinforcement learning. arXiv preprint arXiv:2508.03680, 2025. Microsoft Research; decouples agent execution from RL training.

  [42] Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, and Jun Zhao. A troublemaker with contagious jailbreak makes chaos in honest towns. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. Contagious jailbreak that propagates through agent memory across non-complete-graph topologies.

  [43] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. arXiv preprint arXiv:2311.12983, 2023. 466 tool-use-heavy real-world questions.

  [44] Zhanfeng Mo et al. Multi-agent tool-integrated policy optimization. arXiv preprint arXiv:2510.04678, 2025. Single-LLM dual-role planner+worker; +18.38% over single-agent.

  [45] Sumeet Ramesh Motwani et al. MALT: Improving reasoning with multi-agent LLM training. In Conference on Language Modeling (COLM), 2025. Generator-verifier-refiner training with role-PRM (+14.14%).

  [46] OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/; https://openai.com/index/introducing-the-codex-app/, 2025. 2025-05-16 launch post plus Codex app materials; cloud-native parallel software-engineering agent; accessed 2026-04-27.

  [47] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  [48] Chanwoo Park et al. MAPoRL: Multi-agent post-co-training for collaborative large language models with reinforcement learning. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025. MIT; first explicit post-training RL for collaboration.

  [49] Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, et al. CriticLean: Critic-guided reinforcement learning for mathematical formalization. arXiv preprint arXiv:2507.06181, 2025. Trains a critic via SFT+RL to score Lean 4 formalizations; concrete instance of verifier-as-reward (R6).

  [50] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representations (ICLR), 2024.

  [51] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2018. QMIX; monotonic mixing network.

  [52] Moein Salimi et al. Debate as reward: A multi-agent reward system for scientific ideation via RL post-training. arXiv preprint arXiv:2604.16723, 2026. Multi-agent debate as reward signal.

  [53] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. PPO; clipped surrogate objective.

  [54] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. Original source of GRPO.

  [55] Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, and Igor Mordatch. Multiagent finetuning: Self improvement with diverse reasoning chains. In International Conference on Learning Representations (ICLR), 2025. Finetunes a society of language models using diverse reasoning chains generated through multiagent interaction; ac…

  [56] Weiwei Sun et al. Scaling long-horizon LLM agent via context-folding. arXiv preprint arXiv:2510.11967, 2025. ByteDance Seed/CMU; submitted to ICLR 2026.

  [57] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech M. Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2018. …

  [58] Khanh-Tung Tran, Barry O'Sullivan, et al. Multi-agent collaboration mechanisms: A survey of LLMs. arXiv preprint arXiv:2501.06322, 2025. UCC Ireland.

  [59] Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, and Ying Wen. ReMA: Learning to meta-think for LLMs with multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2025. NeurIPS 2025 poster; multi-agent RL framework for meta-thinking with high-level…

  [60] Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley Q-value: A local reward approach to solve global reward games. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. Shapley-value credit assignment for cooperative MARL.

  [61] Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, et al. MTU-Bench: A multi-granularity tool-use benchmark for large language models. In International Conference on Learning Representations (ICLR), 2025. Five-granularity tool-use benchmark covering single/multi-turn and single/multi-tool scenarios.

  [62] Shijie Wang, Pengfei Li, Yikun Fu, Kaifeng Liu, Fangyuan Li, Yang Liu, Xiaowei Sun, Zonglin Li, Siyao Zhao, Jian Zhao, Kai Tian, Dong Li, Junqi Gao, Yutong Zhang, Yiqun Chen, Yuqiang Li, Zoe Li, Weinan Zhang, Peng Ye, Shuyue Hu, Lei Bai, Bowen Zhou, Kaiyan Zhang, and Biqing Qi. MARTI-MARS 2: Scaling multi-agent self-search via reinforcement learning for code generation. arXiv preprint arXiv:2602.07848, 2026.

  [63] Ziwei Wang, Junjie Zheng, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Zhouhua Fang, Zhiwei Liu, Dajun Chen, Yong Li, and Jiajun Bu. Towards scalable lightweight GUI agents via multi-role orchestration. arXiv preprint arXiv:2604.13488, 2026. Findings of ACL 2026; multi-role orchestration and RL for role-oriented cooperative exploration; accessed 2026-05-04.

  [64] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025. OpenAI 2025-04-17; 1266 hard browsing questions.

  [65] Tianxin Wei, Heng Ji, et al. Agentic reasoning for large language models. arXiv preprint arXiv:2601.12538, 2026. UIUC; 29-author team.

  [66] Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, and Qiaosheng Zhang. MAGIC: A co-evolving attacker-defender adversarial game for robust LLM safety. arXiv preprint arXiv:2602.01539, 2026. Multi-turn attacker-defender multi-agent RL for safety alignment; accessed 2026-05-04.

  [67] David H. Wolpert and Kagan Tumer. Optimal payoff functions for members of collectives. Advances in Complex Systems, 4(2–3):265–279, 2001. Difference rewards / Wonderful Life Utility; foundational credit assignment.

  [68] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  [69] Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu, Weilin Liu, Quanlu Zhang, Wenbo Ding, Chao Yu, and Yu Wang. WideSeek-R1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning. arXiv preprint arXiv:2602.04634, 2026. Lead-agent/subagent MARL for broad information seeking and width scaling; accessed 2026-05-04.

  [71] Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, and Lei Bai. CoMAS: Co-evolving multi-agent systems via interaction rewards. In International Conference on Learning Representations (ICLR), 2026. ICLR 2026 poster; self-evolution through interaction-derived rewards and LLM-as-judge reward construction; accessed 2026-04-27.

  [73] Wei Yang and Jesse Thomason. Learning to deliberate: Meta-policy collaboration for agentic LLMs with multi-agent reinforcement learning. arXiv preprint arXiv:2509.03817, 2025. Introduces MPDF and SoftRankPO for decentralized meta-cognitive actions Persist, Refine, and Concede; accessed 2026-04-27.

  [75] Huaiyuan Yao, Longchao Da, Xiaoou Liu, Charles Fleming, Tianlong Chen, and Hua Wei. LangMARL: Natural language multi-agent reinforcement learning. arXiv preprint arXiv:2604.00722, 2026. Agent-level language credit assignment and policy-gradient evolution in language space; accessed 2026-05-04.

  [76] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024. Sierra Research; retail/airline domains with policy adherence.

  [77] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. Interleaved reasoning+acting; agentic origin.

  [78] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2022. MAPPO; PPO with centralized value for cooperative MARL.

  [80] Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, and Yu Wang. MARSHAL: Incentivizing multi-agent reasoning via self-play with strategic LLMs. In International Conference on Learning Representations (ICLR), 2026. ICLR 2026 poster; turn-level advantage estimation a…

Showing first 80 references.