Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
Pith reviewed 2026-05-08 18:40 UTC · model grok-4.3
The pith
Reinforcement learning for teams of LLM agents must optimize coordination decisions like when to stop, but no current method trains the stopping decision explicitly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Orchestration traces are temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Reward and credit signals attach to eight units from token to team, with explicit counterfactual message-level credit remaining sparse. Orchestration learning decomposes into five sub-decisions, and no explicit RL training method for the stopping decision appears in the 84-entry curated pool as of May 2026.
What carries the argument
Orchestration traces: temporal interaction graphs that record events of spawning, delegating, communicating, aggregating, and stopping in multi-agent LLM systems.
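The artifact ships a minimal JSON schema for replayable traces, but its field names are not reproduced in this review. As a hedged illustration only, one event record in such a temporal interaction graph might look like the Python sketch below; every field name here is an assumption, not the released schema.

```python
import json
from dataclasses import dataclass, field, asdict

# Event vocabulary taken from the paper's list of trace events.
EVENT_TYPES = {"spawn", "delegate", "communicate", "tool_use",
               "return", "aggregate", "stop"}

@dataclass
class TraceEvent:
    event_id: str
    event_type: str               # one of EVENT_TYPES
    timestamp: float              # what makes the interaction graph temporal
    agent_id: str                 # the agent emitting the event
    parent_id: str | None = None  # spawning/delegating agent, if any
    payload: dict = field(default_factory=dict)  # message text, tool args, result

    def __post_init__(self):
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")

# A two-event fragment: the orchestrator spawns a searcher, which later stops.
trace = [
    TraceEvent("e1", "spawn", 0.0, "orchestrator", payload={"role": "searcher"}),
    TraceEvent("e2", "stop", 4.2, "searcher", "orchestrator", {"reason": "goal_met"}),
]
print(json.dumps([asdict(e) for e in trace], indent=2))
```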
If this is right
- Reward design can incorporate eight families that reward parallelism speedup, task split correctness, and aggregation quality (a toy sketch of such reward terms follows this list).
- Credit signals attach to eight different units ranging from single tokens to full teams, though message-level counterfactual credit remains rare.
- Orchestration decomposes into five sub-decisions of spawn, delegate, communicate, aggregate, and stop, with the last one lacking RL methods.
- Industrial deployments such as agent swarms operate at larger scales than open academic evaluations, revealing a verification gap.
- A released JSON schema for replayable traces and the tagged paper pool enable systematic study of coordination learning.
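The paper names these orchestration reward families without giving formulas in the material excerpted here, so the following sketch is one plausible instantiation of three of them under assumed definitions; the latencies, sub-task verdicts, and scores are hypothetical inputs, not the paper's method.

```python
# Minimal sketch of three orchestration reward terms the paper names
# (parallelism speedup, split correctness, aggregation quality).
# The concrete formulas below are assumptions, not the paper's.

def parallelism_speedup_reward(seq_latency: float, par_latency: float) -> float:
    """Reward the wall-clock speedup gained by running sub-agents in parallel."""
    return max(0.0, seq_latency / par_latency - 1.0)

def split_correctness_reward(subtask_results: list[bool]) -> float:
    """Fraction of sub-tasks whose delegated split was solvable as assigned."""
    return sum(subtask_results) / len(subtask_results) if subtask_results else 0.0

def aggregation_quality_reward(aggregate_score: float, best_single: float) -> float:
    """Positive only when aggregation beats the best individual result."""
    return aggregate_score - best_single

# Example: 3x speedup, 2/3 correct splits, aggregation beats the best member.
r = (parallelism_speedup_reward(30.0, 10.0)
     + split_correctness_reward([True, True, False])
     + aggregation_quality_reward(0.9, 0.8))
print(round(r, 3))  # 2.0 + 0.667 + 0.1 ≈ 2.767
```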
Where Pith is reading between the lines
- Training the stopping decision would let agent teams end work once goals are reached, cutting unnecessary computation.
- The same orchestration-trace lens could apply to coordination in non-LLM multi-agent settings such as robotic or simulation teams.
- Closing the industrial-academic scale gap would require sharing traces so academic RL methods can match deployment sizes.
- Adding explicit message-level counterfactual credit could make learning more efficient when communication is the main signal (a minimal sketch follows).
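On the last point, here is a minimal sketch of message-level counterfactual credit in the spirit of COMA's counterfactual baseline ([16] in the reference graph below), transplanted from joint actions to messages; the evaluate function, the baseline messages, and the toy returns are all assumptions.

```python
# Counterfactual credit for a single inter-agent message: advantage of the
# message actually sent over a baseline of counterfactual messages.
# evaluate() stands in for "team return if this message were sent instead";
# how to estimate it (e.g., a learned critic) is left open here.
from typing import Callable

def message_credit(
    evaluate: Callable[[str], float],
    actual_message: str,
    baseline_messages: list[str],   # e.g., "" or messages sampled from the policy
) -> float:
    baseline = sum(evaluate(m) for m in baseline_messages) / len(baseline_messages)
    return evaluate(actual_message) - baseline

# Toy example: a message that lifts team return from ~0.3 to 0.7 earns +0.4 credit.
returns = {"found key fact": 0.7, "": 0.3, "irrelevant chatter": 0.3}
credit = message_credit(lambda m: returns[m], "found key fact",
                        ["", "irrelevant chatter"])
print(round(credit, 3))  # 0.4
```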
Load-bearing premise
The 84-entry paper pool together with its exclusion log fully represents the space of relevant work on RL for LLM-based multi-agent systems.
What would settle it
Publication or discovery of even one explicit reinforcement learning method that trains the stopping decision on orchestration traces would disprove the claimed absence.
Original abstract
As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript surveys reinforcement learning (RL) methods for large language model (LLM)-based multi-agent systems by analyzing orchestration traces—temporal interaction graphs capturing sub-agent spawning, delegation, communication, tool use, aggregation, and stopping. It organizes the literature along three axes: eight families of reward designs (including orchestration rewards for parallelism, split correctness, and aggregation quality), eight credit- or signal-bearing units (from token to team), and five sub-decisions in orchestration learning. A central observation is that, in the authors' curated pool of 84 papers as of May 4, 2026, no explicit RL training method for the stopping decision appears. The work links academic methods to industrial examples from Kimi, OpenAI, and Anthropic and releases an artifact containing the tagged paper pool, 32-record exclusion log, corpus statistics, and a JSON schema for replayable traces.
Significance. If the 84-paper pool and its exclusion criteria are representative, the survey usefully identifies an underexplored area—explicit RL for stopping decisions—in LLM-based multi-agent orchestration, which could direct future work on termination policies. The public release of the tagged pool, exclusion log, scripted statistics, and minimal JSON schema constitutes a concrete contribution to reproducibility, allowing the community to inspect, extend, or challenge the categorization.
major comments (1)
- [Abstract] Abstract and curation description: The gap claim ('no explicit RL training method for the stopping decision') is explicitly scoped to the curated pool, yet the manuscript provides no reproducible account of the search protocol (databases queried, exact Boolean strings, date ranges, or inclusion/exclusion criteria) within the text itself; these details reside only in the GitHub artifact. Because the central finding rests on the completeness of this pool, independent verification of whether relevant RL-MAS papers using alternate terminology were missed is not possible from the manuscript alone.
minor comments (1)
- [Orchestration learning decomposition] The five sub-decisions (spawn, delegate, communicate, aggregate, stop) are clearly enumerated, but the text could add one or two concrete examples of how an implicit stopping mechanism (e.g., value-based termination inside a single policy) would be tagged versus an explicit stopping decision under the proposed schema.
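One hedged way to render the example the referee requests: tag an entry as an explicit stopping decision only when stop is a first-class action in the policy's action space with its own training signal, and as implicit when termination merely falls out of a value estimate. The field names in this sketch are invented for illustration and are not the proposed schema.

```python
# Hypothetical tagging rule: "explicit-stop" requires that stop is an
# action the policy is trained to emit AND that it carries its own
# reward/credit; anything else (e.g., value-based termination inside a
# single policy) is tagged "implicit-stop". Field names are assumptions.

def tag_stopping(entry: dict) -> str:
    explicit = (
        "stop" in entry.get("action_space", [])        # stop is a first-class action
        and entry.get("stop_reward_attached", False)   # and it is trained directly
    )
    return "explicit-stop" if explicit else "implicit-stop"

# An agent that halts when its value head dips below a threshold is implicit;
# one whose policy emits a trained stop action with attached reward is explicit.
print(tag_stopping({"action_space": ["delegate", "communicate"],
                    "value_based_termination": True}))   # implicit-stop
print(tag_stopping({"action_space": ["delegate", "stop"],
                    "stop_reward_attached": True}))      # explicit-stop
```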
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation for minor revision. We address the single major comment below and will incorporate the requested changes to improve reproducibility.
Point-by-point responses
- Referee: [Abstract] Abstract and curation description: The gap claim ('no explicit RL training method for the stopping decision') is explicitly scoped to the curated pool, yet the manuscript provides no reproducible account of the search protocol (databases queried, exact Boolean strings, date ranges, or inclusion/exclusion criteria) within the text itself; these details reside only in the GitHub artifact. Because the central finding rests on the completeness of this pool, independent verification of whether relevant RL-MAS papers using alternate terminology were missed is not possible from the manuscript alone.
Authors: We agree that the manuscript should contain a self-contained description of the search protocol so that the central gap claim can be evaluated without external resources. In the revised version we will add a new subsection (placed in the curation or methods section) that explicitly lists the databases and repositories queried, the exact Boolean search strings, the date range (up to May 4, 2026), and the complete inclusion/exclusion criteria. The 32-record exclusion log and the tagged 84-paper pool will remain available in the GitHub artifact as supplementary material to support further inspection and extension by the community. revision: yes
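A self-contained protocol section could also point readers at the artifact's scripted corpus statistics. The sketch below shows the kind of script meant; the pool filename and tag vocabulary are assumed rather than taken from the repository, which remains the authoritative source.

```python
# Sketch of scripted corpus statistics over the tagged paper pool:
# count papers per orchestration sub-decision and surface the stopping gap.
# "paper_pool.json" and the tag names are hypothetical; consult the released
# artifact at https://github.com/xxzcc/awesome-llm-mas-rl for the real schema.
import json
from collections import Counter

SUB_DECISIONS = ["spawn", "delegate", "communicate", "aggregate", "stop"]

with open("paper_pool.json") as f:   # hypothetical filename
    pool = json.load(f)              # assumed: list of {"title": ..., "tags": [...]}

counts = Counter(tag for paper in pool for tag in paper["tags"]
                 if tag in SUB_DECISIONS)
for d in SUB_DECISIONS:
    print(f"{d:12s} {counts[d]:3d}")

# Under the paper's claim, the count of explicit RL training methods tagged
# "stop" would be 0 in the 84-entry pool as of May 4, 2026.
```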
Circularity Check
No circularity: descriptive literature survey with externally verifiable curation
Full rationale
This is a literature survey paper whose central claims consist of categorizations of an external 84-paper pool along three axes and five sub-decisions, plus the negative observation that no explicit RL method for the stopping decision appears in that pool. No mathematical derivations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work are present. The decomposition into spawn/delegate/communicate/aggregate/stop is offered as an analytical lens rather than a self-defining or tautological construction. The released artifact (tagged pool, exclusion log, schema) makes the curation externally inspectable, satisfying the rule that a self-contained survey against external benchmarks receives score 0-2.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The curated pool of 84 papers plus its exclusion log adequately represents the relevant literature on RL for LLM-based multi-agent systems as of May 2026.
Reference graph
Works this paper leans on
- [1] Anonymous. SAGE: Multi-agent self-evolution for LLM reasoning, 2026. ACL Rolling Review January 2026 submission; challenger, planner, solver, and critic co-evolve from a shared LLM backbone; under review; accessed 2026-05-04.
- [2] Anthropic. Creating custom sub-agents (Claude Code docs). https://docs.anthropic.com/en/docs/claude-code/sub-agents, 2025. Documentation; accessed 2026-04-27.
- [3] Anthropic. Our framework for developing safe and trustworthy agents. https://www.anthropic.com/news/our-framework-for-developing-safe-and-trustworthy-agents, 2025. 2025-08-04; accessed 2026-04-27.
- [4] Anthropic Engineering. Building a C compiler with a team of parallel Claudes. https://www.anthropic.com/engineering/building-c-compiler, 2026. 2026-02-05; 16 parallel Claudes; accessed 2026-04-27.
- [5] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002. Original Dec-POMDP formalism; proves NEXP-completeness.
- [6]
- [7] Yanjun Chen et al. Contextual counterfactual credit assignment for multi-agent reinforcement learning in LLM collaboration. arXiv preprint arXiv:2603.06859, 2026. Counterfactual causal credit assignment at the message level. (Linked work page title: "Exact Is Easier: Credit Assignment for Cooperative LLM Agents".)
- [8] Yixing Chen et al. Multi-agent evolve: LLM self-improve through co-evolution. arXiv preprint arXiv:2510.23595, 2025. Proposer-Solver-Judge co-evolution; UIUC ulab.
- [9] Igor Costa. AgentSpawn: Adaptive multi-agent collaboration through dynamic spawning for long-horizon code generation. arXiv preprint arXiv:2602.07072, 2026. Runtime dynamic spawning with memory transfer; sole author.
- [10] Yufan Dang et al. Multi-agent collaboration via evolving orchestration. In Advances in Neural Information Processing Systems (NeurIPS), 2025. Puppeteer central orchestrator; Tsinghua/OpenBMB ChatDev team.
- [11] Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip H. S. Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020. IPPO; independent PPO competitive on SMAC.
- [12] Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. 97 realistic tasks, 629 security test cases; ETH SPY Lab.
- [13] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 645:633–638, 2025. Rule-based RL unlocks long-CoT reasoning; R1 and R1-Zero.
- [14] Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaudhuri. WASP: Benchmarking web agent security against prompt injection attacks. arXiv preprint arXiv:2504.18575, 2025. Meta FAIR; end-to-end web-agent prompt-injection benchmark.
- [15]
- [16] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. COMA; counterfactual baseline for per-agent credit.
- [17] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), 2023. Foundational paper on indirect prompt injection.
- [18] Zhitao He, Zijun Liu, Peng Li, Yi R. Fung, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu. Advancing language multi-agent learning with credit re-assignment for interactive environment generalization. In Conference on Language Modeling (COLM), 2025. Introduces CollabUIAgents and multi-agent credit re-assignment for interactive UI/web environments; accessed ...
- [19] Haoyang Hong et al. Multi-agent deep research: Training multi-agent systems with M-GRPO. arXiv preprint arXiv:2511.13288, 2025. Hierarchical GRPO; Ant Group.
- [20] Zhipeng Hou et al. HALO: Hierarchical autonomous logic-oriented orchestration for multi-agent LLM systems. arXiv preprint arXiv:2505.13516, 2025. MCTS-based three-layer hierarchical MAS.
- [21] Hao-Lun Hsu, Jing Xu, Nikhil Vichare, Francesco Carbone, Miroslav Pajic, and Giuseppe Carenini. DEPART: Hierarchical multi-agent system for multi-turn interaction, 2026. OpenReview ICLR 2026 submission; introduces HIMPO for alternating planner/executor post-training with role-specific rewards; accessed 2026-05-04.
- [22] Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Ziyu Ye, Bowei Xia, Tao Sun, Zhaoxuan Jin, Yingru Li, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, and Guohao Li. OWL: Optimized workforce learning for general multi-agent assistance in real-world task automation. In Advances in Neural Information Processing Systems (NeurIPS), 2025. Neur...
- [23] Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, and Ying Nian Wu. Agent Q-Mix: Selecting the right action for LLM multi-agent systems through reinforcement learning. arXiv preprint arXiv:2604.00344, 2026. QMIX-style CTDE for decentralized communication and topology dec...
- [24] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024. 2294 real GitHub issues from 12 Python repos.
- [25] Ishan Kavathekar et al. TAMAS: Benchmarking adversarial risks in multi-agent LLM systems. In ICML 2025 Multi-Agent Systems Workshop, 2025. First adversarial robustness benchmark for MAS.
- [26] Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, and Shafiq Joty. MAS-Zero: Designing multi-agent systems with zero supervision. arXiv preprint arXiv:2505.14996, 2025. Inference-time self-evolved MAS design through meta-level design feedback and self-verification; accessed 2026-04-27.
- [27] Rana Muhammad Shahroz Khan, Zhen Tan, Sukwon Chen, Patrick Foulds, Sean Yong, Huan Liu, and Tianlong Chen. Agents under siege: Breaking pragmatic multi-agent LLM systems with optimized prompt attacks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. Permutation-invariant adversarial attack on multi-age...
- [28] Kimi Team. Kimi K2.5: Visual agentic intelligence. https://www.kimi.com/blog/kimi-k2-5.html, 2026. Technical report by Moonshot AI; arXiv 2602.02276; introduces Agent Swarm + PARL; accessed 2026-04-27.
- [29] Kimi Team. Kimi K2.6 tech blog. https://www.kimi.com/blog/kimi-k2-6, 2026. 2026-04-20; 300-agent coordination, Claw Groups; accessed 2026-04-27.
- [30] Sha Li et al. Experience as a compass: Multi-agent RAG with evolving orchestration and agent prompts. arXiv preprint arXiv:2604.00901, 2026. HERA; evolving orchestration policy for MAS-RAG.
- [31] Yanming Li et al. Who deserves the reward? SHARP: Shapley credit-based optimization for multi-agent system. arXiv preprint arXiv:2602.08335, 2026. Shapley-value-based hierarchical credit assignment.
- [32] Junwei Liao et al. MARFT: Multi-agent reinforcement fine-tuning. arXiv preprint arXiv:2504.16129, 2025. v4 dated 2025-11-03; submitted to ICLR 2026.
- [33] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 157–163, 1994. Foundational stochastic/Markov-game formulation for MARL.
- [34] Bo Liu, Simon Yu, Zichen Liu, Leon Guertler, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, and Natasha Jaques. SPIRAL: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. In International Conference on Learning Representations (ICLR), 2026. ICLR 2026 poster; role-c...
- [35] Keliang Liu et al. Reinforcement learning meets large language models: A survey of advancements and applications across the LLM lifecycle. arXiv preprint arXiv:2509.16679, 2025. Fudan/Tongji/CUHK MMLab.
- [37] Shulin Liu et al. MarsRL: Advancing multi-agent reasoning system via reinforcement learning with agentic pipeline parallelism. arXiv preprint arXiv:2511.11373, 2025. Agentic pipeline-parallel RL.
- [38] Shuo Liu, Christopher Amato, et al. LLM collaboration with multi-agent reinforcement learning. arXiv preprint arXiv:2508.04652, 2025. v7 dated 2025-12-09; introduces MA-GRPO.
- [39] Shuo Liu, Tianle Chen, Ryan Amiri, and Christopher Amato. Learning decentralized LLM collaboration with multi-agent actor critic. arXiv preprint arXiv:2601.21972, 2026. Introduces CoLLM-CC and CoLLM-DC actor-critic variants for decentralized LLM collaboration; accessed 2026-05-04.
- [40] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NeurIPS), 2017. MADDPG; centralized critic, decentralized actor.
- [41] Xufang Luo et al. Agent Lightning: Train any AI agents with reinforcement learning. arXiv preprint arXiv:2508.03680, 2025. Microsoft Research; decouples agent execution from RL training.
- [42] Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, and Jun Zhao. A troublemaker with contagious jailbreak makes chaos in honest towns. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. Contagious jailbreak that propagates through agent memory across non-complete-graph topologies.
- [43] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. arXiv preprint arXiv:2311.12983, 2023. 466 tool-use-heavy real-world questions.
- [44] Zhanfeng Mo et al. Multi-agent tool-integrated policy optimization. arXiv preprint arXiv:2510.04678, 2025. Single-LLM dual-role planner+worker; +18.38% over single-agent.
- [45] Sumeet Ramesh Motwani et al. MALT: Improving reasoning with multi-agent LLM training. In Conference on Language Modeling (COLM), 2025. Generator-verifier-refiner training with role-PRM (+14.14%).
- [46] OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/; https://openai.com/index/introducing-the-codex-app/, 2025. 2025-05-16 launch post plus Codex app materials; cloud-native parallel software-engineering agent; accessed 2026-04-27.
- [47] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac..., 2022.
- [48] Chanwoo Park et al. MAPoRL: Multi-agent post-co-training for collaborative large language models with reinforcement learning. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025. MIT; first explicit post-training RL for collaboration.
- [49] Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, et al. CriticLean: Critic-guided reinforcement learning for mathematical formalization. arXiv preprint arXiv:2507.06181, 2025. Trains a critic via SFT+RL to score Lean 4 formalizations; concrete instance of verifier-as-reward (R6).
- [50] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representations (ICLR), 2024.
- [51] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2018. QMIX; monotonic mixing network.
- [52] Moein Salimi et al. Debate as reward: A multi-agent reward system for scientific ideation via RL post-training. arXiv preprint arXiv:2604.16723, 2026. Multi-agent debate as reward signal.
- [53] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. PPO; clipped surrogate objective.
- [54] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. Original source of GRPO.
- [55] Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, and Igor Mordatch. Multiagent finetuning: Self improvement with diverse reasoning chains. In International Conference on Learning Representations (ICLR), 2025. Finetunes a society of language models using diverse reasoning chains generated through multiagent interaction; ac...
- [56] Weiwei Sun et al. Scaling long-horizon LLM agent via context-folding. arXiv preprint arXiv:2510.11967, 2025. ByteDance Seed/CMU; submitted to ICLR 2026.
- [57] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech M. Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2018...
- [58] Khanh-Tung Tran, Barry O'Sullivan, et al. Multi-agent collaboration mechanisms: A survey of LLMs. arXiv preprint arXiv:2501.06322, 2025. UCC Ireland.
- [59] Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, and Ying Wen. ReMA: Learning to meta-think for LLMs with multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2025. NeurIPS 2025 poster; multi-agent RL framework for meta-thinking with high-l...
- [60] Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley Q-value: A local reward approach to solve global reward games. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. Shapley-value credit assignment for cooperative MARL.
- [61] Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, et al. MTU-Bench: A multi-granularity tool-use benchmark for large language models. In International Conference on Learning Representations (ICLR), 2025. Five-granularity tool-use benchmark covering single/multi-turn and single/multi-tool scenarios.
- [62] Shijie Wang, Pengfei Li, Yikun Fu, Kaifeng Liu, Fangyuan Li, Yang Liu, Xiaowei Sun, Zonglin Li, Siyao Zhao, Jian Zhao, Kai Tian, Dong Li, Junqi Gao, Yutong Zhang, Yiqun Chen, Yuqiang Li, Zoe Li, Weinan Zhang, Peng Ye, Shuyue Hu, Lei Bai, Bowen Zhou, Kaiyan Zhang, and Biqing Qi. MARTI-MARS 2: Scaling multi-agent self-search via reinforcement learning for...
- [63] Ziwei Wang, Junjie Zheng, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Zhouhua Fang, Zhiwei Liu, Dajun Chen, Yong Li, and Jiajun Bu. Towards scalable lightweight GUI agents via multi-role orchestration. arXiv preprint arXiv:2604.13488, 2026. Findings of ACL 2026; multi-role orchestration and RL for role-oriented cooperative exploration; accessed 2026-05-04.
- [64] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025. OpenAI 2025-04-17; 1266 hard browsing questions.
- [65] Tianxin Wei, Heng Ji, et al. Agentic reasoning for large language models. arXiv preprint arXiv:2601.12538, 2026. UIUC; 29-author team.
- [66] Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, and Qiaosheng Zhang. MAGIC: A co-evolving attacker-defender adversarial game for robust LLM safety. arXiv preprint arXiv:2602.01539, 2026. Multi-turn attacker-defender multi-agent RL for safety alignment; accessed 2026-05-04.
- [67] David H. Wolpert and Kagan Tumer. Optimal payoff functions for members of collectives. Advances in Complex Systems, 4(2–3):265–279, 2001. Difference rewards / Wonderful Life Utility; foundational credit assignment.
- [68] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [69] Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu, Weilin Liu, Quanlu Zhang, Wenbo Ding, Chao Yu, and Yu Wang. WideSeek-R1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning. arXiv preprint arXiv:2602.04634, 2026. Lead-agent/subagent MARL for broad information seeking and width scaling; accessed 2026-05-04.
- [71] Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, and Lei Bai. CoMAS: Co-evolving multi-agent systems via interaction rewards. In International Conference on Learning Representations (ICLR), 2026. ICLR 2026 poster; self-evolution through interaction-derived rewards and LLM-as-judge reward construction; accessed 2026-04-27.
- [73] Wei Yang and Jesse Thomason. Learning to deliberate: Meta-policy collaboration for agentic LLMs with multi-agent reinforcement learning. arXiv preprint arXiv:2509.03817, 2025. Introduces MPDF and SoftRankPO for decentralized meta-cognitive actions Persist, Refine, and Concede; accessed 2026-04-27.
- [75] Huaiyuan Yao, Longchao Da, Xiaoou Liu, Charles Fleming, Tianlong Chen, and Hua Wei. LangMARL: Natural language multi-agent reinforcement learning. arXiv preprint arXiv:2604.00722, 2026. Agent-level language credit assignment and policy-gradient evolution in language space; accessed 2026-05-04.
- [76] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024. Sierra Research; retail/airline domains with policy adherence.
- [77] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. Interleaved reasoning+acting; agentic origin.
- [78] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2022. MAPPO; PPO with centralized value for cooperative MARL.
- [80] Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, and Yu Wang. MARSHAL: Incentivizing multi-agent reasoning via self-play with strategic LLMs. In International Conference on Learning Representations (ICLR), 2026. ICLR 2026 poster; turn-level advantage estimation a...
discussion (0)