MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
Pith reviewed 2026-05-10 11:23 UTC · model grok-4.3
The pith
MARS² lets multiple agents collaborate inside a shared tree search to improve reinforcement learning for code generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MARS² is a unified RL framework in which multiple independently-optimized agents collaborate within a shared tree-structured search environment. The search tree is treated as a learnable multi-agent interaction setting that lets heterogeneous agents generate and refine candidate solutions together. A path-level group advantage formulation based on tree-consistent reward shaping supports credit assignment across the resulting complex trajectories. Experiments confirm that coupling multi-agent collaboration with tree search raises performance on code generation benchmarks across varied model combinations and training regimes.
What carries the argument
Shared multi-agent tree search environment with path-level group advantage based on tree-consistent reward shaping for credit assignment.
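To make that machinery concrete, here is a minimal sketch of one way such an advantage could be computed. It assumes mean-over-children reward propagation and GRPO-style normalization over the group of root-to-leaf paths sampled for one problem; the `Node` schema, function names, and the exact shaping rule are illustrative assumptions, not the paper's specification.

```python
from dataclasses import dataclass, field
from typing import Callable, List
import math

@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0  # filled in by shape_tree_rewards

def shape_tree_rewards(node: Node, leaf_reward: Callable[[Node], float]) -> float:
    """Tree-consistent reward shaping (illustrative rule, not necessarily the
    paper's): a leaf gets its outcome reward (e.g. unit tests passed), and each
    internal node takes the mean of its children, so shared prefixes of
    different paths receive consistent credit."""
    if not node.children:
        node.value = leaf_reward(node)
    else:
        node.value = sum(shape_tree_rewards(c, leaf_reward)
                         for c in node.children) / len(node.children)
    return node.value

def path_group_advantages(path_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Path-level group advantage: z-score the shaped rewards of the group of
    root-to-leaf paths sampled for one problem (GRPO-style), so every token on
    a path shares that path's scalar advantage."""
    mu = sum(path_rewards) / len(path_rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in path_rewards) / len(path_rewards))
    return [(r - mu) / (sigma + eps) for r in path_rewards]
```

The design point the sketch isolates: shaping happens on the tree so shared prefixes agree, while normalization happens across complete paths so the policy gradient compares whole solutions.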
If this is right
- Code generation accuracy rises consistently when multi-agent collaboration is embedded inside tree search rather than used separately.
- The same framework delivers gains across many different model sizes, architectures, and training schedules.
- Structured exploration from trees plus diverse signals from multiple policies together produce stronger learning than either element alone.
- Path-level credit assignment makes stable training feasible even when search trajectories become long and branched.
Where Pith is reading between the lines
- The same multi-agent tree structure could be tested on other structured reasoning tasks such as mathematical proof generation or program synthesis.
- Giving agents distinct roles or tool sets inside the shared tree might increase solution diversity further.
- Scaling might occur by growing the number of collaborating agents and tree depth instead of solely enlarging individual models.
- The tree topology itself could be made learnable so the search structure adapts during training.
Load-bearing premise
The path-level group advantage with tree-consistent reward shaping assigns credit reliably across multi-agent search paths without causing instability or bias that would erase the performance gains.
What would settle it
An ablation on a code generation benchmark showing that swapping the path-level group advantage for a simpler single-agent baseline loses no accuracy, or that the full formulation adds training instability, would falsify the central claim.
Original abstract
Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, which remains constrained by the single-agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS² (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently-optimized agents collaborate within a shared tree-structured search environment. MARS² models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS² consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MARS², a unified RL framework for code generation in which multiple independently optimized agents collaborate inside a shared tree-structured search environment. It introduces a path-level group advantage formulation with tree-consistent reward shaping to enable credit assignment over complex trajectories, and reports that this coupling of multi-agent interaction with tree search yields consistent performance gains across diverse model combinations and training settings.
Significance. If the empirical claims hold after addressing the credit-assignment details, the work would offer a concrete route to increase trajectory diversity in RL for reasoning tasks beyond single-agent search priors. The public code release at https://github.com/TsinghuaC3I/MARTI supports reproducibility and is a positive contribution.
major comments (2)
- [§3] Path-level group advantage formulation: the description of the advantage estimator does not include per-agent normalization, trajectory-length correction, or variance controls. In a shared tree topology, heterogeneous agents produce overlapping paths; without these safeguards the grouping can let longer or higher-variance trajectories from stronger agents dominate the advantage signal, risking that observed gains are artifacts of reward shaping rather than genuine multi-agent collaboration. This directly affects the central claim that the formulation enables effective credit assignment.
- [Experiments] While the abstract states 'consistent improvements across diverse model combinations,' the manuscript supplies no quantitative tables, baseline deltas, statistical significance tests, or ablation results isolating the contribution of the path-level advantage versus simpler multi-agent or tree-search baselines. Without these data it is impossible to verify whether the claimed unification actually delivers the reported gains.
minor comments (2)
- [§3] Notation for the tree-consistent reward shaping is introduced without an explicit equation reference or pseudocode; a single displayed equation would improve clarity (a hedged illustration follows this list).
- [Abstract] The abstract claims performance gains but contains no numerical results or baseline names; moving at least one key table or figure reference into the abstract would help readers assess the magnitude of improvement.
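For concreteness, one plausible displayed form of the shaping and the path-level advantage, assuming mean-over-children propagation and GRPO-style group normalization (an assumption; the paper's actual equation may differ):

```latex
% Hedged guess at the tree-consistent shaping and path-level group advantage;
% notation is illustrative, not the paper's.
\[
\tilde{r}(v) =
\begin{cases}
r(v), & v \text{ a leaf},\\[4pt]
\dfrac{1}{|C(v)|}\displaystyle\sum_{u \in C(v)} \tilde{r}(u), & \text{otherwise},
\end{cases}
\qquad
A(\tau_i) = \frac{\tilde{r}(\ell_i) - \mu_{\mathcal{G}}}{\sigma_{\mathcal{G}} + \epsilon},
\]
```

where C(v) are the children of node v, ℓᵢ is the leaf terminating path τᵢ, and μ_G, σ_G are the mean and standard deviation of shaped leaf rewards over the sampled group G.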
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and clarify our responses while committing to revisions that strengthen the technical exposition and empirical support without altering the core claims.
Point-by-point responses
- Referee [§3]: the description of the advantage estimator does not include per-agent normalization, trajectory-length correction, or variance controls. In a shared tree topology, heterogeneous agents produce overlapping paths; without these safeguards the grouping can let longer or higher-variance trajectories from stronger agents dominate the advantage signal, risking that observed gains are artifacts of reward shaping rather than genuine multi-agent collaboration. This directly affects the central claim that the formulation enables effective credit assignment.
Authors: We appreciate this precise concern about potential bias in the shared-tree setting. The path-level group advantage groups complete trajectories by their terminal nodes and applies tree-consistent reward shaping to ensure rewards are assigned consistently along each path. However, the original §3 description did not explicitly detail safeguards against dominance by longer or higher-variance paths. We agree this omission could affect interpretability of the credit-assignment claim. In the revised manuscript we will augment §3 with per-agent advantage normalization, explicit trajectory-length correction, and a variance-control baseline. Updated equations and a short derivation showing how these preserve the group-level signal while mitigating the identified risks will be included. revision: yes
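As a reading aid (not the authors' code), the committed safeguards could look like the following sketch; the per-agent z-scoring, the linear length penalty, the clipping threshold, and the path dict schema are all illustrative assumptions.

```python
import math
from collections import defaultdict

def safeguarded_advantages(paths, lam=0.01, clip=5.0, eps=1e-8):
    """Hypothetical safeguards for the path-level group advantage:
    (i) per-agent normalization, so a stronger agent's reward scale cannot
    dominate the group; (ii) a linear trajectory-length correction; and
    (iii) clipping as a crude variance control. Each path is a dict with
    'agent', 'reward', and 'length' keys (an illustrative schema)."""
    by_agent = defaultdict(list)
    for p in paths:
        by_agent[p["agent"]].append(p)
    for agent, group in by_agent.items():
        rs = [p["reward"] - lam * p["length"] for p in group]  # length correction
        mu = sum(rs) / len(rs)
        sigma = math.sqrt(sum((r - mu) ** 2 for r in rs) / len(rs))
        for p, r in zip(group, rs):
            a = (r - mu) / (sigma + eps)               # per-agent z-score
            p["advantage"] = max(-clip, min(clip, a))  # variance control
    return paths
```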
- Referee [Experiments]: while the abstract states 'consistent improvements across diverse model combinations,' the manuscript supplies no quantitative tables, baseline deltas, statistical significance tests, or ablation results isolating the contribution of the path-level advantage versus simpler multi-agent or tree-search baselines. Without these data it is impossible to verify whether the claimed unification actually delivers the reported gains.
Authors: We acknowledge that the experimental presentation in the submitted manuscript was insufficiently detailed for independent verification. While Section 4 reports performance across model combinations, it did not include explicit delta tables, statistical tests, or ablations that isolate the path-level group advantage. We have prepared the following additions for the revision: (i) expanded tables with absolute scores, deltas relative to single-agent tree search and multi-agent baselines, (ii) paired t-test p-values across runs, and (iii) a dedicated ablation subsection comparing the full MARS² formulation against ablated variants that remove either the group advantage or the tree structure. These changes will be placed in the Experiments section and will directly quantify the contribution of the proposed unification. revision: yes
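For the promised paired tests, a standard recipe suffices. A minimal sketch using scipy.stats.ttest_rel, on placeholder per-seed scores that are not numbers from the paper:

```python
from scipy import stats

# Hypothetical per-seed pass@1 scores on a code benchmark; placeholders
# only, shown to illustrate the shape of the analysis.
mars2  = [41.2, 39.8, 42.5, 40.7, 41.9]
single = [38.9, 38.1, 40.2, 39.0, 39.5]  # single-agent tree-search baseline

# Paired t-test across matched seeds, as the rebuttal proposes.
t_stat, p_value = stats.ttest_rel(mars2, single)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```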
Circularity Check
Empirical framework proposal with no load-bearing circular derivations
Full rationale
The manuscript introduces MARS² as a unified RL framework combining multi-agent collaboration and tree search for code generation. The central technical contribution is the path-level group advantage formulation with tree-consistent reward shaping, presented as a design choice to support credit assignment. No equations, uniqueness theorems, or first-principles derivations are exhibited that reduce the claimed performance gains to fitted parameters, self-referential quantities, or self-citation chains. Experiments are framed as empirical validation across model combinations rather than predictions derived from the inputs by construction. This is a standard empirical systems paper; the derivation chain is self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.
Reference graph
Works this paper leans on
- [1] Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. 2025. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models. https://hkunlp.github.io/blog/2025/Polaris
- [2]
- [3] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, and others. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231.
- [4] Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, and others. 2025. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning.
- [5] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and others. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [6]
- [7] Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. 2025. TreeRL: LLM reinforcement learning with on-policy tree search. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12355–12369.
- [8]
- [9] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
- [10] Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostafa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. 2025. Prismatic synthesis: Gradient-based data diversification boosts generalization in LLM reasoning. arXiv preprint arXiv:2505.20161.
- [11]
- [12] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- [13] Seonghyeon Lee, HeeJae Chon, Joonwon Jang, Dongha Lee, and Hwanjo Yu. 2025. How diversely can language models solve problems? Exploring the algorithmic diversity of model-generated code. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China. Association for Computational Linguistics.
- [14] Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, Zheng Zhang, Wei Shen, Qian Liu, Chenghua Lin, Jian Yang, Ge Zhang, and Wenhao Huang. 2025. TreePO: Bridging the gap of policy optimization and efficacy and inference efficiency with heuristic tre... https://arxiv.org/abs/2508.17445
- [15] Shulin Liu, Dong Du, Tao Yang, Yang Li, and Boyu Qiu. 2025. MarsRL: Advancing multi-agent reasoning system via reinforcement learning with agentic pipeline parallelism.
- [16] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing ... https://proceedings.neurips.cc/paper/2017/hash/68a9750337a418a86fe06c1991a1d64c-Abstract.html
- [17] Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, and others. 2025. DeepCoder: A fully open-source 14B coder at o3-mini level. Notion Blog.
- [18] Chanwoo Park, Seungju Han, Xingzhi Guo, Asuman E. Ozdaglar, Kaiqing Zhang, and Joo-Kyung Kim. 2025a. MAPoRL: Multi-agent post-co-training for collaborative large language models with reinforcement learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1... https://aclanthology.org/2025.acl-long.1459/
- [19] Chanwoo Park, Seungju Han, Xingzhi Guo, Asuman E. Ozdaglar, Kaiqing Zhang, and Joo-Kyung Kim. 2025b. MAPoRL: Multi-agent post-co-training for collaborative large language models with reinforcement learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30215–30248.
- [20] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [21]
- [22] David Silver and others. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.
- [23] Yuda Song, Julia Kempe, and Rémi Munos. 2025. Outcome-based exploration for LLM reasoning. In NeurIPS 2025 Workshop: Second Workshop on Aligning Reinforcement Learning Experimentalists and Theorists. https://openreview.net/forum?id=VORSpYLBJ6
- [24] William R. Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294.
- [25] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [26] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 others. 2024. Qwen2.5 technical report. CoRR, abs/2412.15115. doi:10.48550/ARXIV.2412.15115.
- [27] Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. 2025. Your efficient RL framework secretly brings you off-policy RL training. August 2025. URL https://fengyao.notion.site/off-policy-rl
- [28] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, and others. 2025. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- [29] Kaiyan Zhang, Runze Liu, Xuekai Zhu, Kai Tian, Sihang Zeng, Guoli Jia, Yuchen Fan, Xingtai Lv, Yuxin Zuo, Che Jiang, Ziyang Liu, Jianyu Wang, Yuru Wang, Ruotong Zhao, Ermo Hua, Yibo Wang, Shijie Wang, Junqi Gao, Xinwei Long, and 7 others. 2025a. MARTI: A framework for multi-agent LLM systems reinforced training and inference. https://github.com/TsinghuaC3I/MARTI
- [30] Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, and others. 2025b. A survey on test-time scaling in large language models: What, how, where, and how well? arXiv preprint arXiv:2503.24235.
- [31] Shaowei Zhang and Deyi Xiong. 2025. Debate4Math: Multi-agent debate for fine-grained reasoning in math. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, July 27–August 1, 2025, pages 16810–16824. Association for Computational Linguistics. https://aclanthology.org/2025.findings-acl.862/
- [32] Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. 2025. Absolute Zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335.
- [33] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, and Shen Li. 2023. PyTorch FSDP: Experiences on scaling fully sharded data parallel. CoRR, abs/2304.11277. doi:10.48550/ARXIV.2304.11277.
- [34] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, and others. 2025. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
- [35]
- [36] Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, and others. 2025. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084.