Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

Chenxi Liu; Heng Huang; Ruibo Chen; Tianyi Xiong; Tong Zheng; Yanshuo Chen

arxiv: 2601.22297 · v2 · pith:44QNB6TKnew · submitted 2026-01-29 · 💻 cs.CL

Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng, Heng Huang

Pith reviewed 2026-05-21 14:22 UTC · model grok-4.3

classification 💻 cs.CL

keywords: self-debate, reinforcement learning, multi-agent debate, LLM reasoning, collaborative reasoning, RLVR, reasoning models

The pith

Training LLMs on self-generated debates strengthens both solo reasoning and multi-agent collaboration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning for reasoning has focused on isolated problem solving, leaving models unprepared to weigh conflicting or complementary lines of thought from others. The paper shows that a model can instead sample several of its own candidate solutions to a prompt, assemble them into a simulated debate, and then produce a follow-up response that draws on that context. Joint optimization of the original answer and the debate-informed answer yields a single model that solves problems more accurately on its own and also extracts more value when placed inside actual multi-agent debates. A reader should care because the approach removes the need to train or maintain separate models just to obtain the benefits of collaboration.

Core claim

Self-Debate Reinforcement Learning first draws multiple candidate solutions from the model for a given prompt, constructs a debate context that contains these distinct reasoning paths, and then generates second-turn responses conditioned on the context. Jointly optimizing both the initial responses and the debate-conditioned responses produces a model that is effective as a standalone solver and as a participant in multi-agent debate.
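The data flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt template, the function names (`build_debate_context`, `self_debate_rollouts`), the default sample counts, and the `toy_generate` stand-in for a real LLM sampler are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str
    response: str
    turn: int  # 1 = initial response, 2 = debate-conditioned response


def build_debate_context(question: str, candidates: List[str]) -> str:
    """Pack distinct candidate solutions into one second-turn prompt (hypothetical template)."""
    distinct = list(dict.fromkeys(candidates))  # drop exact duplicates, keep order
    parts = [f"Question: {question}", "Solutions from other agents:"]
    parts += [f"Agent {i + 1}: {c}" for i, c in enumerate(distinct)]
    parts.append("Considering these solutions, give your updated answer.")
    return "\n".join(parts)


def self_debate_rollouts(
    question: str,
    generate: Callable[[str], str],
    n_candidates: int = 4,   # assumed value, not taken from the paper
    n_second_turn: int = 2,  # assumed value, not taken from the paper
) -> List[Rollout]:
    """Sample first-turn candidates, build a debate context, then sample second-turn responses."""
    first = [generate(question) for _ in range(n_candidates)]
    context = build_debate_context(question, first)
    second = [generate(context) for _ in range(n_second_turn)]
    rollouts = [Rollout(question, r, 1) for r in first]
    rollouts += [Rollout(context, r, 2) for r in second]
    return rollouts  # both turns would feed the same policy update


def toy_generate(prompt: str) -> str:
    """Stand-in for an LLM sampler; a real system would call the policy model here."""
    return random.choice(["The answer is 4.", "The answer is 5."])


rollouts = self_debate_rollouts("What is 2 + 2?", toy_generate)
```

Both the first-turn and the second-turn rollouts end up in one list, which mirrors the joint optimization of initial and debate-conditioned responses; how the two groups are weighted or filtered is left open here.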

What carries the argument

Self-Debate Reinforcement Learning (SDRL), the procedure of sampling diverse solution trajectories from one model, packaging them as debate context, and jointly optimizing first-turn and context-conditioned outputs.

If this is right

SDRL raises Multi-Agent Debate accuracy across different debate protocols and numbers of agents.
The same training simultaneously lifts the model's standalone performance on reasoning benchmarks.
The gains appear consistently for several base models and across multiple reasoning tasks.
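For readers unfamiliar with how Multi-Agent Debate accuracy is typically measured, the following sketch shows one simple protocol: every agent answers, sees its peers' answers for a fixed number of rounds, and the final answers are aggregated by majority vote. The protocol, the `Agent` signature, and the two toy agents are assumptions for illustration; the paper evaluates several protocols that are not reproduced here.

```python
from collections import Counter
from typing import Callable, List

# An agent maps (question, peer_answers) to an answer string (hypothetical interface).
Agent = Callable[[str, List[str]], str]


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def run_debate(question: str, agents: List[Agent], rounds: int = 1) -> str:
    """Simultaneous-update debate: each round, every agent sees the others' previous answers."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return majority_vote(answers)


def stubborn(question: str, peers: List[str]) -> str:
    """Toy agent that never changes its answer."""
    return "4"


def conformist(question: str, peers: List[str]) -> str:
    """Toy agent that adopts the peers' majority answer once it sees one."""
    return majority_vote(peers) if peers else "5"


final = run_debate("What is 2 + 2?", [stubborn, stubborn, conformist], rounds=1)
```

Accuracy over a benchmark would then be the fraction of questions for which `run_debate` returns the reference answer.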

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

A model prepared this way may cope better with noisy or contradictory information supplied at inference time.
Self-debate offers a way to generate collaborative training data without requiring access to multiple separate models.
Applying the method iteratively over several rounds of self-debate could further increase robustness on long-horizon tasks.

Load-bearing premise

That the range of solutions a single model can sample on its own produces debate contexts whose diversity and quality match those that would arise among truly independent models.
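One rough way to probe this premise is to measure how varied the sampled candidates actually are. The sketch below computes the Shannon entropy of the final answers in a sample set; the function name and the choice of entropy as a diversity proxy are assumptions made here, not a metric reported in the paper.

```python
import math
from collections import Counter
from typing import List


def answer_entropy(final_answers: List[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution over final answers."""
    counts = Counter(final_answers)
    n = len(final_answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


collapsed = answer_entropy(["4", "4", "4", "4"])  # every sample agrees
split = answer_entropy(["4", "5", "4", "5"])      # two answers, evenly split
```

This proxy only captures disagreement in final answers. It says nothing about differences in reasoning style or error patterns, which is where independent models may differ most from repeated samples of one model.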

What would settle it

If a model trained with SDRL shows no improvement, or even a drop, in actual multi-agent debates run with other distinct LLMs on the same benchmarks, the claim that self-debate prepares models for collaboration would be falsified.

Figures

Figures reproduced from arXiv: 2601.22297 by Chenxi Liu, Heng Huang, Ruibo Chen, Tianyi Xiong, Tong Zheng, Yanshuo Chen.

**Figure 2.** Prompt for debate training.

**Figure 3.** Prompt for multi-agent evaluation. A three-agent debate is shown for brevity.

**Figure 4.** Training dynamics of debate-related values using SDRL (Appendix D.1): (i) the number of constructed debate prompts and (ii) the number of debate-conditioned responses used at each prompt-generation step on the training set.

**Figure 5.** Initial responses on MATH500 with 5 agents.

**Figure 6.** Responses after one round of debate on MATH500 with 5 agents.

Critique-augmented update and constant mass. The belief update (Definition 4.1) is

$$\alpha_{i,t} = \alpha_{i,t-1} + \beta_{i,t-1} + w_i c_{i,t}, \qquad \beta_{i,t-1} \in \mathbb{R}^{K}_{+}, \quad w_i \ge 0. \tag{16}$$

Throughout the proofs we use the constant-mass assumption

$$\|\beta_{i,t}\|_1 = m_\beta \quad \text{for all } i, t, \tag{17}$$

which corresponds to adding a fixed "budget" of private pseudo-observations per round. Why assume constant $\|\beta\|_1$. …

**Figure 7.** Responses after two rounds of debate on MATH500 with 5 agents.

Critique advantage. Recall Definition 4.3:

$$\delta_{i,t} := \mathbb{E}\big[\beta^{(1)}_{i,t} \mid \mathcal{F}_t\big] - m_\beta\, p_{i,t}. \tag{18}$$

E.2. One-step drift decomposition

Lemma E.1. Under (8)–(9) and $\|\beta_{i,t-1}\|_1 = m_\beta$,

$$\mathbb{E}[p_{i,t} \mid \mathcal{F}_{t-1}] = p_{i,t-1} + \frac{\delta_{i,t-1}}{Z_{i,t-1}} + \frac{w_i\,|N(i)|}{Z_{i,t-1}} \cdots$$

Original abstract

The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework where models learn from self-debate, equipping a single LLM with both strong standalone problem-solving ability and the capability to process diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL consistently improves MAD performance across diverse debate protocols and agent configurations, while simultaneously strengthening single-model reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SDRL adds a self-debate loop to RLVR so one model practices both solo solving and synthesizing multiple rationales, but the gains may not fully transfer to debates between truly independent models.

The main point is that SDRL has a model sample several candidate solutions to a prompt, build a debate context from them, generate a second-turn response conditioned on that context, and then jointly optimize both the initial and conditioned outputs with verifiable rewards. This is a direct extension of standard RLVR, which trains only on isolated solutions, and the paper shows it improves both single-model accuracy and multi-agent debate results across several base models and benchmarks.
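To make "jointly optimize with verifiable rewards" concrete, here is a small sketch of how a binary correctness reward and per-group normalized advantages could be computed for the two response groups. Exact-match checking and the mean/standard-deviation normalization are assumptions chosen for illustration (this style of normalization is common in RLVR pipelines); the source does not spell out the paper's actual advantage estimator, and all names below are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def verifiable_reward(predicted: str, gold: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


gold = "4"
initial_answers = ["4", "5", "4", "3"]  # first-turn samples for the original prompt
debate_answers = ["4", "4"]             # second-turn samples for the debate prompt

# Each group is normalized on its own because the two groups condition on different prompts.
initial_adv = group_advantages([verifiable_reward(a, gold) for a in initial_answers])
debate_adv = group_advantages([verifiable_reward(a, gold) for a in debate_answers])
```

In this toy case every debate-conditioned answer is correct, so that group's advantages are all zero and it would contribute no gradient signal under this particular normalization.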

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Self-Debate Reinforcement Learning (SDRL), a training framework in which an LLM first samples multiple candidate solutions to a given prompt, constructs a debate context from these diverse reasoning paths, and then generates second-turn responses conditioned on the context. Both the initial and debate-conditioned responses are jointly optimized via reinforcement learning with verifiable rewards. The central claim is that this self-debate procedure equips the model with stronger standalone reasoning while also improving its ability to synthesize rationales in actual Multi-Agent Debate (MAD) settings. Experiments across multiple base models and reasoning benchmarks are reported to show consistent MAD gains across diverse protocols and agent configurations, alongside improved single-model performance.

Significance. If the empirical results hold under rigorous controls, the work would address a genuine gap between isolated RLVR training and collaborative reasoning, offering a practical route to models that are effective both alone and as debate participants. The joint optimization of initial and conditioned responses is a clean design choice that directly targets the dual objective. However, because the approach is purely empirical with no parameter-free derivations, formal guarantees, or machine-checked proofs, its significance depends entirely on the strength and reproducibility of the experimental evidence.

major comments (3)

[Abstract] Abstract: the claim that SDRL 'consistently improves MAD performance across diverse debate protocols and agent configurations' is presented without any mention of baselines, effect sizes, statistical tests, or ablation controls. This omission makes it impossible to isolate the contribution of the self-debate component from generic RLVR gains.
[Method] Method (self-debate construction): sampling multiple candidates from a single model to build the debate context does not automatically guarantee the heterogeneity of error patterns and reasoning styles that arise when independent base models debate. The central transfer claim therefore requires explicit cross-model controls (e.g., evaluating SDRL-trained agents in mixed-model MAD settings) that are not described.
[Experiments] Experiments: the reported robustness 'across agent configurations' is load-bearing for the claim that SDRL prepares models for genuine multi-agent interaction, yet no details are given on whether configurations include distinct base models or only varied sampling temperatures within one model.
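The distinction the referee draws between homogeneous and heterogeneous agent groups can be made explicit when enumerating configurations. The sketch below is hypothetical: the model names are placeholders and the helper `enumerate_configs` is not from the paper.

```python
from itertools import combinations_with_replacement
from typing import List, Tuple


def enumerate_configs(models: List[str], n_agents: int) -> List[Tuple[Tuple[str, ...], str]]:
    """List every unordered agent group of size n_agents and label its composition."""
    configs = []
    for combo in combinations_with_replacement(models, n_agents):
        kind = "homogeneous" if len(set(combo)) == 1 else "heterogeneous"
        configs.append((combo, kind))
    return configs


# Placeholder model names, used only to show the labeling.
configs = enumerate_configs(["model-A", "model-B"], n_agents=3)
for combo, kind in configs:
    print(kind, combo)
```

Reporting results separately for the two labels would show directly whether gains persist when the debate partners are distinct base models.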

minor comments (1)

[Abstract] The acronym SDRL is introduced in parentheses in the abstract but the full expansion appears only later; consistent first-use expansion would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity around baselines, diversity mechanisms, and experimental details. We address each point below and have revised the manuscript accordingly.

Referee: [Abstract] Abstract: the claim that SDRL 'consistently improves MAD performance across diverse debate protocols and agent configurations' is presented without any mention of baselines, effect sizes, statistical tests, or ablation controls. This omission makes it impossible to isolate the contribution of the self-debate component from generic RLVR gains.

Authors: We agree that the abstract should more explicitly contextualize the results. The main body already reports comparisons against standard RLVR baselines, quantitative improvements, and ablations that help isolate the self-debate component. In the revised version we have updated the abstract to reference these baseline comparisons and to note the magnitude of gains and statistical testing performed in the experiments.

Revision: yes

Referee: [Method] Method (self-debate construction): sampling multiple candidates from a single model to build the debate context does not automatically guarantee the heterogeneity of error patterns and reasoning styles that arise when independent base models debate. The central transfer claim therefore requires explicit cross-model controls (e.g., evaluating SDRL-trained agents in mixed-model MAD settings) that are not described.

Authors: We acknowledge that intra-model sampling produces diversity through stochastic variation rather than architectural or training differences. Our design intentionally uses this controllable source of diversity to train the model to synthesize multiple rationales. The reported MAD gains are measured in both homogeneous and heterogeneous agent settings; we have added an explicit description of the mixed-model evaluations already present in the experimental results to make the transfer evidence clearer.

Revision: partial

Referee: [Experiments] Experiments: the reported robustness 'across agent configurations' is load-bearing for the claim that SDRL prepares models for genuine multi-agent interaction, yet no details are given on whether configurations include distinct base models or only varied sampling temperatures within one model.

Authors: We have expanded the experiments section to explicitly enumerate the agent configurations. These include both temperature-varied sampling within a single base model and settings that combine distinct base models. Revised text and tables now specify which protocols and results correspond to homogeneous versus heterogeneous agent groups.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training procedure without self-referential derivations

The paper presents SDRL as an empirical training framework: sample multiple candidate solutions from a single LLM, construct a debate context from those trajectories, generate conditioned second-turn responses, and jointly optimize both initial and debate-conditioned outputs via RLVR-style rewards. No equations, uniqueness theorems, or derivations are described that reduce the claimed MAD improvements or standalone gains to quantities fitted or defined in terms of the method itself. Central claims rest on experimental results across base models and benchmarks rather than a closed logical chain. No load-bearing self-citations or ansatzes imported from prior author work appear in the method description. The approach is therefore self-contained against external benchmarks and receives a non-finding.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method rests on standard RLVR assumptions plus the untested premise that self-generated debate contexts transfer to real multi-agent settings. No new entities are postulated. Specific training hyperparameters such as number of sampled paths and reward weighting are implicit free parameters.

free parameters (2)

number of candidate solutions sampled per prompt
The framework samples multiple solutions to build debate context; the exact count is a design choice that affects training signal diversity.
reward weighting between initial and debate-conditioned responses
Joint optimization requires balancing the two response types; the weighting is not specified in the abstract.
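These two free parameters can be pictured as a small configuration object feeding a weighted objective. The sketch is an assumption-laden illustration: the field names, default values, and the additive weighting in `joint_loss` are hypothetical, since the abstract does not specify how the two response types are balanced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfDebateConfig:
    n_candidates: int = 4       # candidate solutions sampled per prompt (assumed default)
    debate_weight: float = 1.0  # weight on the debate-conditioned term (assumed default)


def joint_loss(initial_loss: float, debate_loss: float, cfg: SelfDebateConfig) -> float:
    """Combine the first-turn and debate-conditioned losses with a scalar weight."""
    return initial_loss + cfg.debate_weight * debate_loss


cfg = SelfDebateConfig(n_candidates=4, debate_weight=0.5)
total = joint_loss(initial_loss=2.0, debate_loss=4.0, cfg=cfg)
```

Setting `debate_weight` to zero recovers training on isolated solutions only, which makes the parameter a natural ablation knob.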

axioms (1)

Domain assumption: Verifiable rewards exist for the reasoning benchmarks used
The method builds on RLVR which requires clear correctness signals for training.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "we model multi-agent debate as Bayesian belief updating and recover their key implication that standard MAD induces a martingale over each agent’s belief"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chan, C.-M., Chen, W., Su, Y ., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. Chateval: Towards better llm-based evaluators through multi-agent debate.arXiv preprint arXiv:2308.07201,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Chen, M., Sun, L., Li, T., Sun, H., Zhou, Y ., Zhu, C., Wang, H., Pan, J. Z., Zhang, W., Chen, H., et al. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025a. Chen, X., Li, G., Wang, Z., Jin, B., Qian, C., Wang, Y ., Wang, H., Zhang, Y ., Zhang, D., Zhang, T., et al. Rm-r1: Reward modeling as reasonin...

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

K., Zhu, X., and Li, S

Choi, H. K., Zhu, X., and Li, S. Debate or vote: Which yields better decisions in multi-agent large language mod- els?arXiv preprint arXiv:2508.17536,

work page arXiv

[4] [4]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Cui, G., Zhang, Y ., Chen, J., Yuan, L., Wang, Z., Zuo, Y ., Li, H., Fan, Y ., Chen, H., Chen, W., et al. The entropy mech- anism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Gou, Z., Shao, Z., Gong, Y ., Shen, Y ., Yang, Y ., Duan, N., and Chen, W. Critic: Large language models can self- correct with tool-interactive critiquing.arXiv preprint arXiv:2305.11738,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025a. Guo, J., Chi, Z., Dong, L., Dong, Q., Wu, X., Huang, S., and Wei, F. Reward reasoning model.arXiv preprint arXiv:2505.14674, 2025b....

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

LLM Multi-Agent Systems: Challenges and Open Problems

Han, S., Zhang, Q., Yao, Y ., Jin, W., and Xu, Z. Llm multi- agent systems: Challenges and open problems.arXiv preprint arXiv:2402.03578,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., and Han, J. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

M., Markeeva, L., Heyward, J., Banino, A., Mathur, M., Pascanu, R., Osindero, S., De Martino, B., Velickovic, P., et al

Kumaran, D., Fleming, S. M., Markeeva, L., Heyward, J., Banino, A., Mathur, M., Pascanu, R., Osindero, S., De Martino, B., Velickovic, P., et al. How overconfidence in initial choices and underconfidence under criticism modulate change of mind in large language models.arXiv preprint arXiv:2507.03120,

work page arXiv

[11] [11]

Li, A. O. and Goyal, T. Off-trajectory reasoning: Can llms collaborate on reasoning trajectory?arXiv preprint arXiv:2510.06410,

work page arXiv

[12] [12]

A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024a

Li, X., Wang, S., Zeng, S., Wu, Y ., and Yang, Y . A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024a. Li, Y ., Du, Y ., Zhang, J., Hou, L., Grabowski, P., Li, Y ., and Ie, E. Improving multi-agent debate with sparse com- munication topology.arXiv preprint arXiv:2406.11776, 2024b. Liang, T., He,...

[13]

Marft: Multi-agent reinforcement fine-tuning, 2025

Liao, J., Wen, M., Wang, J., and Zhang, W. Marft: Multi-agent reinforcement fine-tuning. arXiv preprint arXiv:2504.16129,

[14]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025a. Liu, B., Guertler, L., Yu, S., Liu, Z., Qi, P., Balcells, D., Liu, M., Tan, C., Shi, W., Lin, M., et al. Spiral: Self-play on zero-sum games incentiv...

[15]

Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards

Liu, X., Liang, T., He, Z., Xu, J., Wang, W., He, P., Tu, Z., Mi, H., and Yu, D. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards. arXiv preprint arXiv:2505.13445, 2025g. Ma, C., Zhang, E., Zhao, Y., Liu, W., Jia, Y., Qing, P., Shi, L., Cohan, A., Yan, Y., and Vosoughi, S. Judging with many minds: Do ...

[16]

Learning to reason across parallel samples for llm reasoning

Qi, J., Ye, X., Tang, H., Zhu, Z., and Choi, E. Learning to reason across parallel samples for llm reasoning. arXiv preprint arXiv:2506.09014,

[17]

ToolRL: Reward is All Tool Learning Needs

Qian, C., Acikgoz, E. C., He, Q., Wang, H., Chen, X., Hakkani-Tür, D., Tur, G., and Ji, H. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958,

[18]

Qwen2.5 Technical Report

URL https://arxiv.org/abs/2412.15115. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[19]

HybridFlow: A Flexible and Efficient RLHF Framework

Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256,

[20]

Should we be going mad? A look at multi-agent debate strategies for llms

Smit, A., Duckworth, P., Grinsztajn, N., Barrett, T. D., and Pretorius, A. Should we be going mad? A look at multi-agent debate strategies for llms. arXiv preprint arXiv:2311.17371,

[21]

Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

Talebirad, Y. and Nadiri, A. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314,

[22]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Wang, S., Yu, L., Gao, C., Zheng, C., Liu, S., Lu, R., Dang, K., Chen, X., Yang, J., Zhang, Z., et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025a. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consiste...

[23]

Llava-critic-r1: Your critic model is secretly a strong policy model

Wang, X., Li, C., Yang, J., Zhang, K., Liu, B., Xiong, T., and Huang, F. Llava-critic-r1: Your critic model is secretly a strong policy model. arXiv preprint arXiv:2509.00676, 2025b. Xiong, T., Ge, Y., Li, M., Zhang, Z., Kulkarni, P., Wang, K., He, Q., Zhu, Z., Liu, C., Chen, R., et al. Multi-crit: Benchmarking multimodal judges on pluralistic criteria- f...

[24]

From debate to equilibrium: Belief-driven multi-agent llm reasoning via bayesian nash equilibrium

Yi, X., Zhou, Z., Cao, C., Niu, Q., Liu, T., and Han, B. From debate to equilibrium: Belief-driven multi-agent llm reasoning via bayesian nash equilibrium. arXiv preprint arXiv:2506.08292,

[25]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

[26]

Incentivizing llms to self-verify their answers

Zhang, F., Xu, J., Wang, C., Cui, C., Liu, Y., and An, B. Incentivizing llms to self-verify their answers. arXiv preprint arXiv:2506.01369, 2025a. Zhang, K., Liu, R., Zhu, X., Tian, K., Zeng, S., Jia, G., Fan, Y., Lv, X., Zuo, Y., Jiang, C., Liu, Z., Wang, J., Wang, Y., Zhao, R., Hua, E., Wang, Y., Wang, S., Gao, J., Long, X., Sun, Y., Ma...

[27]

We use 200 prompt generation steps, corresponding to 3,200 policy update steps for the DAPO baseline

The training mini-batch size is 128, corresponding to 16 gradient updates per rollout step. We use 200 prompt generation steps, corresponding to 3,200 policy update steps for the DAPO baseline. SDRL has more policy update steps due to the additional debate training. To improve training efficiency, at each prompt generation step we randomly sample 128 prompt...

[28]

(2025), and extends it by incorporating a private critique pseudo-count vector β (Definition 4.1)

Our analysis follows the DCM-based treatment in Choi et al. (2025), and extends it by incorporating a private critique pseudo-count vector β (Definition 4.1).
E.1. Notation and standing assumptions
Fix an input question q and a finite answer set A = {1, ..., K}. Without loss of generality, index 1 is the correct answer. For each agent i and round t, let αi,...

[29]

+ (-2 + 3k)\n \\]\n\n2. **Removing Parentheses:**\n\n \\[\n -k + 4 - 2 + 3k\n \\]\n\n - \\(-k\\) from the first parentheses.\n - \\(+4\\) from the first parentheses.\n - \\(+(-2)\\) from the second parentheses (which is \\(-2\\)).\n - \\(+3k\\) from the second parentheses.\n\n So, combining them:\n\n \\[\n -k + 4 - 2 + 3k\n \\]\n\n3. **Combining Like Term...

[30]

rise-then-fall

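Then there exists an index $j \neq 1$ such that $\hat{p}_j \ge \hat{p}_1$. Consider the $\ell_1$ distance: $\|\hat{p} - p\|_1 = \sum_{k=1}^{K} |\hat{p}_k - p_k| \ge |\hat{p}_1 - p_1| + |\hat{p}_j - p_j|$. Using the inequality $|a| + |b| \ge |a - b|$, we obtain $|\hat{p}_1 - p_1| + |\hat{p}_j - p_j| \ge |(\hat{p}_1 - p_1) - (\hat{p}_j - p_j)| = |(\hat{p}_1 - \hat{p}_j) - (p_1 - p_j)|$. (38) Since $\hat{p}_j \ge \hat{p}_1$, we have $\hat{p}_1 - \hat{p}_j \le 0$, hence $(\hat{p}_1 - \hat{p}_j) - (p_1 - p_j) \le -(p_1 - p_j)$. Taking absolute values yie...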
