arxiv: 2601.22297 · v2 · pith:44QNB6TKnew · submitted 2026-01-29 · 💻 cs.CL

Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

Pith reviewed 2026-05-21 14:22 UTC · model grok-4.3

classification 💻 cs.CL
keywords self-debate · reinforcement learning · multi-agent debate · LLM reasoning · collaborative reasoning · RLVR · reasoning models

The pith

Training LLMs on self-generated debates strengthens both solo reasoning and multi-agent collaboration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning for reasoning has focused on isolated problem solving, leaving models unprepared to weigh conflicting or complementary lines of thought from others. The paper shows that a model can instead sample several of its own candidate solutions to a prompt, assemble them into a simulated debate, and then produce a follow-up response that draws on that context. Joint optimization of the original answer and the debate-informed answer yields a single model that solves problems more accurately on its own and also extracts more value when placed inside actual multi-agent debates. A reader should care because the approach removes the need to train or maintain separate models just to obtain the benefits of collaboration.

Core claim

Self-Debate Reinforcement Learning first draws multiple candidate solutions from the model for a given prompt, constructs a debate context that contains these distinct reasoning paths, and then generates second-turn responses conditioned on the context. Jointly optimizing both the initial responses and the debate-conditioned responses produces a model that is effective as a standalone solver and as a participant in multi-agent debate.

What carries the argument

Self-Debate Reinforcement Learning (SDRL), the procedure of sampling diverse solution trajectories from one model, packaging them as debate context, and jointly optimizing first-turn and context-conditioned outputs.
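
The three steps named above (sample candidates, package them as debate context, generate context-conditioned responses) can be sketched as a data-collection routine. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate`, `verify`, the prompt wording in `build_debate_prompt`, the default counts, and the random choice of which candidates enter the context are all hypothetical placeholders, since the page does not specify them.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    prompt: str      # text the policy was conditioned on
    response: str    # text the policy produced
    reward: float    # verifiable reward, e.g. 1.0 if correct else 0.0
    turn: int        # 1 = initial response, 2 = debate-conditioned response


def build_debate_prompt(question: str, candidates: List[str]) -> str:
    """Pack the question and candidate solutions into one second-turn prompt.

    The wording is a placeholder; the paper's actual prompt is its Figure 2.
    """
    lines = [f"Question: {question}", "Candidate solutions:"]
    lines += [f"[Agent {i + 1}] {c}" for i, c in enumerate(candidates)]
    lines.append("Weigh these solutions and give your final answer.")
    return "\n".join(lines)


def sdrl_rollout(
    question: str,
    generate: Callable[[str], str],   # hypothetical policy sampler
    verify: Callable[[str], float],   # hypothetical verifiable-reward function
    n_candidates: int = 4,            # assumed value, not from the paper
    n_context: int = 3,               # assumed value, not from the paper
    n_second: int = 2,                # assumed value, not from the paper
    rng: Optional[random.Random] = None,
) -> List[Sample]:
    """Collect first-turn and debate-conditioned samples for one prompt."""
    rng = rng or random.Random(0)
    # Step 1: sample several candidate solutions from the same model.
    first = [generate(question) for _ in range(n_candidates)]
    samples = [Sample(question, r, verify(r), turn=1) for r in first]
    # Step 2: build a debate context (random selection is an assumption).
    context = rng.sample(first, k=min(n_context, len(first)))
    debate_prompt = build_debate_prompt(question, context)
    # Step 3: generate second-turn responses conditioned on that context.
    for _ in range(n_second):
        reply = generate(debate_prompt)
        samples.append(Sample(debate_prompt, reply, verify(reply), turn=2))
    return samples  # both turns go into the same policy-update batch
```

Returning both turns in one list mirrors the joint optimization described above: a single policy update sees first-turn and debate-conditioned samples together.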

If this is right

  • SDRL raises Multi-Agent Debate accuracy across different debate protocols and numbers of agents.
  • The same training simultaneously lifts the model's standalone performance on reasoning benchmarks.
  • The gains appear consistently for several base models and across multiple reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A model prepared this way may cope better with noisy or contradictory information supplied at inference time.
  • Self-debate offers a way to generate collaborative training data without requiring access to multiple separate models.
  • Applying the method iteratively over several rounds of self-debate could further increase robustness on long-horizon tasks.

Load-bearing premise

That the range of solutions a single model can sample on its own produces debate contexts whose diversity and quality match those that would arise among truly independent models.

What would settle it

If a model trained with SDRL shows no improvement, or even a drop, in actual multi-agent debates run with other distinct LLMs on the same benchmarks, the claim that self-debate prepares models for collaboration would be falsified.
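
A test of this kind needs a debate harness that accepts distinct agents. The following is a minimal sketch, assuming a decentralized protocol in which every agent sees all peers' previous answers and the group answer is a majority vote. The `Agent` callable type, the function names, and the vote rule are illustrative assumptions; the paper's evaluation protocols may differ.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical agent interface: (question, peer answers) -> answer.
Agent = Callable[[str, List[str]], str]


def run_debate(question: str, agents: List[Agent], rounds: int) -> List[str]:
    """Run a decentralized debate and return each agent's final answer."""
    # Round 0: every agent answers alone, with no peer context.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # All agents update simultaneously from the previous round's answers.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return answers


def majority_vote(answers: List[str]) -> str:
    """Pick the most common answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def debate_accuracy(
    dataset: List[Tuple[str, str]], agents: List[Agent], rounds: int
) -> float:
    """Fraction of (question, gold answer) pairs the group gets right."""
    correct = sum(
        majority_vote(run_debate(q, agents, rounds)) == gold
        for q, gold in dataset
    )
    return correct / len(dataset)
```

Passing agents backed by different base models gives the mixed-model setting; passing copies of one model gives the homogeneous one. Comparing `debate_accuracy` for SDRL-trained and baseline agents in both settings is the comparison the paragraph above describes.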

Figures

Figures reproduced from arXiv: 2601.22297 by Chenxi Liu, Heng Huang, Ruibo Chen, Tianyi Xiong, Tong Zheng, Yanshuo Chen.

Figure 1: Debate performance of 5 agents in different debate rounds under the decentralized MAD setting.
Figure 2: Prompt for debate training.
Figure 3: Prompt for Multi-Agent evaluation. A three-agent debate is shown for brevity.
Adjacent appendix text (D. Additional Experiments, D.1. Training Dynamics of SDRL): To better understand SDRL training dynamics, we visualize (i) the number of constructed debate prompts and (ii) the number of debate-conditioned responses used at each prompt-generation step on the training set. The results are shown in …
Figure 4: Training dynamics of debate-related values using SDRL.
Figure 5: Initial responses on MATH500 with 5 agents.
Figure 6: Responses after one round of debate on MATH500 with 5 agents.
Adjacent appendix text (Critique-augmented update and constant mass): The belief update is (Definition 4.1)
\[ \alpha_{i,t} = \alpha_{i,t-1} + \beta_{i,t-1} + w_i c_{i,t}, \qquad \beta_{i,t-1} \in \mathbb{R}^K_+, \quad w_i \ge 0. \tag{16} \]
Throughout the proofs we use the constant-mass assumption
\[ \|\beta_{i,t}\|_1 = m_\beta \quad \text{for all } i, t, \tag{17} \]
which corresponds to adding a fixed "budget" of private pseudo-observations per round. Why assume constant \(\|\beta\|_1\). …
Figure 7: Responses after two rounds of debate on MATH500 with 5 agents.
Adjacent appendix text (Critique advantage): Recall Definition 4.3:
\[ \delta_{i,t} := \mathbb{E}\big[\beta^{(1)}_{i,t} \mid \mathcal{F}_t\big] - m_\beta\, p_{i,t}. \tag{18} \]
E.2. One-step drift decomposition. Lemma E.1. Under (8)–(9) and \(\|\beta_{i,t-1}\|_1 = m_\beta\),
\[ \mathbb{E}[p_{i,t} \mid \mathcal{F}_{t-1}] = p_{i,t-1} + \frac{\delta_{i,t-1}}{Z_{i,t-1}} + \frac{w_i\,|N(i)|}{Z_{i,t-1}} \cdots \]
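
The belief update (16) and the constant-mass assumption (17) quoted with Figure 6 can be made concrete with a small numeric example. This is a sketch under assumptions: the source does not define \(c_{i,t}\) on this page, so the code treats it as a vector of peer answer counts (named `peer_counts` here), and it reads the agent's belief in each answer as the normalized pseudo-count. All numbers are invented for illustration.

```python
from typing import List


def belief_update(
    alpha_prev: List[float],   # pseudo-counts alpha_{i,t-1}, one per answer
    beta_prev: List[float],    # private pseudo-observations beta_{i,t-1} >= 0
    peer_counts: List[float],  # c_{i,t}; ASSUMED to be peer answer counts
    w: float,                  # weight w_i >= 0 on peer evidence
) -> List[float]:
    """Apply alpha_t = alpha_{t-1} + beta_{t-1} + w * c_t componentwise."""
    if w < 0 or any(b < 0 for b in beta_prev):
        raise ValueError("w and every entry of beta must be non-negative")
    return [a + b + w * c for a, b, c in zip(alpha_prev, beta_prev, peer_counts)]


def mean_belief(alpha: List[float]) -> List[float]:
    """Normalize pseudo-counts into a probability vector over answers."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Three candidate answers; index 0 plays the role of the correct answer.
alpha_prev = [2.0, 1.0, 1.0]
beta_prev = [0.5, 0.25, 0.25]    # sums to m_beta = 1.0 (constant mass)
peer_counts = [2.0, 1.0, 0.0]    # two peers chose answer 0, one chose answer 1
alpha_next = belief_update(alpha_prev, beta_prev, peer_counts, w=1.0)

print(mean_belief(alpha_prev)[0])  # belief in the correct answer before: 0.5
print(mean_belief(alpha_next)[0])  # belief in the correct answer after: 0.5625
```

In this example the private pseudo-observations add a fixed mass of 1.0 and the peer counts add a mass of 3.0, so the total pseudo-count grows from 4.0 to 8.0 while the share on the correct answer rises from 0.5 to 0.5625.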
Original abstract

The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework where models learn from self-debate, equipping a single LLM with both strong standalone problem-solving ability and the capability to process diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL consistently improves MAD performance across diverse debate protocols and agent configurations, while simultaneously strengthening single-model reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Self-Debate Reinforcement Learning (SDRL), a training framework in which an LLM first samples multiple candidate solutions to a given prompt, constructs a debate context from these diverse reasoning paths, and then generates second-turn responses conditioned on the context. Both the initial and debate-conditioned responses are jointly optimized via reinforcement learning with verifiable rewards. The central claim is that this self-debate procedure equips the model with stronger standalone reasoning while also improving its ability to synthesize rationales in actual Multi-Agent Debate (MAD) settings. Experiments across multiple base models and reasoning benchmarks are reported to show consistent MAD gains across diverse protocols and agent configurations, alongside improved single-model performance.

Significance. If the empirical results hold under rigorous controls, the work would address a genuine gap between isolated RLVR training and collaborative reasoning, offering a practical route to models that are effective both alone and as debate participants. The joint optimization of initial and conditioned responses is a clean design choice that directly targets the dual objective. However, because the approach is purely empirical with no parameter-free derivations, formal guarantees, or machine-checked proofs, its significance depends entirely on the strength and reproducibility of the experimental evidence.

major comments (3)
  1. [Abstract] Abstract: the claim that SDRL 'consistently improves MAD performance across diverse debate protocols and agent configurations' is presented without any mention of baselines, effect sizes, statistical tests, or ablation controls. This omission makes it impossible to isolate the contribution of the self-debate component from generic RLVR gains.
  2. [Method] Method (self-debate construction): sampling multiple candidates from a single model to build the debate context does not automatically guarantee the heterogeneity of error patterns and reasoning styles that arise when independent base models debate. The central transfer claim therefore requires explicit cross-model controls (e.g., evaluating SDRL-trained agents in mixed-model MAD settings) that are not described.
  3. [Experiments] Experiments: the reported robustness 'across agent configurations' is load-bearing for the claim that SDRL prepares models for genuine multi-agent interaction, yet no details are given on whether configurations include distinct base models or only varied sampling temperatures within one model.
minor comments (1)
  1. [Abstract] The acronym SDRL is introduced in parentheses in the abstract but the full expansion appears only later; consistent first-use expansion would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity around baselines, diversity mechanisms, and experimental details. We address each point below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that SDRL 'consistently improves MAD performance across diverse debate protocols and agent configurations' is presented without any mention of baselines, effect sizes, statistical tests, or ablation controls. This omission makes it impossible to isolate the contribution of the self-debate component from generic RLVR gains.

    Authors: We agree that the abstract should more explicitly contextualize the results. The main body already reports comparisons against standard RLVR baselines, quantitative improvements, and ablations that help isolate the self-debate component. In the revised version we have updated the abstract to reference these baseline comparisons and to note the magnitude of gains and statistical testing performed in the experiments. revision: yes

  2. Referee: [Method] Method (self-debate construction): sampling multiple candidates from a single model to build the debate context does not automatically guarantee the heterogeneity of error patterns and reasoning styles that arise when independent base models debate. The central transfer claim therefore requires explicit cross-model controls (e.g., evaluating SDRL-trained agents in mixed-model MAD settings) that are not described.

    Authors: We acknowledge that intra-model sampling produces diversity through stochastic variation rather than architectural or training differences. Our design intentionally uses this controllable source of diversity to train the model to synthesize multiple rationales. The reported MAD gains are measured in both homogeneous and heterogeneous agent settings; we have added explicit description of the mixed-model evaluations already present in the experimental results to make the transfer evidence clearer. revision: partial

  3. Referee: [Experiments] Experiments: the reported robustness 'across agent configurations' is load-bearing for the claim that SDRL prepares models for genuine multi-agent interaction, yet no details are given on whether configurations include distinct base models or only varied sampling temperatures within one model.

    Authors: We have expanded the experiments section to explicitly enumerate the agent configurations. These include both temperature-varied sampling within a single base model and settings that combine distinct base models. Revised text and tables now specify which protocols and results correspond to homogeneous versus heterogeneous agent groups. revision: yes
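
The referee's request for effect sizes and statistical tests can be illustrated with a paired bootstrap over per-question correctness. This is a generic sketch, not the test the authors report using: the function name, the 0/1 correctness encoding, and the resampling settings are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    baseline: List[int],      # 1 if the baseline debate got question k right
    treated: List[int],       # 1 if the SDRL debate got question k right
    n_resamples: int = 2000,  # illustrative default
    alpha: float = 0.05,      # 95% interval by default
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy difference, CI lower bound, CI upper bound)."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty result lists of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    # Pairing by question removes variance from question difficulty.
    diffs = [t - b for b, t in zip(baseline, treated)]
    point = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper
```

Running this separately for homogeneous and heterogeneous agent groups would report the gain in each setting with its own interval, rather than a single pooled number.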

Circularity Check

0 steps flagged

No circularity: empirical training procedure without self-referential derivations

full rationale

The paper presents SDRL as an empirical training framework: sample multiple candidate solutions from a single LLM, construct a debate context from those trajectories, generate conditioned second-turn responses, and jointly optimize both initial and debate-conditioned outputs via RLVR-style rewards. No equations, uniqueness theorems, or derivations are described that reduce the claimed MAD improvements or standalone gains to quantities fitted or defined in terms of the method itself. Central claims rest on experimental results across base models and benchmarks rather than a closed logical chain. No load-bearing self-citations or ansatzes imported from prior author work appear in the method description. The approach is therefore self-contained against external benchmarks and receives a non-finding.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The method rests on standard RLVR assumptions plus the untested premise that self-generated debate contexts transfer to real multi-agent settings. No new entities are postulated. Specific training hyperparameters such as number of sampled paths and reward weighting are implicit free parameters.

free parameters (2)
  • number of candidate solutions sampled per prompt
    The framework samples multiple solutions to build debate context; the exact count is a design choice that affects training signal diversity.
  • reward weighting between initial and debate-conditioned responses
    Joint optimization requires balancing the two response types; the weighting is not specified in the abstract.
axioms (1)
  • domain assumption: Verifiable rewards exist for the reasoning benchmarks used
    The method builds on RLVR which requires clear correctness signals for training.
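
The two free parameters in the ledger can be seen in a sketch of how joint optimization might weight the two response types. The abstract does not specify the objective, so everything here is an assumption: the code swaps in group-normalized advantages, a common choice in RLVR-style training, and introduces a hypothetical `debate_weight` knob for the balance between first-turn and debate-conditioned responses.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_advantages(
    initial_rewards: List[float],  # rewards of first-turn responses
    debate_rewards: List[float],   # rewards of debate-conditioned responses
    debate_weight: float = 1.0,    # HYPOTHETICAL weighting, not from the paper
) -> Tuple[List[float], List[float]]:
    """Normalize each response group separately, then weight the debate group."""
    init_adv = group_advantages(initial_rewards)
    debate_adv = [debate_weight * a for a in group_advantages(debate_rewards)]
    return init_adv, debate_adv
```

Note one consequence of this particular choice: a group whose responses all receive the same reward gets zero advantage, so a debate prompt on which every second-turn response is correct (or every one is wrong) contributes no gradient signal.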

pith-pipeline@v0.9.0 · 5738 in / 1226 out tokens · 41613 ms · 2026-05-21T14:22:49.924567+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What the tags mean:

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 15 internal anchors

  1. [1]

    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201,

  2. [2]

    ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

    Chen, M., Sun, L., Li, T., Sun, H., Zhou, Y., Zhu, C., Wang, H., Pan, J. Z., Zhang, W., Chen, H., et al. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025a. Chen, X., Li, G., Wang, Z., Jin, B., Qian, C., Wang, Y., Wang, H., Zhang, Y., Zhang, D., Zhang, T., et al. Rm-r1: Reward modeling as reasonin...

  3. [3]

    Debate or vote: Which yields better decisions in multi-agent large language models?

    Choi, H. K., Zhu, X., and Li, S. Debate or vote: Which yields better decisions in multi-agent large language models? arXiv preprint arXiv:2508.17536,

  4. [4]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H., Fan, Y., Chen, H., Chen, W., et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617,

  5. [5]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Duan, N., and Chen, W. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738,

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025a. Guo, J., Chi, Z., Dong, L., Dong, Q., Wu, X., Huang, S., and Wei, F. Reward reasoning model. arXiv preprint arXiv:2505.14674, 2025b....

  7. [7]

    LLM Multi-Agent Systems: Challenges and Open Problems

    Han, S., Zhang, Q., Yao, Y., Jin, W., and Xu, Z. Llm multi-agent systems: Challenges and open problems. arXiv preprint arXiv:2402.03578,

  8. [8]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

  9. [9]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., and Han, J. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516,

  10. [10]

    How overconfidence in initial choices and underconfidence under criticism modulate change of mind in large language models

    Kumaran, D., Fleming, S. M., Markeeva, L., Heyward, J., Banino, A., Mathur, M., Pascanu, R., Osindero, S., De Martino, B., Velickovic, P., et al. How overconfidence in initial choices and underconfidence under criticism modulate change of mind in large language models. arXiv preprint arXiv:2507.03120,

  11. [11]

    Li, A. O. and Goyal, T. Off-trajectory reasoning: Can llms collaborate on reasoning trajectory? arXiv preprint arXiv:2510.06410,

  12. [12]

    A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges

    Li, X., Wang, S., Zeng, S., Wu, Y., and Yang, Y. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth, 1(1):9, 2024a. Li, Y., Du, Y., Zhang, J., Hou, L., Grabowski, P., Li, Y., and Ie, E. Improving multi-agent debate with sparse communication topology. arXiv preprint arXiv:2406.11776, 2024b. Liang, T., He,...

  13. [13]

    Marft: Multi-agent reinforcement fine-tuning

    Liao, J., Wen, M., Wang, J., and Zhang, W. Marft: Multi-agent reinforcement fine-tuning. arXiv preprint arXiv:2504.16129,

  14. [14]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025a. Liu, B., Guertler, L., Yu, S., Liu, Z., Qi, P., Balcells, D., Liu, M., Tan, C., Shi, W., Lin, M., et al. Spiral: Self-play on zero-sum games incentiv...

  15. [15]

    Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards

    Liu, X., Liang, T., He, Z., Xu, J., Wang, W., He, P., Tu, Z., Mi, H., and Yu, D. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards. arXiv preprint arXiv:2505.13445, 2025g. Ma, C., Zhang, E., Zhao, Y., Liu, W., Jia, Y., Qing, P., Shi, L., Cohan, A., Yan, Y., and Vosoughi, S. Judging with many minds: Do ...

  16. [16]

    Learning to reason across parallel samples for llm reasoning

    Qi, J., Ye, X., Tang, H., Zhu, Z., and Choi, E. Learning to reason across parallel samples for llm reasoning. arXiv preprint arXiv:2506.09014,

  17. [17]

    ToolRL: Reward is All Tool Learning Needs

    Qian, C., Acikgoz, E. C., He, Q., Wang, H., Chen, X., Hakkani-Tür, D., Tur, G., and Ji, H. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958,

  18. [18]

    Qwen2.5 Technical Report

    URL https://arxiv.org/abs/2412.15115. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  19. [19]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256,

  20. [20]

    Should we be going MAD? A look at multi-agent debate strategies for LLMs

    Smit, A., Duckworth, P., Grinsztajn, N., Barrett, T. D., and Pretorius, A. Should we be going mad? A look at multi-agent debate strategies for llms. arXiv preprint arXiv:2311.17371,

  21. [21]

    Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

    Talebirad, Y. and Nadiri, A. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314,

  22. [22]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Wang, S., Yu, L., Gao, C., Zheng, C., Liu, S., Lu, R., Dang, K., Chen, X., Yang, J., Zhang, Z., et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025a. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consiste...

  23. [23]

    Llava-critic-r1: Your critic model is secretly a strong policy model

    Wang, X., Li, C., Yang, J., Zhang, K., Liu, B., Xiong, T., and Huang, F. Llava-critic-r1: Your critic model is secretly a strong policy model. arXiv preprint arXiv:2509.00676, 2025b. Xiong, T., Ge, Y., Li, M., Zhang, Z., Kulkarni, P., Wang, K., He, Q., Zhu, Z., Liu, C., Chen, R., et al. Multi-crit: Benchmarking multimodal judges on pluralistic criteria- f...

  24. [24]

    From debate to equilibrium: Belief-driven multi-agent llm reasoning via bayesian nash equilibrium

    Yi, X., Zhou, Z., Cao, C., Niu, Q., Liu, T., and Han, B. From debate to equilibrium: Belief-driven multi-agent llm reasoning via bayesian nash equilibrium. arXiv preprint arXiv:2506.08292,

  25. [25]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

  26. [26]

    Incentivizing llms to self-verify their answers

    Zhang, F., Xu, J., Wang, C., Cui, C., Liu, Y., and An, B. Incentivizing llms to self-verify their answers. arXiv preprint arXiv:2506.01369, 2025a. Zhang, K., Liu, R., Zhu, X., Tian, K., Zeng, S., Jia, G., Fan, Y., Lv, X., Zuo, Y., Jiang, C., Liu, Z., Wang, J., Wang, Y., Zhao, R., Hua, E., Wang, Y., Wang, S., Gao, J., Long, X., Sun, Y., Ma...

  27. [27]

    We use 200 prompt generation steps, corresponding to 3,200 policy update steps for the DAPO baseline

    The training mini-batch size is 128, corresponding to 16 gradient updates per rollout step. We use 200 prompt generation steps, corresponding to 3,200 policy update steps for the DAPO baseline. SDRL has more policy update steps due to the additional debate training. To improve training efficiency, at each prompt generation step we randomly sample 128 prompt...

  28. [28]

    (2025), and extends it by incorporating a private critique pseudo-count vector β (Definition 4.1)

    Our analysis follows the DCM-based treatment in Choi et al. (2025), and extends it by incorporating a private critique pseudo-count vector \(\beta\) (Definition 4.1). E.1. Notation and standing assumptions. Fix an input question \(q\) and a finite answer set \(A = \{1, \ldots, K\}\). Without loss of generality, index 1 is the correct answer. For each agent \(i\) and round \(t\), let αi,...

  29. [29]

    + (-2 + 3k)
    \]

    2. **Removing Parentheses:**

    \[ -k + 4 - 2 + 3k \]

    - \(-k\) from the first parentheses.
    - \(+4\) from the first parentheses.
    - \(+(-2)\) from the second parentheses (which is \(-2\)).
    - \(+3k\) from the second parentheses.

    So, combining them:

    \[ -k + 4 - 2 + 3k \]

    3. **Combining Like Term...

  30. [30]

    rise-then-fall

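    Then there exists an index \(j \neq 1\) such that \(\hat{p}_j \ge \hat{p}_1\). Consider the \(\ell_1\) distance:
    \[ \|\hat{p} - p\|_1 = \sum_{k=1}^{K} |\hat{p}_k - p_k| \ge |\hat{p}_1 - p_1| + |\hat{p}_j - p_j|. \]
    Using the inequality \(|a| + |b| \ge |a - b|\), we obtain
    \[ |\hat{p}_1 - p_1| + |\hat{p}_j - p_j| \ge |(\hat{p}_1 - p_1) - (\hat{p}_j - p_j)| = |(\hat{p}_1 - \hat{p}_j) - (p_1 - p_j)|. \tag{38} \]
    Since \(\hat{p}_j \ge \hat{p}_1\), we have \(\hat{p}_1 - \hat{p}_j \le 0\), hence \((\hat{p}_1 - \hat{p}_j) - (p_1 - p_j) \le -(p_1 - p_j)\). Taking absolute values yie...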
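
The training budget quoted in entry [27] can be checked with a few lines of arithmetic. The step counts come from that excerpt; the per-step sample count is a derived figure that assumes each mini-batch holds distinct samples and each sample is used once, which the excerpt does not state.

```python
# Values quoted in the training-details excerpt (entry [27]).
MINI_BATCH_SIZE = 128
UPDATES_PER_ROLLOUT_STEP = 16
PROMPT_GENERATION_STEPS = 200

# 200 prompt generation steps x 16 updates each = policy updates for the baseline.
policy_updates = PROMPT_GENERATION_STEPS * UPDATES_PER_ROLLOUT_STEP

# Derived, ASSUMING every sample is used exactly once per rollout step.
samples_per_rollout_step = MINI_BATCH_SIZE * UPDATES_PER_ROLLOUT_STEP

print(policy_updates)            # 3200, matching the quoted 3,200
print(samples_per_rollout_step)  # 2048
```

The excerpt also notes that SDRL performs more policy updates than this baseline figure because of the additional debate training, so 3,200 is a lower bound for SDRL runs.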