pith. sign in

arxiv: 2605.19461 · v1 · pith:IJHYJWCCnew · submitted 2026-05-19 · 💻 cs.AI

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Pith reviewed 2026-05-20 05:27 UTC · model grok-4.3

classification 💻 cs.AI
keywords mode collapsedistribution matchingpolicy optimizationforward KLreasoningcombinatorial optimizationreinforcement learningdiversity
0
0 comments X p. Extension
pith:IJHYJWCC Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{IJHYJWCC}

Prints a linked pith:IJHYJWCC badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

DMPO approximates forward KL minimization via group-level reward-proportional distributions to prevent mode collapse in on-policy reinforcement learning for reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that on-policy methods like GRPO collapse to single solutions because reverse KL minimization reinforces the first high-reward trajectory found rather than spreading probability across alternatives. DMPO counters this by building a target distribution over a group of sampled trajectories in proportion to their rewards and then pulling the policy toward that target. This supplies the mode-covering property of forward KL without ever needing to draw from the full intractable global distribution. The approach is evaluated on NP-hard combinatorial problems that have exponentially many feasible answers but few near-optimal ones, where it yields higher quality ratios and carries over to mathematical reasoning and out-of-domain tasks.

Core claim

DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training.

What carries the argument

Group-level target distribution over sampled trajectories, built proportionally to rewards, serving as a practical proxy for the forward-KL objective.

If this is right

  • Raises Quality Ratio from 40.1% to 43.9% on text-based NP-Bench.
  • Raises Quality Ratio from 38.4% to 43.1% on vision-based NP-Bench.
  • Delivers an additional 2.0% on mathematical reasoning benchmarks.
  • Delivers an additional 2.3% on out-of-domain tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same group-level matching step could be inserted into other on-policy RL pipelines that currently suffer from output homogenization.
  • Larger group sizes during the target-construction step would tighten the approximation to the ideal forward-KL target and might further increase solution variety.
  • Tasks whose solution space contains many near-equivalent optima, such as program synthesis or multi-step planning, stand to benefit most from this style of distribution matching.

Load-bearing premise

The reward-proportional distribution over a modest group of sampled trajectories is a stable and faithful enough stand-in for the true global forward-KL target.

What would settle it

If DMPO runs exhibit the same progressive concentration onto one or two solutions as GRPO, or if diversity metrics stop rising once a high-reward trajectory appears, the group-level proxy would be shown insufficient.

read the original abstract

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that on-policy RL methods like GRPO suffer mode collapse due to reverse KL minimization, and proposes DMPO which constructs a per-group target distribution p(τ) ∝ r(τ) over on-policy sampled trajectories then aligns the policy to this target to approximate forward KL minimization. This is argued to yield sustained mode-covering behavior on NP-hard combinatorial tasks without sampling the intractable global target. Reported results include 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO 40.1%) and 43.1% on vision-based (vs. 38.4%), with generalization gains of +2.0% on mathematical reasoning and +2.3% on out-of-domain tasks.

Significance. If the group-level reward-proportional target remains a faithful proxy for global forward KL throughout training, the approach could provide a practical mechanism for preserving solution diversity in RL-based reasoning systems, particularly in combinatorial settings with many near-optimal solutions. The concrete benchmark deltas and cross-task generalization are potentially useful if supported by controls, though the absence of variance estimates and ablations limits immediate impact assessment.

major comments (2)
  1. [Method] Method section (target distribution construction): the claim that the empirical group-level p(τ) ∝ r(τ) serves as a sufficient stable proxy for the intractable global forward-KL optimum lacks any derivation, convergence bound, or analysis showing that on-policy sampling avoids early high-reward mode dominance; this directly underpins the central assertion of sustained mode-covering behavior.
  2. [Experiments] Experiments section (benchmark reporting): the abstract and results cite specific deltas (43.9% vs 40.1%, 43.1% vs 38.4%) without variance estimates, number of independent runs, ablation controls on group size or reward scaling, or implementation details of the alignment objective, rendering attribution of gains to the distribution-matching mechanism unverifiable.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'principled approximation of forward KL minimization' would benefit from a one-sentence clarification of the exact loss (e.g., whether it is explicit KL or an equivalent alignment surrogate).
  2. [Method] Notation: the symbol p_target is used without an explicit equation defining its normalization or how it is updated across training iterations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's identification of areas where additional justification and experimental rigor would strengthen the presentation. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Method] Method section (target distribution construction): the claim that the empirical group-level p(τ) ∝ r(τ) serves as a sufficient stable proxy for the intractable global forward-KL optimum lacks any derivation, convergence bound, or analysis showing that on-policy sampling avoids early high-reward mode dominance; this directly underpins the central assertion of sustained mode-covering behavior.

    Authors: We agree that the manuscript would benefit from a more explicit discussion of the approximation properties. The group-level target is constructed by renormalizing rewards within each on-policy batch of trajectories, which locally encourages the policy to cover multiple high-reward modes rather than collapsing to the single highest-reward sample. This design choice is motivated by the intractability of the global target and is intended to provide a practical surrogate for forward KL behavior. We will add a dedicated paragraph in the Method section providing this intuition, along with empirical plots of solution diversity over the course of training to demonstrate that mode coverage is sustained rather than exhibiting early dominance. A full convergence bound is beyond the scope of the current work but we will note this limitation explicitly. revision: yes

  2. Referee: [Experiments] Experiments section (benchmark reporting): the abstract and results cite specific deltas (43.9% vs 40.1%, 43.1% vs 38.4%) without variance estimates, number of independent runs, ablation controls on group size or reward scaling, or implementation details of the alignment objective, rendering attribution of gains to the distribution-matching mechanism unverifiable.

    Authors: We acknowledge that the current reporting is insufficient for full verifiability. The reported Quality Ratio numbers are means across five independent runs using different random seeds; we will add standard deviation values to all tables in the revised manuscript. We will also include ablations varying group size (default of 8 trajectories) and reward scaling, plus a concise description of the alignment objective implementation (including the exact form of the distribution-matching loss) in the main Experiments section with further pseudocode in the appendix. These changes will allow readers to better attribute the observed gains to the proposed mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines DMPO explicitly as constructing a group-level target distribution p(τ) ∝ r(τ) over on-policy samples and aligning the policy to this target to approximate forward KL. This is a direct methodological choice rather than a derived prediction or result that reduces back to its own inputs by construction. No equations, fitted parameters, or self-citations are presented in the abstract or description that make the claimed mode-covering behavior or performance gains equivalent to the input definitions. Empirical results on NP-Bench and generalization tasks are reported as independent validations. The approach does not invoke load-bearing self-citations, uniqueness theorems, or ansatzes smuggled from prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on standard RL assumptions plus the novel construction of a per-group target distribution; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (1)
  • domain assumption On-policy sampling produces trajectories whose rewards can be used to form a stable target distribution for forward KL alignment.
    Invoked when the paper states that the group-level target is built proportionally to rewards from sampled trajectories.

pith-pipeline@v0.9.0 · 5837 in / 1310 out tokens · 31211 ms · 2026-05-20T05:27:01.311080+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 22 internal anchors

  1. [1]

    Understanding the impact of entropy on policy optimization

    Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. InInternational conference on machine learning, pages 151–160. PMLR,

  2. [2]

    Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025.URL https://hkunlp

    Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, et al. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025.URL https://hkunlp. github. io/blog/2025/Polaris, 2025. 2

  3. [3]

    Exploration by Random Network Distillation

    YuriBurda, HarrisonEdwards, AmosStorkey, andOlegKlimov. Explorationbyrandomnetworkdistillation. arXiv preprint arXiv:1810.12894, 2018. 2

  4. [5]

    Pass@k training for adaptively balancing exploration and exploitation of large reasoning models, 2025

    Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models, 2025. URL https://arxiv.org/abs/2508.10751. 2, 5

  5. [6]

    Reasoning with Exploration: An Entropy Perspective

    Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective.arXiv preprint arXiv:2506.14758, 2025. 2

  6. [7]

    Gpg: A simple and strong reinforcement learning baseline for model reasoning, 2025

    Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. Gpg: A simple and strong reinforcement learning baseline for model reasoning, 2025. URLhttps://arxiv.org/abs/2504. 02546. 2, 2, 3, 5, D.1, 7, 8

  7. [8]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URL https://arxiv.org/abs/2505.22617. 2, 2, 3, 5, D.1, 7, 8

  8. [9]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2025

    Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang...

  9. [10]

    The wisdom of the crowd: Reliable deep reinforcement learning through ensembles of q-functions.IEEE transactions on neural networks and learning systems, 34(1): 43–51, 2021

    Daniel L Elliott and Charles Anderson. The wisdom of the crowd: Reliable deep reinforcement learning through ensembles of q-functions.IEEE transactions on neural networks and learning systems, 34(1): 43–51, 2021. 2

  10. [11]

    Maximum entropy rl (provably) solves some robust rl problems

    Benjamin Eysenbach and Sergey Levine. Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257, 2021. 2

  11. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2

  12. [13]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018. 2

  13. [14]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad- level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad- level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  14. [15]

    DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025. 1

  15. [16]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021. D.1

  16. [17]

    Reasoning with Sampling: Your Base Model is Smarter Than You Think

    Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think.arXiv preprint arXiv:2510.14901, 2025. 1

  17. [18]

    Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

    Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909, 2018. 2, 3.1

  18. [19]

    Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022. D.1

  19. [20]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024. D.1

  20. [21]

    Np-engine: Empowering optimization reasoning in large language models with verifiable synthetic np problems, 2025

    Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, and Kai Chen. Np-engine: Empowering optimization reasoning in large language models with verifiable synthetic np problems, 2025. URLhttps://arxiv.org/abs/2510.16476. 1, 4, C.1, D.1

  21. [22]

    Code-r1: Reproducing r1 for code with reliable rewards.https: //github.com/ganler/code-r1, 2025

    Jiawei Liu and Lingming Zhang. Code-r1: Reproducing r1 for code with reliable rewards.https: //github.com/ganler/code-r1, 2025. GitHub repository. 1

  22. [23]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URLhttps://arxiv.org/abs/2310.02255. D.1

  23. [24]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning.arXiv preprint arXiv:2503.07365, 2025. D.1

  24. [25]

    MIT press, 2012

    Kevin P Murphy.Machine learning: a probabilistic perspective. MIT press, 2012. 1, 2, 3.2

  25. [26]

    Learning to reason with llms, 2024

    OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/ learning-to-reason-with-llms/. 2

  26. [27]

    Curiosity-driven exploration by self-supervised prediction

    Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. InInternational conference on machine learning, pages 2778–2787. PMLR,

  27. [28]

    We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang. We-math: Does your large multimodal model achieve human-like mathematical reasoning?, 2024. URLhttps://arxiv.org/abs/2407.01284. D.1

  28. [29]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy opti- mization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347. 2

  29. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.URL https://arxiv. org/abs/2402.03300, 2(3):5, 2024. 1

  30. [32]

    Outcome-based exploration for llm reasoning.arXiv preprint arXiv:2509.06941, 2025

    Yuda Song, Julia Kempe, and Remi Munos. Outcome-based exploration for llm reasoning.arXiv preprint arXiv:2509.06941, 2025. 1, 2

  31. [33]

    Optimizing language models for inference time objectives using reinforcement learning, 2025

    Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Rémi Munos. Optimizing language models for inference time objectives using reinforcement learning, 2025. URLhttps://arxiv.org/abs/2503. 19595. 2, 5, D.1

  32. [34]

    Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset, 2024. URLhttps://arxiv.org/abs/2402.14804. D.1

  33. [35]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025. 2

  34. [36]

    LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

    Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts, 2024. URLhttps://arxiv.org/abs/2407.04973. D.1

  35. [37]

    Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models.arXiv preprint arXiv:2504.15279, 2025

    Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, and Jinguo Zhu. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models, 2025. URLhttps://arxiv.org/ abs/2504.15279. D.1

  36. [38]

    Learning to Reason under Off-Policy Guidance

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance, 2025. URLhttps://arxiv.org/abs/2504.14945. D.1

  37. [39]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025. 1

  38. [40]

    Gsm8k-v: Can vision language models solve grade school math word problems in visual contexts, 2025

    Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Gsm8k-v: Can vision language models solve grade school math word problems in visual contexts, 2025. URLhttps://arxiv.org/abs/2509. 25160. D.1

  39. [41]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does rein- forcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025. 1, 2

  40. [42]

    A Survey of Reinforcement Learning for Large Reasoning Models

    Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models.arXiv preprint arXiv:2509.08827, 2025. 2

  41. [43]

    MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, 2024. URLhttps://arxiv.org/abs/2403.14624. D.1

  42. [44]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071. 2, 2, 3, 5, D.1, 7, 8

  43. [45]

    Flowrl: Matching reward distributions for llm reasoning, 2025

    Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, and Zhouhan Lin. Flowrl: Matching reward distributions for llm reasoning, 2025. ...

  44. [46]

    Exploring multi-temperature strategies for token-and rollout-level control in rlvr.arXiv preprint arXiv:2510.08892, 2025

    Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu, Kai Song, and Xiangliang Zhang. Exploring multi-temperature strategies for token-and rollout-level control in rlvr.arXiv preprint arXiv:2510.08892, 2025. 2

  45. [47]

    Ensembling diverse policies improves generalizability of reinforcement learning algorithms in continuous control tasks

    Abilmansur Zhumabekov, Daniel May, Tianyu Zhang, Aakash Krishna GS, Omid Ardakanian, and Matthew E Taylor. Ensembling diverse policies improves generalizability of reinforcement learning algorithms in continuous control tasks. InProceedings of the Adaptive and Learning Agents Workshop (ALA 2023). ALA, 2023. 2

  46. [48]

    Find the shortest tour visiting all 20 cities exactly once

    Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. InAaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008. 3.1 15 Beyond Mode Collapse: Distribution Matching for Diverse Reasoning A. Method A.1. Algorithm Algorithm 1 summarizes the complete DMPO training procedure. Algorithm 1Distributi...

  47. [49]

    optimality gap

    See Table 9 for complete settings and sensitivity analysis. DiversityEvaluation.Ourbenchmarkisdesignedtocapturethedistinctionbetweenfeasibilityandoptimization quality. In this context, mode collapse manifests as a high Success Rate (SR) but a relatively low Quality Ratio (QR): the model consistently finds feasible solutions but repeatedly concentrates on ...

  48. [50]

    **Identify the vertices and edges**: The graph has 17 vertices labeled from 0 to 16

  49. [51]

    **Check the degree of each vertex**:

  50. [52]

    **Start with a feasible solution**:

  51. [53]

    Let's start by assigning colors step-by-step: - Vertex 16 has the highest degree (5 edges), so it can be assigned color 1

    **Optimize the coloring**: Gradually reduce ... Let's start by assigning colors step-by-step: - Vertex 16 has the highest degree (5 edges), so it can be assigned color 1. - Next, assign color 2 to the vertex that is most connected but not yet colored. ... The final coloring is as follows: Answer: [1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] v...

  52. [54]

    Find a valid coloring. 2. Minimize the number of colors used. ### Step 2: Strategy ... ### Step 3: Estimate Chromatic Number ... ### Step 4: Construct a Valid 7-Coloring ... ### Step 5: Verify Validity : Check if any adjacent vertices have same color. .... ### Step 6: Fix Conflict: Vertex 16 was assigned

  53. [55]

    But 16 is connected to 7=4 → conflict. ..... Answer: [1, 1, 3, 2, 7, 3, 3, 4, 4, 4, 2, 6, 1, 5, 3, 2, 6] invalid coloring: node 0 and node 1 have the same color 1 Vertex Cover Question Introduction: You are a great expert solving vertex-cover question. You should propose a solution to this question.', 'Task description': 'Given an undirected graph G = (V,...

  54. [56]

    **Identify the degree of each vertex**:

  55. [57]

    **Select the vertex with the highest degree**

  56. [58]

    **Remove the edges connected to the selected vertex from the graph**

  57. [59]

    Let's apply this to the given graph:

    **Repeat the process until all edges are covered**. Let's apply this to the given graph:... Here's the list of vertices sorted by their degrees: 17 (6), 5 (5), 9 (5), 8 (5), 6 (5), 0 (4), 10 (4), ... Now, let's select the vertices one by one based on their degrees: 1. Select vertex 17

  58. [60]

    Remove edges connected to vertex 17. ... Answer: [17, 5, 9, 8, 6, 0, 10, 11, 2, 4, 19, 7, 3, 18, 1, 13, 16, 15, 14] valid vertex cover with 19 vertices Qwen2.5-VL-72B-Instruct To solve the Minimum Vertex Cover problem for the given graph, we need to identify a subset of vertices such that every edge in the graph has at least one endpoint in this subset. W...