Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
Pith reviewed 2026-05-20 05:27 UTC · model grok-4.3
The pith
DMPO approximates forward KL minimization via group-level reward-proportional distributions to prevent mode collapse in on-policy reinforcement learning for reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training.
What carries the argument
Group-level target distribution over sampled trajectories, built proportionally to rewards, serving as a practical proxy for the forward-KL objective.
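As a rough illustration of the construction described above, the sketch below turns one group's rewards into a reward-proportional target. It takes "proportional to their rewards" literally; the function name `group_target`, the uniform fallback for an all-zero group, and the example rewards are assumptions made for illustration, not details taken from the paper. A passage quoted further down this page describes the target as a Boltzmann distribution, in which case the division would become a softmax over r/α.

```python
def group_target(rewards):
    """Return p_i = r_i / sum_j r_j over one group of sampled trajectories."""
    if any(r < 0 for r in rewards):
        raise ValueError("reward-proportional weights need non-negative rewards")
    total = sum(rewards)
    if total == 0:
        # Assumption: with no reward signal in the group, fall back to uniform.
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]


# Hypothetical rewards for a group of four sampled trajectories.
rewards = [0.0, 0.5, 1.0, 0.5]
target = group_target(rewards)
print(target)  # [0.0, 0.25, 0.5, 0.25]
```

Every trajectory with non-zero reward keeps some target mass, which is the property the mode-covering argument relies on.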
If this is right
- Raises Quality Ratio from 40.1% to 43.9% on text-based NP-Bench.
- Raises Quality Ratio from 38.4% to 43.1% on vision-based NP-Bench.
- Delivers an additional 2.0% on mathematical reasoning benchmarks.
- Delivers an additional 2.3% on out-of-domain tasks.
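The relative improvements implied by the two NP-Bench bullets can be checked with a few lines of arithmetic. This is only a sanity check on the figures quoted above; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


text_gain = relative_gain(40.1, 43.9)    # text-based NP-Bench, GRPO -> DMPO
vision_gain = relative_gain(38.4, 43.1)  # vision-based NP-Bench, GRPO -> DMPO
print(f"text: {text_gain:.1f}%  vision: {vision_gain:.1f}%")
```

The results, about 9.5% and 12.2%, are in line with the 9% and 12% relative improvements quoted in the abstract.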
Where Pith is reading between the lines
- The same group-level matching step could be inserted into other on-policy RL pipelines that currently suffer from output homogenization.
- Larger group sizes during the target-construction step would tighten the approximation to the ideal forward-KL target and might further increase solution variety.
- Tasks whose solution space contains many near-equivalent optima, such as program synthesis or multi-step planning, stand to benefit most from this style of distribution matching.
Load-bearing premise
The reward-proportional distribution over a modest group of sampled trajectories is a stable and faithful enough stand-in for the true global forward-KL target.
What would settle it
If DMPO runs exhibit the same progressive concentration onto one or two solutions as GRPO, or if diversity metrics stop rising once a high-reward trajectory appears, the group-level proxy would be shown insufficient.
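One way such a check could be run is to log simple concentration statistics for each sampled group during training. The sketch below is an assumption about how that might look: the two metrics (fraction of distinct solutions and Shannon entropy of the empirical solution distribution) and the name `diversity_stats` are chosen for illustration, since this page does not say which diversity metrics the paper uses.

```python
import math
from collections import Counter


def diversity_stats(solutions):
    """Return (distinct fraction, Shannon entropy in nats) for one group.

    `solutions` must contain hashable items, e.g. tuples for colorings.
    """
    counts = Counter(solutions)
    n = len(solutions)
    distinct_fraction = len(counts) / n
    # Entropy written as sum p * log(1/p) so a collapsed group gives exactly 0.0.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return distinct_fraction, entropy


# Hypothetical groups of eight sampled solutions.
collapsed = ["A"] * 8
spread = ["A", "A", "B", "B", "C", "C", "D", "D"]
print(diversity_stats(collapsed))
print(diversity_stats(spread))
```

A run whose entropy decays toward zero after the first high-reward solution appears would match the collapse pattern described above.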
Original abstract
On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.
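The mode-seeking versus mode-covering contrast in the abstract can be seen on a toy discrete example. The sketch below is not from the paper: the four-outcome target and the two candidate policies are made-up numbers, chosen so that one candidate collapses onto a single mode and the other spreads mass over all outcomes.

```python
import math


def kl(a, b):
    """KL(a || b) for discrete distributions with strictly positive entries."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


# Hypothetical bimodal target: two good solutions, two poor ones.
p = [0.49, 0.01, 0.01, 0.49]
q_mode = [0.97, 0.01, 0.01, 0.01]   # collapsed onto the first mode
q_cover = [0.25, 0.25, 0.25, 0.25]  # spread over everything

forward_mode, forward_cover = kl(p, q_mode), kl(p, q_cover)  # forward KL: KL(p || q)
reverse_mode, reverse_cover = kl(q_mode, p), kl(q_cover, p)  # reverse KL: KL(q || p)

print("forward KL prefers:", "cover" if forward_cover < forward_mode else "mode")
print("reverse KL prefers:", "mode" if reverse_mode < reverse_cover else "cover")
```

Forward KL charges heavily for the target mass the collapsed policy misses on the second mode, while reverse KL charges the spread policy for placing mass where the target has little, which is the asymmetry the abstract appeals to.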
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that on-policy RL methods like GRPO suffer mode collapse due to reverse KL minimization, and proposes DMPO which constructs a per-group target distribution p(τ) ∝ r(τ) over on-policy sampled trajectories then aligns the policy to this target to approximate forward KL minimization. This is argued to yield sustained mode-covering behavior on NP-hard combinatorial tasks without sampling the intractable global target. Reported results include 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO 40.1%) and 43.1% on vision-based (vs. 38.4%), with generalization gains of +2.0% on mathematical reasoning and +2.3% on out-of-domain tasks.
Significance. If the group-level reward-proportional target remains a faithful proxy for global forward KL throughout training, the approach could provide a practical mechanism for preserving solution diversity in RL-based reasoning systems, particularly in combinatorial settings with many near-optimal solutions. The concrete benchmark deltas and cross-task generalization are potentially useful if supported by controls, though the absence of variance estimates and ablations limits immediate impact assessment.
major comments (2)
- [Method] Method section (target distribution construction): the claim that the empirical group-level p(τ) ∝ r(τ) serves as a sufficient stable proxy for the intractable global forward-KL optimum lacks any derivation, convergence bound, or analysis showing that on-policy sampling avoids early high-reward mode dominance; this directly underpins the central assertion of sustained mode-covering behavior.
- [Experiments] Experiments section (benchmark reporting): the abstract and results cite specific deltas (43.9% vs 40.1%, 43.1% vs 38.4%) without variance estimates, number of independent runs, ablation controls on group size or reward scaling, or implementation details of the alignment objective, rendering attribution of gains to the distribution-matching mechanism unverifiable.
minor comments (2)
- [Abstract] Abstract: the phrase 'principled approximation of forward KL minimization' would benefit from a one-sentence clarification of the exact loss (e.g., whether it is explicit KL or an equivalent alignment surrogate).
- [Method] Notation: the symbol p_target is used without an explicit equation defining its normalization or how it is updated across training iterations.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We appreciate the referee's identification of areas where additional justification and experimental rigor would strengthen the presentation. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [Method] Method section (target distribution construction): the claim that the empirical group-level p(τ) ∝ r(τ) serves as a sufficient stable proxy for the intractable global forward-KL optimum lacks any derivation, convergence bound, or analysis showing that on-policy sampling avoids early high-reward mode dominance; this directly underpins the central assertion of sustained mode-covering behavior.
  Authors: We agree that the manuscript would benefit from a more explicit discussion of the approximation properties. The group-level target is constructed by renormalizing rewards within each on-policy batch of trajectories, which locally encourages the policy to cover multiple high-reward modes rather than collapsing to the single highest-reward sample. This design choice is motivated by the intractability of the global target and is intended to provide a practical surrogate for forward KL behavior. We will add a dedicated paragraph in the Method section providing this intuition, along with empirical plots of solution diversity over the course of training to demonstrate that mode coverage is sustained rather than exhibiting early dominance. A full convergence bound is beyond the scope of the current work, but we will note this limitation explicitly.
  Revision: yes
- Referee: [Experiments] Experiments section (benchmark reporting): the abstract and results cite specific deltas (43.9% vs 40.1%, 43.1% vs 38.4%) without variance estimates, number of independent runs, ablation controls on group size or reward scaling, or implementation details of the alignment objective, rendering attribution of gains to the distribution-matching mechanism unverifiable.
  Authors: We acknowledge that the current reporting is insufficient for full verifiability. The reported Quality Ratio numbers are means across five independent runs using different random seeds; we will add standard deviation values to all tables in the revised manuscript. We will also include ablations varying group size (default of 8 trajectories) and reward scaling, plus a concise description of the alignment objective implementation (including the exact form of the distribution-matching loss) in the main Experiments section with further pseudocode in the appendix. These changes will allow readers to better attribute the observed gains to the proposed mechanism.
  Revision: yes
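The mean-and-standard-deviation reporting promised in this response could be produced along the following lines. This is a minimal sketch: `summarize_runs` is a hypothetical helper, and the input values are placeholders that do not correspond to any reported result.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) across independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder per-seed scores for five runs; NOT figures from the paper.
placeholder_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std = summarize_runs(placeholder_scores)
print(f"{mean:.1f} ± {std:.2f}")
```

`statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is the usual choice when only a handful of seeds are available.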
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper defines DMPO explicitly as constructing a group-level target distribution p(τ) ∝ r(τ) over on-policy samples and aligning the policy to this target to approximate forward KL. This is a direct methodological choice rather than a derived prediction or result that reduces back to its own inputs by construction. No equations, fitted parameters, or self-citations are presented in the abstract or description that make the claimed mode-covering behavior or performance gains equivalent to the input definitions. Empirical results on NP-Bench and generalization tasks are reported as independent validations. The approach does not invoke load-bearing self-citations, uniqueness theorems, or ansatzes smuggled from prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: On-policy sampling produces trajectories whose rewards can be used to form a stable target distribution for forward KL alignment.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards (a Boltzmann distribution) then aligns the policy distribution to this target... ℒ_DM(θ) = (1/G) Σ_i (p(o_i|O) − q_θ(o_i|O))²
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: forward KL minimization exhibits mode-covering behavior... p*(τ) = exp(r(τ)/α) / Z
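The two quoted formulas can be combined into a small numerical sketch: a Boltzmann target over one group of G samples and a mean squared difference between target and policy probabilities. Several things here are assumptions rather than facts from the paper: that the Boltzmann form is normalized within the group, that q_θ(o_i|O) is the policy probability renormalized over the same group (computed from sequence log-probabilities), and the names `softmax`, `dm_loss`, and `alpha`.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def dm_loss(rewards, seq_logprobs, alpha=1.0):
    """Mean squared gap between a group-level target p and policy q."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    p = softmax([r / alpha for r in rewards])  # target: exp(r / alpha) / Z within the group
    q = softmax(seq_logprobs)                  # assumption: policy probs renormalized in group
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(rewards)


# Hypothetical group of three samples.
rewards = [0.0, 1.0, 2.0]
matched = dm_loss(rewards, [0.0, 1.0, 2.0])  # policy already matches the target
uniform = dm_loss(rewards, [0.0, 0.0, 0.0])  # policy ignores the rewards
print(matched, uniform)
```

The loss is zero only when the renormalized policy agrees with the target on every sample in the group, so mass on lower-reward but non-zero-target samples is not pushed to zero.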
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.