Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

arxiv: 2605.19461 · v1 · pith:IJHYJWCCnew · submitted 2026-05-19 · 💻 cs.AI

Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen

Pith reviewed 2026-05-20 05:27 UTC · model grok-4.3

classification 💻 cs.AI

keywords: mode collapse, distribution matching, policy optimization, forward KL, reasoning, combinatorial optimization, reinforcement learning, diversity

The pith

DMPO approximates forward KL minimization via group-level reward-proportional distributions to prevent mode collapse in on-policy reinforcement learning for reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that on-policy methods like GRPO collapse to single solutions because reverse KL minimization reinforces the first high-reward trajectory found rather than spreading probability across alternatives. DMPO counters this by building a target distribution over a group of sampled trajectories in proportion to their rewards and then pulling the policy toward that target. This supplies the mode-covering property of forward KL without ever needing to draw from the full intractable global distribution. The approach is evaluated on NP-hard combinatorial problems that have exponentially many feasible answers but few near-optimal ones, where it yields higher quality ratios and carries over to mathematical reasoning and out-of-domain tasks.
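The mode-seeking versus mode-covering contrast can be checked on a toy discrete example. The sketch below is illustrative only and is not taken from the paper: the four-outcome target, the two candidate policies, and the helper name `kl` are all assumptions chosen so that the target has two equally good modes.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical target with two equally good modes (outcomes 0 and 3).
target = [0.49, 0.01, 0.01, 0.49]

# Two hypothetical policies: one collapsed onto a single mode, one spread out.
candidates = {
    "collapsed": [0.97, 0.01, 0.01, 0.01],
    "covering": [0.25, 0.25, 0.25, 0.25],
}

# Reverse KL puts the policy first; forward KL puts the target first.
reverse_kl = {name: kl(q, target) for name, q in candidates.items()}
forward_kl = {name: kl(target, q) for name, q in candidates.items()}

for name in candidates:
    print(f"{name}: reverse={reverse_kl[name]:.3f} forward={forward_kl[name]:.3f}")
```

On these numbers reverse KL scores the collapsed policy as the better fit, while forward KL prefers the covering one, which is the asymmetry the summary above describes.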

Core claim

DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training.

What carries the argument

Group-level target distribution over sampled trajectories, built proportionally to rewards, serving as a practical proxy for the forward-KL objective.
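A minimal sketch of what such a group-level target and matching loss could look like, under stated assumptions. The passage quoted in the Lean-theorem section of this page describes the target as a Boltzmann distribution and the loss as a mean squared difference between target and policy probabilities within a group; the code follows that reading. The function names, the temperature `alpha`, and the choice to renormalize the policy over the group by a softmax of sequence log-probabilities are hypothetical and may differ from the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def group_target(rewards, alpha=1.0):
    """Assumed target: p(o_i | O) proportional to exp(r_i / alpha) within the group."""
    return softmax([r / alpha for r in rewards])

def group_policy(seq_logprobs):
    """Assumed policy view: sequence log-probs renormalized over the same group."""
    return softmax(seq_logprobs)

def dm_loss(rewards, seq_logprobs, alpha=1.0):
    """Mean squared difference between group target and group policy."""
    p = group_target(rewards, alpha)
    q = group_policy(seq_logprobs)
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)

# Hypothetical group of four sampled trajectories, two of them equally good.
rewards = [1.0, 1.0, 0.2, 0.0]
collapsed_logprobs = [0.0, -8.0, -8.0, -8.0]  # nearly all mass on one sample
matched_logprobs = list(rewards)              # equals rewards / alpha at alpha = 1

print(dm_loss(rewards, collapsed_logprobs))
print(dm_loss(rewards, matched_logprobs))
```

In this toy setup the collapsed policy incurs a larger loss than the matched one, because the target splits its mass between the two equally rewarded samples rather than favoring whichever was found first.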

If this is right

Raises Quality Ratio from 40.1% to 43.9% on text-based NP-Bench.
Raises Quality Ratio from 38.4% to 43.1% on vision-based NP-Bench.
Delivers an additional 2.0% on mathematical reasoning benchmarks.
Delivers an additional 2.3% on out-of-domain tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same group-level matching step could be inserted into other on-policy RL pipelines that currently suffer from output homogenization.
Larger group sizes during the target-construction step would tighten the approximation to the ideal forward-KL target and might further increase solution variety.
Tasks whose solution space contains many near-equivalent optima, such as program synthesis or multi-step planning, stand to benefit most from this style of distribution matching.

Load-bearing premise

The reward-proportional distribution over a modest group of sampled trajectories is a stable and faithful enough stand-in for the true global forward-KL target.

What would settle it

If DMPO runs exhibit the same progressive concentration onto one or two solutions as GRPO, or if diversity metrics stop rising once a high-reward trajectory appears, the group-level proxy would be shown insufficient.
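One way to make this check concrete is to track simple diversity statistics over the solutions sampled for a single problem at successive checkpoints. The sketch below is an assumption, not the paper's metric: it reports the fraction of distinct solutions and the Shannon entropy of the empirical solution distribution, and the function name `diversity` and the example groups are hypothetical.

```python
import math
from collections import Counter

def diversity(solutions):
    """Return (distinct fraction, Shannon entropy in nats) for hashable solutions."""
    counts = Counter(solutions)
    n = len(solutions)
    distinct_fraction = len(counts) / n
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return distinct_fraction, entropy

# Hypothetical groups of four sampled answers, encoded as tuples.
collapsed_group = [(1, 2, 3)] * 4
varied_group = [(1, 2, 3), (1, 3, 2), (2, 1, 3), (3, 2, 1)]

print(diversity(collapsed_group))
print(diversity(varied_group))
```

A run that concentrates onto one solution would show both numbers falling toward their minimum over training, while sustained exploration would keep them away from it.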

Original abstract

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.
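The relative improvements quoted in the abstract follow directly from the reported Quality Ratios. A short check, using only the four numbers stated above (the helper name `relative_gain` is illustrative):

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

text_gain = relative_gain(43.9, 40.1)    # text-based NP-Bench, DMPO vs. GRPO
vision_gain = relative_gain(43.1, 38.4)  # vision-based NP-Bench, DMPO vs. GRPO

print(f"text: {text_gain:.1f}%  vision: {vision_gain:.1f}%")
```

This gives roughly 9.5% and 12.2%, consistent with the rounded 9% and 12% figures in the abstract.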

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DMPO gives a workable group-level way to push toward forward KL in on-policy RL for reasoning, with modest benchmark lifts, but the stability of the reward-proportional target under early high-reward bias is the open question.

The paper's core move is to replace the usual reverse-KL on-policy update with a target distribution built inside each group of sampled trajectories, weighted by their rewards, and then pull the policy toward that target. This is presented as a practical stand-in for forward KL that avoids drawing from the full intractable optimum distribution. The reported numbers are a 43.9% Quality Ratio on text-based NP-Bench versus GRPO's 40.1%, and 43.1% versus 38.4% on the vision-based version, plus smaller gains on math (+2.0%) and out-of-domain (+2.3%) sets. Those deltas are the main empirical claim, and they are at least internally consistent with the diversity goal.

Referee Report

2 major / 2 minor

Summary. The paper claims that on-policy RL methods like GRPO suffer mode collapse due to reverse KL minimization, and proposes DMPO which constructs a per-group target distribution p(τ) ∝ r(τ) over on-policy sampled trajectories then aligns the policy to this target to approximate forward KL minimization. This is argued to yield sustained mode-covering behavior on NP-hard combinatorial tasks without sampling the intractable global target. Reported results include 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO 40.1%) and 43.1% on vision-based (vs. 38.4%), with generalization gains of +2.0% on mathematical reasoning and +2.3% on out-of-domain tasks.

Significance. If the group-level reward-proportional target remains a faithful proxy for global forward KL throughout training, the approach could provide a practical mechanism for preserving solution diversity in RL-based reasoning systems, particularly in combinatorial settings with many near-optimal solutions. The concrete benchmark deltas and cross-task generalization are potentially useful if supported by controls, though the absence of variance estimates and ablations limits immediate impact assessment.

major comments (2)

[Method] Method section (target distribution construction): the claim that the empirical group-level p(τ) ∝ r(τ) serves as a sufficient stable proxy for the intractable global forward-KL optimum lacks any derivation, convergence bound, or analysis showing that on-policy sampling avoids early high-reward mode dominance; this directly underpins the central assertion of sustained mode-covering behavior.
[Experiments] Experiments section (benchmark reporting): the abstract and results cite specific deltas (43.9% vs 40.1%, 43.1% vs 38.4%) without variance estimates, number of independent runs, ablation controls on group size or reward scaling, or implementation details of the alignment objective, rendering attribution of gains to the distribution-matching mechanism unverifiable.

minor comments (2)

[Abstract] Abstract: the phrase 'principled approximation of forward KL minimization' would benefit from a one-sentence clarification of the exact loss (e.g., whether it is explicit KL or an equivalent alignment surrogate).
[Method] Notation: the symbol p_target is used without an explicit equation defining its normalization or how it is updated across training iterations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's identification of areas where additional justification and experimental rigor would strengthen the presentation. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Method] Method section (target distribution construction): the claim that the empirical group-level p(τ) ∝ r(τ) serves as a sufficient stable proxy for the intractable global forward-KL optimum lacks any derivation, convergence bound, or analysis showing that on-policy sampling avoids early high-reward mode dominance; this directly underpins the central assertion of sustained mode-covering behavior.

Authors: We agree that the manuscript would benefit from a more explicit discussion of the approximation properties. The group-level target is constructed by renormalizing rewards within each on-policy batch of trajectories, which locally encourages the policy to cover multiple high-reward modes rather than collapsing to the single highest-reward sample. This design choice is motivated by the intractability of the global target and is intended to provide a practical surrogate for forward KL behavior. We will add a dedicated paragraph in the Method section providing this intuition, along with empirical plots of solution diversity over the course of training to demonstrate that mode coverage is sustained rather than exhibiting early dominance. A full convergence bound is beyond the scope of the current work but we will note this limitation explicitly. revision: yes
Referee: [Experiments] Experiments section (benchmark reporting): the abstract and results cite specific deltas (43.9% vs 40.1%, 43.1% vs 38.4%) without variance estimates, number of independent runs, ablation controls on group size or reward scaling, or implementation details of the alignment objective, rendering attribution of gains to the distribution-matching mechanism unverifiable.

Authors: We acknowledge that the current reporting is insufficient for full verifiability. The reported Quality Ratio numbers are means across five independent runs using different random seeds; we will add standard deviation values to all tables in the revised manuscript. We will also include ablations varying group size (default of 8 trajectories) and reward scaling, plus a concise description of the alignment objective implementation (including the exact form of the distribution-matching loss) in the main Experiments section with further pseudocode in the appendix. These changes will allow readers to better attribute the observed gains to the proposed mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines DMPO explicitly as constructing a group-level target distribution p(τ) ∝ r(τ) over on-policy samples and aligning the policy to this target to approximate forward KL. This is a direct methodological choice rather than a derived prediction or result that reduces back to its own inputs by construction. No equations, fitted parameters, or self-citations are presented in the abstract or description that make the claimed mode-covering behavior or performance gains equivalent to the input definitions. Empirical results on NP-Bench and generalization tasks are reported as independent validations. The approach does not invoke load-bearing self-citations, uniqueness theorems, or ansatzes smuggled from prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard RL assumptions plus the novel construction of a per-group target distribution; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (1)

Domain assumption: On-policy sampling produces trajectories whose rewards can be used to form a stable target distribution for forward KL alignment.
Invoked when the paper states that the group-level target is built proportionally to rewards from sampled trajectories.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards (a Boltzmann distribution) then aligns the policy distribution to this target... $\mathcal{L}_{\mathrm{DM}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \left( p(o_i \mid O) - q_\theta(o_i \mid O) \right)^2$

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem.

forward KL minimization exhibits mode-covering behavior... $p^*(\tau) = \exp(r(\tau)/\alpha) / Z$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 22 internal anchors

[1]

Understanding the impact of entropy on policy optimization

Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. InInternational conference on machine learning, pages 151–160. PMLR,

work page
[2]

Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025.URL https://hkunlp

Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, et al. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025.URL https://hkunlp. github. io/blog/2025/Polaris, 2025. 2

work page 2025
[3]

Exploration by Random Network Distillation

YuriBurda, HarrisonEdwards, AmosStorkey, andOlegKlimov. Explorationbyrandomnetworkdistillation. arXiv preprint arXiv:1810.12894, 2018. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

Pass@k training for adaptively balancing exploration and exploitation of large reasoning models, 2025

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models, 2025. URL https://arxiv.org/abs/2508.10751. 2, 5

work page arXiv 2025
[6]

Reasoning with Exploration: An Entropy Perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective.arXiv preprint arXiv:2506.14758, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Gpg: A simple and strong reinforcement learning baseline for model reasoning, 2025

Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. Gpg: A simple and strong reinforcement learning baseline for model reasoning, 2025. URLhttps://arxiv.org/abs/2504. 02546. 2, 2, 3, 5, D.1, 7, 8

work page 2025
[8]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URL https://arxiv.org/abs/2505.22617. 2, 2, 3, 5, D.1, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2025

Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang...

work page arXiv 2025
[10]

The wisdom of the crowd: Reliable deep reinforcement learning through ensembles of q-functions.IEEE transactions on neural networks and learning systems, 34(1): 43–51, 2021

Daniel L Elliott and Charles Anderson. The wisdom of the crowd: Reliable deep reinforcement learning through ensembles of q-functions.IEEE transactions on neural networks and learning systems, 34(1): 43–51, 2021. 2

work page 2021
[11]

Maximum entropy rl (provably) solves some robust rl problems

Benjamin Eysenbach and Sergey Levine. Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257, 2021. 2

work page arXiv 2021
[12]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018. 2

work page 2018
[14]

Olympiadbench: A challenging benchmark for promoting agi with olympiad- level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad- level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

work page 2024
[15]

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021. D.1

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

Reasoning with Sampling: Your Base Model is Smarter Than You Think

Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think.arXiv preprint arXiv:2510.14901, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909, 2018. 2, 3.1

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022. D.1

work page 2022
[20]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024. D.1

work page 2024
[21]

Np-engine: Empowering optimization reasoning in large language models with verifiable synthetic np problems, 2025

Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, and Kai Chen. Np-engine: Empowering optimization reasoning in large language models with verifiable synthetic np problems, 2025. URLhttps://arxiv.org/abs/2510.16476. 1, 4, C.1, D.1

work page arXiv 2025
[22]

Code-r1: Reproducing r1 for code with reliable rewards.https: //github.com/ganler/code-r1, 2025

Jiawei Liu and Lingming Zhang. Code-r1: Reproducing r1 for code with reliable rewards.https: //github.com/ganler/code-r1, 2025. GitHub repository. 1

work page 2025
[23]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URLhttps://arxiv.org/abs/2310.02255. D.1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning.arXiv preprint arXiv:2503.07365, 2025. D.1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

MIT press, 2012

Kevin P Murphy.Machine learning: a probabilistic perspective. MIT press, 2012. 1, 2, 3.2

work page 2012
[26]

Learning to reason with llms, 2024

OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/ learning-to-reason-with-llms/. 2

work page 2024
[27]

Curiosity-driven exploration by self-supervised prediction

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. InInternational conference on machine learning, pages 2778–2787. PMLR,

work page
[28]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang. We-math: Does your large multimodal model achieve human-like mathematical reasoning?, 2024. URLhttps://arxiv.org/abs/2407.01284. D.1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy opti- mization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.URL https://arxiv. org/abs/2402.03300, 2(3):5, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Outcome-based exploration for llm reasoning.arXiv preprint arXiv:2509.06941, 2025

Yuda Song, Julia Kempe, and Remi Munos. Outcome-based exploration for llm reasoning.arXiv preprint arXiv:2509.06941, 2025. 1, 2

work page arXiv 2025
[33]

Optimizing language models for inference time objectives using reinforcement learning, 2025

Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Rémi Munos. Optimizing language models for inference time objectives using reinforcement learning, 2025. URLhttps://arxiv.org/abs/2503. 19595. 2, 5, D.1

work page 2025
[34]

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset, 2024. URLhttps://arxiv.org/abs/2402.14804. D.1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts, 2024. URLhttps://arxiv.org/abs/2407.04973. D.1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models.arXiv preprint arXiv:2504.15279, 2025

Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, and Jinguo Zhu. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models, 2025. URLhttps://arxiv.org/ abs/2504.15279. D.1

work page arXiv 2025
[38]

Learning to Reason under Off-Policy Guidance

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance, 2025. URLhttps://arxiv.org/abs/2504.14945. D.1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Gsm8k-v: Can vision language models solve grade school math word problems in visual contexts, 2025

Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Gsm8k-v: Can vision language models solve grade school math word problems in visual contexts, 2025. URLhttps://arxiv.org/abs/2509. 25160. D.1

work page 2025
[41]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does rein- forcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models.arXiv preprint arXiv:2509.08827, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, 2024. URLhttps://arxiv.org/abs/2403.14624. D.1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071. 2, 2, 3, 5, D.1, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Flowrl: Matching reward distributions for llm reasoning, 2025

Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, and Zhouhan Lin. Flowrl: Matching reward distributions for llm reasoning, 2025. ...

work page arXiv 2025
[46]

Exploring multi-temperature strategies for token-and rollout-level control in rlvr.arXiv preprint arXiv:2510.08892, 2025

Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu, Kai Song, and Xiangliang Zhang. Exploring multi-temperature strategies for token-and rollout-level control in rlvr.arXiv preprint arXiv:2510.08892, 2025. 2

work page arXiv 2025
[47]

Ensembling diverse policies improves generalizability of reinforcement learning algorithms in continuous control tasks

Abilmansur Zhumabekov, Daniel May, Tianyu Zhang, Aakash Krishna GS, Omid Ardakanian, and Matthew E Taylor. Ensembling diverse policies improves generalizability of reinforcement learning algorithms in continuous control tasks. InProceedings of the Adaptive and Learning Agents Workshop (ALA 2023). ALA, 2023. 2

work page 2023
[48]

Find the shortest tour visiting all 20 cities exactly once

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. InAaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008. 3.1 15 Beyond Mode Collapse: Distribution Matching for Diverse Reasoning A. Method A.1. Algorithm Algorithm 1 summarizes the complete DMPO training procedure. Algorithm 1Distributi...

work page 2008
See Table 9 for complete settings and sensitivity analysis.

Diversity Evaluation. Our benchmark is designed to capture the distinction between feasibility and optimization quality. In this context, mode collapse manifests as a high Success Rate (SR) but a relatively low Quality Ratio (QR): the model consistently finds feasible solutions but repeatedly concentrates on ...
1. **Identify the vertices and edges**: The graph has 17 vertices labeled from 0 to 16
2. **Check the degree of each vertex**:
3. **Start with a feasible solution**:
4. **Optimize the coloring**: Gradually reduce ...

Let's start by assigning colors step-by-step:
- Vertex 16 has the highest degree (5 edges), so it can be assigned color 1.
- Next, assign color 2 to the vertex that is most connected but not yet colored.
...

The final coloring is as follows: Answer: [1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] v...
1. Find a valid coloring. 2. Minimize the number of colors used.
### Step 2: Strategy ...
### Step 3: Estimate Chromatic Number ...
### Step 4: Construct a Valid 7-Coloring ...
### Step 5: Verify Validity: Check if any adjacent vertices have same color. ....
### Step 6: Fix Conflict: Vertex 16 was assigned
But 16 is connected to 7=4 → conflict. .....
Answer: [1, 1, 3, 2, 7, 3, 3, 4, 4, 4, 2, 6, 1, 5, 3, 2, 6]
invalid coloring: node 0 and node 1 have the same color 1

Vertex Cover
Question Introduction: You are a great expert solving vertex-cover question. You should propose a solution to this question.', 'Task description': 'Given an undirected graph G = (V,...
1. **Identify the degree of each vertex**:
2. **Select the vertex with the highest degree**
3. **Remove the edges connected to the selected vertex from the graph**
4. **Repeat the process until all edges are covered**.

Let's apply this to the given graph:... Here's the list of vertices sorted by their degrees: 17 (6), 5 (5), 9 (5), 8 (5), 6 (5), 0 (4), 10 (4), ... Now, let's select the vertices one by one based on their degrees: 1. Select vertex 17
Remove edges connected to vertex 17. ...
Answer: [17, 5, 9, 8, 6, 0, 10, 11, 2, 4, 19, 7, 3, 18, 1, 13, 16, 15, 14]
valid vertex cover with 19 vertices

Qwen2.5-VL-72B-Instruct
To solve the Minimum Vertex Cover problem for the given graph, we need to identify a subset of vertices such that every edge in the graph has at least one endpoint in this subset. W...
