arxiv: 2604.18530 · v1 · submitted 2026-04-20 · 💻 cs.AI

Recognition: unknown

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

Xinyu Ma , Mingzhou Xu , Xuebo Liu , Chang Jin , Qiang Wang , Derek F. Wong , Min Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:26 UTC · model grok-4.3

classification 💻 cs.AI

keywords hybrid reinforcement learningLLM reasoningoffline guidanceexploration rewardentropy modulationmathematical reasoninggeneralization

0 comments

The pith

The OGER framework improves LLM reasoning by integrating offline guidance with an entropy-based exploration reward in hybrid reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes OGER as a way to help large language models explore new reasoning paths during reinforcement learning training. It unifies offline data from teachers with online learning by designing a special reward that uses the model's entropy to promote exploration. This addresses the problem of models not venturing beyond their starting knowledge. Experiments show gains in math tasks and good performance on other domains. The analysis validates the role of the entropy component in the reward.

Core claim

By constructing an auxiliary exploration reward from multi-teacher offline trajectories and the model's entropy, OGER incentivizes autonomous exploration in a hybrid offline-online RL setup for LLMs, leading to substantial improvements in mathematical reasoning and robust generalization to out-of-domain tasks.

What carries the argument

The auxiliary exploration reward that leverages both offline trajectories and the model's own entropy for modulation.

If this is right

LLMs trained with OGER achieve higher scores on mathematical reasoning benchmarks compared to standard RLVR approaches.
The method maintains strong performance on general reasoning tasks outside the training distribution.
Multi-teacher collaborative training combined with entropy modulation proves more effective than either alone.
Training dynamics indicate increased exploration without loss of stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar entropy-guided rewards could enhance exploration in non-reasoning tasks like dialogue or planning.
Reducing reliance on multiple teachers might be possible by using synthetic offline data.
Testing on larger models could reveal if the gains scale with model size.

Load-bearing premise

The entropy-aware reward modulation combined with multi-teacher offline guidance reliably promotes useful exploration instead of noise or overfitting.

What would settle it

Running OGER on a math reasoning benchmark and finding it performs no better than or worse than competitive baselines without the exploration reward.

Figures

Figures reproduced from arXiv: 2604.18530 by Chang Jin, Derek F. Wong, Mingzhou Xu, Min Zhang, Qiang Wang, Xinyu Ma, Xuebo Liu.

**Figure 1.** Figure 1: The overall architecture of the OGER framework. We first construct a comprehensive, high-quality offline [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Distribution of sequence lengths for trajec [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Comparative analysis of training dynamics across OGER, its variant OGER w/o Refinement, and baselines [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: pass@k performance on AIME 2024 and AIME 2025 using 256 rollouts. Our proposed OGER method consistently outperforms all baselines across various k values, demonstrating a significantly higher convergence rate in solvability coverage. These results demonstrate that our entropyaware reward refinement not only enhances the precision of individual reasoning trajectories but also significantly expands the agg… view at source ↗

**Figure 5.** Figure 5: The evolution of the OGER exploration re [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: The pass@8 performance across different inference temperatures on AIME 2024 and AIME 2025, we illustrate the average score. high-quality reasoning patterns from the offline teacher trajectories. As training progresses into the mid-to-late stages, the model’s intrinsic reasoning proficiency matures, leading to an increase in exploratory signals. The OGER reward subsequently stabilizes at a specific platea… view at source ↗

read the original abstract

Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial latent space. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER, a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER significantly outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at https://github.com/ecoli-hit/OGER.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OGER combines multi-teacher offline trajectories with an entropy-modulated auxiliary reward for hybrid LLM RL, delivering incremental gains on math benchmarks but leaving the actual exploration benefit under-isolated.

read the letter

OGER's core move is to fold offline teacher trajectories and the policy's own entropy into a single auxiliary reward term on top of the verifiable outcome reward. This is meant to push the model toward novel reasoning paths during online RL while staying anchored by the teachers. The abstract and experiments show gains over standard RLVR baselines on math reasoning tasks, with some carryover to out-of-domain checks, plus ablations that separate the offline and entropy pieces and a code release for inspection. That combination and the practical focus are the parts worth noting; prior work has used offline guidance or entropy bonuses separately, so the joint formulation inside one reward model is the concrete extension here. The training dynamics analysis and multi-teacher fusion are also straightforward additions that make the method easier to replicate. The soft spot is the exploration claim itself. The entropy term can be satisfied by any increase in output variance, and nothing in the setup explicitly penalizes high-entropy paths that still lead to wrong answers or collapse back to the teacher distribution. If the offline data already covers most of the test distribution, the reported out-of-domain robustness becomes hard to attribute to the new reward rather than to the teachers. The weighting coefficients between offline guidance and entropy look like free parameters that could be tuned on the same benchmarks used for final numbers. This is a common pattern in these papers and does not invalidate the results, but it does mean the gains are more engineering than independent prediction. Readers working on RL for verifiable LLM reasoning will find the method easy to try and the ablations useful for their own setups. The work is not a big theoretical step, but the empirical package is complete enough on the surface to merit referee time rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes OGER, a hybrid RL framework for improving LLM reasoning that integrates multi-teacher offline guidance with an auxiliary exploration reward derived from offline trajectories and the policy's own entropy. It claims that this unified offline-online approach yields substantial gains over competitive baselines on mathematical reasoning benchmarks while preserving robust generalization to out-of-domain tasks, supported by training-dynamics analysis and ablation studies.

Significance. If the empirical results hold under scrutiny, the work could advance hybrid RL methods for verifiable-reward LLM training by offering a concrete mechanism to balance offline teacher signals with autonomous exploration. The public release of code at https://github.com/ecoli-hit/OGER.git is a clear strength that supports reproducibility and further investigation.

major comments (2)

[§3.2] §3.2 (Auxiliary Reward Construction): The entropy-aware modulation is described at a high level as combining offline trajectories with model entropy, yet the manuscript does not provide an explicit term that penalizes low-utility high-entropy paths or an independent diversity metric decoupled from the reward itself. Without this, the reported gains could arise from increased stochasticity rather than directed exploration, directly undermining the central claim that the reward 'incentivizes autonomous exploration.'
[§4] §4 (Experiments and Ablations): The out-of-domain generalization and multi-teacher fusion results rest on weighting coefficients and normalization choices whose fitting procedure is not detailed. If these hyperparameters were tuned on the same benchmark suites used for final evaluation, the performance margins may reflect post-hoc optimization rather than an independently predictive reward design, as flagged by the potential circularity in the auxiliary reward.

minor comments (2)

[Abstract / §1] The abstract and §1 could more explicitly separate the individual contributions of multi-teacher collaborative training versus the entropy modulation to clarify which component drives the reported gains.
[Figures in §4] Figure captions and training-dynamics plots would benefit from explicit axis labels and error-bar reporting to allow readers to assess the stability of the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, indicating revisions where the manuscript requires clarification or additional detail.

read point-by-point responses

Referee: [§3.2] §3.2 (Auxiliary Reward Construction): The entropy-aware modulation is described at a high level as combining offline trajectories with model entropy, yet the manuscript does not provide an explicit term that penalizes low-utility high-entropy paths or an independent diversity metric decoupled from the reward itself. Without this, the reported gains could arise from increased stochasticity rather than directed exploration, directly undermining the central claim that the reward 'incentivizes autonomous exploration.'

Authors: We appreciate the referee's observation on the need for greater precision in the reward formulation. The auxiliary reward in §3.2 is constructed as r_aux = α · r_offline(τ) + β · H(π_θ), where r_offline is computed from multi-teacher offline trajectories to provide a utility signal and H(π_θ) is the policy entropy. The offline component is intended to anchor exploration to high-utility regions, while entropy modulates the degree of deviation. However, we acknowledge that an explicit penalty term for low-utility high-entropy trajectories is not isolated as a separate diversity metric. The ablation studies in §4.3 demonstrate that removing the offline guidance component degrades performance more than entropy scaling alone, suggesting the gains are not solely from increased stochasticity. In the revised manuscript we will add an explicit mathematical expression for the modulation, include a decoupled diversity metric (e.g., trajectory variance across offline seeds), and report an additional ablation isolating entropy from the offline signal. revision: partial
Referee: [§4] §4 (Experiments and Ablations): The out-of-domain generalization and multi-teacher fusion results rest on weighting coefficients and normalization choices whose fitting procedure is not detailed. If these hyperparameters were tuned on the same benchmark suites used for final evaluation, the performance margins may reflect post-hoc optimization rather than an independently predictive reward design, as flagged by the potential circularity in the auxiliary reward.

Authors: We agree that the hyperparameter selection procedure must be fully transparent to rule out circularity. The weighting coefficients (α, β) and normalization constants were selected via grid search on a held-out validation split drawn from the training distribution but disjoint from the reported test and out-of-domain benchmarks. The same fixed values were then used across all experiments. In the revised version we will expand §4.1 and add an appendix table listing the exact search ranges, the validation performance surface, and the final chosen values, together with a statement confirming the validation set was never used for final reporting. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical results rather than self-referential derivation

full rationale

The provided abstract and description outline OGER as a proposed framework that constructs an auxiliary exploration reward from offline trajectories and model entropy, then reports empirical gains on benchmarks. No equations, derivation steps, or self-citation chains are supplied that reduce any claimed prediction or result to its own inputs by construction. The performance improvements are presented as outcomes of experiments and ablations, not as logically forced by the reward definition itself. This matches the default case of a standard method proposal whose central claims remain independent of the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework assumes standard RL convergence properties and that entropy serves as a reliable proxy for useful exploration; no new physical entities are introduced, but several weighting hyperparameters for the reward components are required.

free parameters (2)

entropy modulation coefficient
Scaling factor that balances the entropy term against the offline teacher reward; must be chosen or tuned.
multi-teacher fusion weights
Parameters controlling how outputs from multiple offline teachers are combined into the guidance signal.

axioms (2)

domain assumption Entropy of the policy distribution is a monotonic indicator of exploration value
Invoked when constructing the auxiliary reward; standard in information-theoretic RL but not proven for LLM reasoning trajectories.
domain assumption Offline trajectories provide unbiased guidance for online exploration
Core premise of the hybrid setup; appears in the unification claim.

pith-pipeline@v0.9.0 · 5489 in / 1361 out tokens · 42879 ms · 2026-05-10T04:26:14.828634+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 43 canonical work pages · 18 internal anchors

[1]

DeepSeek-V3 Technical Report

Liu, Aixin and Feng, Bei and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Lu, Chengda and Zhao, Chenggang and Deng, Chengqi and Zhang, Chenyu and Ruan, Chong and others. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Zhang, Ruoyu and Xu, Runxin and Zhu, Qihao and Ma, Shirong and Wang, Peiyi and Bi, Xiao and others. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Yang and others. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models

Liu, Runze and Wang, Jiakang and Shi, Yuling and Xie, Zhihui and An, Chenxin and Zhang, Kaiyan and Zhao, Jian and Gu, Xiaodong and Lin, Lei and Hu, Wenping and others. Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models. arXiv preprint arXiv:2509.26628. 2025

work page arXiv 2025
[7]

Confidence as a Reward: Transforming LLMs into Reward Models

Du, He and Li, Bowen and Xie, Chengxing and Gao, Chang and Chen, Kai and Tao, Dacheng. Confidence as a Reward: Transforming LLMs into Reward Models. arXiv preprint arXiv:2510.13501. 2025

work page arXiv 2025
[8]

arXiv preprint arXiv:2510.06062 , year=

Wang, Jiakang and Liu, Runze and Lin, Lei and Hu, Wenping and Li, Xiu and Zhang, Fuzheng and Zhou, Guorui and Gai, Kun. ASPO: Asymmetric Importance Sampling Policy Optimization. arXiv preprint arXiv:2510.06062. 2025

work page arXiv 2025
[9]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Wang, Shenzhi and Yu, Le and Gao, Chang and Zheng, Chujie and Liu, Shixuan and Lu, Rui and Dang, Kai and Chen, Xionghui and Yang, Jianxin and Zhang, Zhenru and others. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning. arXiv preprint arXiv:2506.01939. 2025

work page internal anchor Pith review arXiv 2025
[10]

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Xie, Can and Pan, Ruotong and Wu, Xiangyu and Zhang, Yunfei and Fu, Jiayi and Gao, Tingting and Zhou, Guorui. Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning. arXiv preprint arXiv:2510.10649. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Cui, Ganqu and Zhang, Yuchen and Chen, Jiacheng and Yuan, Lifan and Wang, Zhi and Zuo, Yuxin and Li, Haozhan and Fan, Yuchen and Chen, Huayu and Chen, Weize and others. The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models. arXiv preprint arXiv:2505.22617. 2025

work page internal anchor Pith review arXiv 2025
[12]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yue, Yang and Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Song, Shiji and Huang, Gao. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?. arXiv preprint arXiv:2504.13837. 2025

work page internal anchor Pith review arXiv 2025
[13]

Bridging Offline and Online Reinforcement Learning for LLMs

Lanchantin, Jack and Chen, Angelica and Lan, Janice and Li, Xian and Saha, Swarnadeep and Wang, Tianlu and Xu, Jing and Yu, Ping and Yuan, Weizhe and Weston, Jason E and others. Bridging Offline and Online Reinforcement Learning for LLMs. arXiv preprint arXiv:2506.21495. 2025

work page arXiv 2025
[14]

Improving data efficiency for llm reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay, 2026

Sun, Yifan and Shen, Jingyan and Wang, Yibin and Chen, Tianyu and Wang, Zhendong and Zhou, Mingyuan and Zhang, Huan. Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay. arXiv preprint arXiv:2506.05316. 2025

work page arXiv 2025
[15]

arXiv preprint arXiv:2504.14945 , year =

Yan, Jianhao and Li, Yafu and Hu, Zican and Wang, Zhi and Cui, Ganqu and Qu, Xiaoye and Cheng, Yu and Zhang, Yue. Learning to Reason under Off-Policy Guidance. arXiv preprint arXiv:2504.14945. 2025

work page arXiv 2025
[16]

On-policy rl meets off-policy experts: Harmonizing supervised fine- tuning and reinforcement learning via dynamic weighting, 2026

Zhang, Wenhao and Xie, Yuexiang and Sun, Yuchang and Chen, Yanxi and Wang, Guoyin and Li, Yaliang and Ding, Bolin and Zhou, Jingren. On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting. arXiv preprint arXiv:2508.11408. 2025

work page arXiv 2025
[17]

Exgrpo: Learning to reason from experience

Zhan, Runzhe and Li, Yafu and Wang, Zhi and Qu, Xiaoye and Liu, Dongrui and Shao, Jing and Wong, Derek F. and Cheng, Yu. ExGRPO: Learning to Reason from Experience. arXiv preprint arXiv:2510.02245. 2025

work page arXiv 2025
[18]

Group Sequence Policy Optimization

Zheng, Chujie and Liu, Shixuan and Li, Mingze and Chen, Xiong-Hui and Yu, Bowen and Gao, Chang and Dang, Kai and Liu, Yuqiong and Men, Rui and Yang, An and others. Group sequence policy optimization. arXiv preprint arXiv:2507.18071. 2025

work page internal anchor Pith review arXiv 2025
[19]

Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, and Yi Wu

Feng, Yunzhen and Jain, Parag and Hartshorn, Anthony and Duan, Yaqi and Kempe, Julia. Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting. arXiv preprint arXiv:2510.08696. 2025

work page arXiv 2025
[20]

Proximal Policy Optimization Algorithms

Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[21]

Process Reinforcement through Implicit Rewards

Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Zhang, Yuchen and Chen, Jiacheng and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and others. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456. 2025

work page internal anchor Pith review arXiv 2025
[22]

DAPO : An Open-Source LLM Reinforcement Learning System at Scale

Yu, Qiying and Zhang, Zheng and Zhu, Ruofei and Yuan, Yufeng and Zuo, Xiaochen and YuYue and Dai, Weinan and Fan, Tiantian and Liu, Gaohong and Liu, Juncai and others. DAPO : An Open-Source LLM Reinforcement Learning System at Scale. The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025

2025
[23]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Hu, Jingcheng and Zhang, Yinmin and Han, Qi and Jiang, Daxin and Zhang, Xiangyu and Shum, Heung-Yeung. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290. 2025

work page internal anchor Pith review arXiv 2025
[24]

Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach

Zhao, Rosie and Meterez, Alexandru and Kakade, Sham and Pehlevan, Cengiz and Jelassi, Samy and Malach, Eran. Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining. arXiv preprint arXiv:2504.07912. 2025

work page arXiv 2025
[25]

arXiv preprint arXiv:2510.02230

Nguyen, Phuc Minh and La, Chinh D. and Nguyen, Duy M. H. and Chawla, Nitesh V. and Nguyen, Binh T. and Doan, Khoa D. The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models. arXiv preprint arXiv:2510.02230. 2025

work page arXiv 2025
[26]

Qwen3 Technical Report

Team, Qwen. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Team, GLM and Zeng, Aohan and Lv, Xin and Zheng, Qinkai and Hou, Zhenyu and Chen, Bin and Xie, Chengxing and Wang, Cunxiang and Yin, Da and Zeng, Hao and others. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. arXiv preprint arXiv:2508.06471. 2025

work page internal anchor Pith review arXiv 2025
[28]

NuminaMath

LI, Jia and Beeching, Edward and Tunstall, Lewis and Lipkin, Ben and Soletskyi, Roman and Huang, Shengyi Costa and Rasul, Kashif and Yu, Longhui and Jiang, Albert and Shen, Ziju and others. NuminaMath. Hugging Face repository. 2024

2024
[29]

Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward

Chen, Peter and Li, Xiaopeng and Li, Ziniu and Yin, Wotao and Chen, Xi and Lin, Tianyi. Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward. arXiv preprint arXiv:2512.16912. 2025

work page arXiv 2025
[30]

Rethinking entropy regularization in large reasoning models.arXiv preprint arXiv:2509.25133, 2025

Jiang, Yuxian and Li, Yafu and Chen, Guanxu and Liu, Dongrui and Cheng, Yu and Shao, Jing. Rethinking Entropy Regularization in Large Reasoning Models. arXiv preprint arXiv:2509.25133. 2025

work page arXiv 2025
[31]

Skalse, Joar and Howe, Nikolaus H. R. and Krasheninnikov, Dmitrii and Krueger, David. Defining and Characterizing Reward Gaming. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. 2022

2022
[32]

LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Yang, Wenkai and Liu, Weijie and Xie, Ruobing and Guo, Yiju and Wu, Lulu and Yang, Saiyong and Lin, Yankai. LaSeR: Reinforcement Learning with Last-Token Self-Rewarding. arXiv preprint arXiv:2510.14943. 2025

work page arXiv 2025
[33]

Supervised reinforcement learning: From expert trajectories to step-wise reasoning.CoRR, abs/2510.25992, 2025

Deng, Yihe and Hsu, I-Hung and Yan, Jun and Wang, Zifeng and Han, Rujun and Zhang, Gufeng and Chen, Yanfei and Wang, Wei and Pfister, Tomas and Lee, Chen-Yu. Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning. arXiv preprint arXiv:2510.25992. 2025

work page arXiv 2025
[34]

Reasoning with exploration: An entropy perspective.arXiv preprint arXiv:2506.14758, 2025

Cheng, Daixuan and Huang, Shaohan and Zhu, Xuekai and Dai, Bo and Zhao, Wayne Xin and Zhang, Zhenliang and Wei, Furu. Reasoning with Exploration: An Entropy Perspective. arXiv preprint arXiv:2506.14758. 2025

work page arXiv 2025
[35]

CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Su, Zhenpeng and Pan, Leiyu and Lv, Minxuan and Li, Yuntao and Hu, Wenping and Zhang, Fuzheng and Gai, Kun and Zhou, Guorui. CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning. arXiv preprint arXiv:2509.20712. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

arXiv preprint arXiv:2510.08141 , year=

Wang, Chen and Li, Zhaochun and Bai, Jionghao and Zhang, Yuzhi and Cui, Shisheng and Zhao, Zhou and Wang, Yue. Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck of Reinforcement Learning. arXiv preprint arXiv:2510.08141. 2025

work page arXiv 2025
[37]

C-Pack: Packed Resources For General Chinese Embeddings

Xiao, Shitao and Liu, Zheng and Zhang, Peitian and Muennighoff, Niklas. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv preprint arXiv:2309.07597. 2023

work page internal anchor Pith review arXiv 2023
[38]

Improving RL Exploration for LLM Reasoning through Retrospective Replay

Dou, Shihan and Wu, Muling and Xu, Jingwen and Zheng, Rui and Gui, Tao and Zhang, Qi and Huang, Xuanjing. Improving RL Exploration for LLM Reasoning through Retrospective Replay. arXiv preprint arXiv:2504.14363. 2025

work page arXiv 2025
[39]

Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, and Wentao Zhang

Lv, Xingtai and Zuo, Yuxin and Sun, Youbang and Liu, Hongyi and Wei, Yuntian and Chen, Zhekai and He, Lixuan and Zhu, Xuekai and Zhang, Kaiyan and Wang, Bingning and others. Towards a Unified View of Large Language Model Post-Training. arXiv preprint arXiv:2509.04419. 2025

work page arXiv 2025
[40]

Trajectory balance with asynchrony: Decoupling exploration and learning for fast, scalable llm post-training

Bartoldson, Brian and Venkatraman, Siddarth and Diffenderfer, James and Jain, Moksh and Ben-Nun, Tal and Lee, Seanie and Kim, Minsu and Obando-Ceron, Johan and Bengio, Yoshua and Kailkhura, Bhavya. Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training. arXiv preprint arXiv:2503.18929. 2025

work page arXiv 2025
[41]

Reasoning with sampling: Your base model is smarter than you think.arXiv preprint arXiv:2510.14901,

Karan, Aayush and Du, Yilun. Reasoning with Sampling: Your Base Model is Smarter Than You Think. arXiv preprint arXiv:2510.14901. 2025

work page arXiv 2025
[42]

Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Zhang, Xiaoyun and Yuan, Xiaojian and Huang, Di and You, Wang and Hu, Chen and Ruan, Jingqing and Chen, Kejiang and Hu, Xing. Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning. arXiv preprint arXiv:2510.10959. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

Li, Jia and Beeching, Edward and Tunstall, Lewis and Lipkin, Ben and Soletskyi, Roman and Huang, Shengyi and Rasul, Kashif and Yu, Longhui and Jiang, Albert Q and Shen, Ziju and others. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository. 2024

2024
[44]

https://proceedings.neurips.cc/paper_files/paper/2022/file/18abbeef8cfe9203fdf9053c9c4fe191-Paper-Conference.pdf

Lewkowycz, Aitor and Andreassen, Anders and Dohan, David and Dyer, Ethan and Michalewski, Henryk and Ramasesh, Vinay and Slone, Ambrose and Anil, Cem and Schlag, Imanol and Gutman-Solo, Theo and Wu, Yuhuai and Neyshabur, Behnam and Gur-Ari, Guy and Misra, Vedant , booktitle =. Solving Quantitative Reasoning Problems with Language Models , url = "https://p...

2022
[45]

Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

He, Chaoqun and Luo, Renjie and Bai, Yuzhuo and Hu, Shengding and Thai, Zhen and Shen, Junhao and Hu, Jinyi and Han, Xu and Huang, Yujie and Zhang, Yuxiang and others. O lympiad B ench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. Proceedings of the 62nd Annual Meeting of the Association for Compu...

work page doi:10.18653/v1/2024.acl-long.211 2024
[46]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob. Measuring Mathematical Problem Solving With the MATH Dataset. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual. 2021

2021
[47]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

Yang, An and Zhang, Beichen and Hui, Binyuan and Gao, Bofei and Yu, Bowen and Li, Chengpeng and Liu, Dayiheng and Tu, Jianhong and Zhou, Jingren and Lin, Junyang and others. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. arXiv preprint arXiv:2409.12122. 2024

work page internal anchor Pith review arXiv 2024
[48]

arXiv preprint arXiv:2507.06892 (2025) 3

Liang, Jing and Tang, Hongyao and Ma, Yi and Liu, Jinyi and Zheng, Yan and Hu, Shuyue and Bai, Lei and Hao, Jianye. Squeeze the soaked sponge: Efficient off-policy reinforcement finetuning for large language model. arXiv preprint arXiv:2507.06892. 2025

work page arXiv 2025
[49]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[50]

Gpqa: A graduate-level google-proof q&a benchmark

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. Gpqa: A graduate-level google-proof q&a benchmark. First Conference on Language Modeling. 2024

2024
[51]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and others. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information ...

2024
[52]

2026 , url=

Hexuan Deng and Wenxiang Jiao and Xuebo Liu and Jun Rao and Min Zhang , booktitle=. 2026 , url=

2026
[53]

Kargaran, A

Ke, Xiaopeng and Deng, Hexuan and Liu, Xuebo and Rao, Jun and Song, Zhenxi and Yu, Jun and Zhang, Min. AQ uilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLM s. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.293

work page doi:10.18653/v1/2025.emnlp-main.293 2025
[54]

DRP runing: Efficient Large Language Model Pruning through Distributionally Robust Optimization

Deng, Hexuan and Jiao, Wenxiang and Liu, Xuebo and Li, Jing and Zhang, Min and Tu, Zhaopeng. DRP runing: Efficient Large Language Model Pruning through Distributionally Robust Optimization. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1414

work page doi:10.18653/v1/2025.acl-long.1414 2025