Recognition: unknown
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
Pith reviewed 2026-05-10 04:26 UTC · model grok-4.3
The pith
The OGER framework improves LLM reasoning by integrating offline guidance with an entropy-based exploration reward in hybrid reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing an auxiliary exploration reward from multi-teacher offline trajectories and the model's entropy, OGER incentivizes autonomous exploration in a hybrid offline-online RL setup for LLMs, leading to substantial improvements in mathematical reasoning and robust generalization to out-of-domain tasks.
What carries the argument
The auxiliary exploration reward that leverages both offline trajectories and the model's own entropy for modulation.
If this is right
- LLMs trained with OGER achieve higher scores on mathematical reasoning benchmarks compared to standard RLVR approaches.
- The method maintains strong performance on general reasoning tasks outside the training distribution.
- Multi-teacher collaborative training combined with entropy modulation proves more effective than either alone.
- Training dynamics indicate increased exploration without loss of stability.
Where Pith is reading between the lines
- Similar entropy-guided rewards could enhance exploration in non-reasoning tasks like dialogue or planning.
- Reducing reliance on multiple teachers might be possible by using synthetic offline data.
- Testing on larger models could reveal if the gains scale with model size.
Load-bearing premise
The entropy-aware reward modulation combined with multi-teacher offline guidance reliably promotes useful exploration instead of noise or overfitting.
What would settle it
Running OGER on a math reasoning benchmark and finding it performs no better than or worse than competitive baselines without the exploration reward.
Figures
read the original abstract
Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial latent space. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER, a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER significantly outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at https://github.com/ecoli-hit/OGER.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes OGER, a hybrid RL framework for improving LLM reasoning that integrates multi-teacher offline guidance with an auxiliary exploration reward derived from offline trajectories and the policy's own entropy. It claims that this unified offline-online approach yields substantial gains over competitive baselines on mathematical reasoning benchmarks while preserving robust generalization to out-of-domain tasks, supported by training-dynamics analysis and ablation studies.
Significance. If the empirical results hold under scrutiny, the work could advance hybrid RL methods for verifiable-reward LLM training by offering a concrete mechanism to balance offline teacher signals with autonomous exploration. The public release of code at https://github.com/ecoli-hit/OGER.git is a clear strength that supports reproducibility and further investigation.
major comments (2)
- [§3.2] §3.2 (Auxiliary Reward Construction): The entropy-aware modulation is described at a high level as combining offline trajectories with model entropy, yet the manuscript does not provide an explicit term that penalizes low-utility high-entropy paths or an independent diversity metric decoupled from the reward itself. Without this, the reported gains could arise from increased stochasticity rather than directed exploration, directly undermining the central claim that the reward 'incentivizes autonomous exploration.'
- [§4] §4 (Experiments and Ablations): The out-of-domain generalization and multi-teacher fusion results rest on weighting coefficients and normalization choices whose fitting procedure is not detailed. If these hyperparameters were tuned on the same benchmark suites used for final evaluation, the performance margins may reflect post-hoc optimization rather than an independently predictive reward design, as flagged by the potential circularity in the auxiliary reward.
minor comments (2)
- [Abstract / §1] The abstract and §1 could more explicitly separate the individual contributions of multi-teacher collaborative training versus the entropy modulation to clarify which component drives the reported gains.
- [Figures in §4] Figure captions and training-dynamics plots would benefit from explicit axis labels and error-bar reporting to allow readers to assess the stability of the claimed improvements.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, indicating revisions where the manuscript requires clarification or additional detail.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Auxiliary Reward Construction): The entropy-aware modulation is described at a high level as combining offline trajectories with model entropy, yet the manuscript does not provide an explicit term that penalizes low-utility high-entropy paths or an independent diversity metric decoupled from the reward itself. Without this, the reported gains could arise from increased stochasticity rather than directed exploration, directly undermining the central claim that the reward 'incentivizes autonomous exploration.'
Authors: We appreciate the referee's observation on the need for greater precision in the reward formulation. The auxiliary reward in §3.2 is constructed as r_aux = α · r_offline(τ) + β · H(π_θ), where r_offline is computed from multi-teacher offline trajectories to provide a utility signal and H(π_θ) is the policy entropy. The offline component is intended to anchor exploration to high-utility regions, while entropy modulates the degree of deviation. However, we acknowledge that an explicit penalty term for low-utility high-entropy trajectories is not isolated as a separate diversity metric. The ablation studies in §4.3 demonstrate that removing the offline guidance component degrades performance more than entropy scaling alone, suggesting the gains are not solely from increased stochasticity. In the revised manuscript we will add an explicit mathematical expression for the modulation, include a decoupled diversity metric (e.g., trajectory variance across offline seeds), and report an additional ablation isolating entropy from the offline signal. revision: partial
-
Referee: [§4] §4 (Experiments and Ablations): The out-of-domain generalization and multi-teacher fusion results rest on weighting coefficients and normalization choices whose fitting procedure is not detailed. If these hyperparameters were tuned on the same benchmark suites used for final evaluation, the performance margins may reflect post-hoc optimization rather than an independently predictive reward design, as flagged by the potential circularity in the auxiliary reward.
Authors: We agree that the hyperparameter selection procedure must be fully transparent to rule out circularity. The weighting coefficients (α, β) and normalization constants were selected via grid search on a held-out validation split drawn from the training distribution but disjoint from the reported test and out-of-domain benchmarks. The same fixed values were then used across all experiments. In the revised version we will expand §4.1 and add an appendix table listing the exact search ranges, the validation performance surface, and the final chosen values, together with a statement confirming the validation set was never used for final reporting. revision: yes
Circularity Check
No significant circularity; claims rest on empirical results rather than self-referential derivation
full rationale
The provided abstract and description outline OGER as a proposed framework that constructs an auxiliary exploration reward from offline trajectories and model entropy, then reports empirical gains on benchmarks. No equations, derivation steps, or self-citation chains are supplied that reduce any claimed prediction or result to its own inputs by construction. The performance improvements are presented as outcomes of experiments and ablations, not as logically forced by the reward definition itself. This matches the default case of a standard method proposal whose central claims remain independent of the inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- entropy modulation coefficient
- multi-teacher fusion weights
axioms (2)
- domain assumption Entropy of the policy distribution is a monotonic indicator of exploration value
- domain assumption Offline trajectories provide unbiased guidance for online exploration
Reference graph
Works this paper leans on
-
[1]
Liu, Aixin and Feng, Bei and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Lu, Chengda and Zhao, Chenggang and Deng, Chengqi and Zhang, Chenyu and Ruan, Chong and others. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437. 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Zhang, Ruoyu and Xu, Runxin and Zhu, Qihao and Ma, Shirong and Wang, Peiyi and Bi, Xiao and others. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Yang and others. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
Liu, Runze and Wang, Jiakang and Shi, Yuling and Xie, Zhihui and An, Chenxin and Zhang, Kaiyan and Zhao, Jian and Gu, Xiaodong and Lin, Lei and Hu, Wenping and others. Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models. arXiv preprint arXiv:2509.26628. 2025
-
[7]
Confidence as a Reward: Transforming LLMs into Reward Models
Du, He and Li, Bowen and Xie, Chengxing and Gao, Chang and Chen, Kai and Tao, Dacheng. Confidence as a Reward: Transforming LLMs into Reward Models. arXiv preprint arXiv:2510.13501. 2025
-
[8]
arXiv preprint arXiv:2510.06062 , year=
Wang, Jiakang and Liu, Runze and Lin, Lei and Hu, Wenping and Li, Xiu and Zhang, Fuzheng and Zhou, Guorui and Gai, Kun. ASPO: Asymmetric Importance Sampling Policy Optimization. arXiv preprint arXiv:2510.06062. 2025
-
[9]
Wang, Shenzhi and Yu, Le and Gao, Chang and Zheng, Chujie and Liu, Shixuan and Lu, Rui and Dang, Kai and Chen, Xionghui and Yang, Jianxin and Zhang, Zhenru and others. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning. arXiv preprint arXiv:2506.01939. 2025
work page internal anchor Pith review arXiv 2025
-
[10]
Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning
Xie, Can and Pan, Ruotong and Wu, Xiangyu and Zhang, Yunfei and Fu, Jiayi and Gao, Tingting and Zhou, Guorui. Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning. arXiv preprint arXiv:2510.10649. 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Cui, Ganqu and Zhang, Yuchen and Chen, Jiacheng and Yuan, Lifan and Wang, Zhi and Zuo, Yuxin and Li, Haozhan and Fan, Yuchen and Chen, Huayu and Chen, Weize and others. The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models. arXiv preprint arXiv:2505.22617. 2025
work page internal anchor Pith review arXiv 2025
-
[12]
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yue, Yang and Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Song, Shiji and Huang, Gao. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?. arXiv preprint arXiv:2504.13837. 2025
work page internal anchor Pith review arXiv 2025
-
[13]
Bridging Offline and Online Reinforcement Learning for LLMs
Lanchantin, Jack and Chen, Angelica and Lan, Janice and Li, Xian and Saha, Swarnadeep and Wang, Tianlu and Xu, Jing and Yu, Ping and Yuan, Weizhe and Weston, Jason E and others. Bridging Offline and Online Reinforcement Learning for LLMs. arXiv preprint arXiv:2506.21495. 2025
-
[14]
Sun, Yifan and Shen, Jingyan and Wang, Yibin and Chen, Tianyu and Wang, Zhendong and Zhou, Mingyuan and Zhang, Huan. Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay. arXiv preprint arXiv:2506.05316. 2025
-
[15]
arXiv preprint arXiv:2504.14945 , year =
Yan, Jianhao and Li, Yafu and Hu, Zican and Wang, Zhi and Cui, Ganqu and Qu, Xiaoye and Cheng, Yu and Zhang, Yue. Learning to Reason under Off-Policy Guidance. arXiv preprint arXiv:2504.14945. 2025
-
[16]
Zhang, Wenhao and Xie, Yuexiang and Sun, Yuchang and Chen, Yanxi and Wang, Guoyin and Li, Yaliang and Ding, Bolin and Zhou, Jingren. On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting. arXiv preprint arXiv:2508.11408. 2025
-
[17]
Exgrpo: Learning to reason from experience
Zhan, Runzhe and Li, Yafu and Wang, Zhi and Qu, Xiaoye and Liu, Dongrui and Shao, Jing and Wong, Derek F. and Cheng, Yu. ExGRPO: Learning to Reason from Experience. arXiv preprint arXiv:2510.02245. 2025
-
[18]
Group Sequence Policy Optimization
Zheng, Chujie and Liu, Shixuan and Li, Mingze and Chen, Xiong-Hui and Yu, Bowen and Gao, Chang and Dang, Kai and Liu, Yuqiong and Men, Rui and Yang, An and others. Group sequence policy optimization. arXiv preprint arXiv:2507.18071. 2025
work page internal anchor Pith review arXiv 2025
-
[19]
Feng, Yunzhen and Jain, Parag and Hartshorn, Anthony and Duan, Yaqi and Kempe, Julia. Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting. arXiv preprint arXiv:2510.08696. 2025
-
[20]
Proximal Policy Optimization Algorithms
Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[21]
Process Reinforcement through Implicit Rewards
Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Zhang, Yuchen and Chen, Jiacheng and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and others. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456. 2025
work page internal anchor Pith review arXiv 2025
-
[22]
DAPO : An Open-Source LLM Reinforcement Learning System at Scale
Yu, Qiying and Zhang, Zheng and Zhu, Ruofei and Yuan, Yufeng and Zuo, Xiaochen and YuYue and Dai, Weinan and Fan, Tiantian and Liu, Gaohong and Liu, Juncai and others. DAPO : An Open-Source LLM Reinforcement Learning System at Scale. The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025
2025
-
[23]
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Hu, Jingcheng and Zhang, Yinmin and Han, Qi and Jiang, Daxin and Zhang, Xiangyu and Shum, Heung-Yeung. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290. 2025
work page internal anchor Pith review arXiv 2025
-
[24]
Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach
Zhao, Rosie and Meterez, Alexandru and Kakade, Sham and Pehlevan, Cengiz and Jelassi, Samy and Malach, Eran. Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining. arXiv preprint arXiv:2504.07912. 2025
-
[25]
arXiv preprint arXiv:2510.02230
Nguyen, Phuc Minh and La, Chinh D. and Nguyen, Duy M. H. and Chawla, Nitesh V. and Nguyen, Binh T. and Doan, Khoa D. The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models. arXiv preprint arXiv:2510.02230. 2025
-
[26]
Team, Qwen. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388. 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Team, GLM and Zeng, Aohan and Lv, Xin and Zheng, Qinkai and Hou, Zhenyu and Chen, Bin and Xie, Chengxing and Wang, Cunxiang and Yin, Da and Zeng, Hao and others. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. arXiv preprint arXiv:2508.06471. 2025
work page internal anchor Pith review arXiv 2025
-
[28]
NuminaMath
LI, Jia and Beeching, Edward and Tunstall, Lewis and Lipkin, Ben and Soletskyi, Roman and Huang, Shengyi Costa and Rasul, Kashif and Yu, Longhui and Jiang, Albert and Shen, Ziju and others. NuminaMath. Hugging Face repository. 2024
2024
-
[29]
Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward
Chen, Peter and Li, Xiaopeng and Li, Ziniu and Yin, Wotao and Chen, Xi and Lin, Tianyi. Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward. arXiv preprint arXiv:2512.16912. 2025
-
[30]
Rethinking entropy regularization in large reasoning models.arXiv preprint arXiv:2509.25133, 2025
Jiang, Yuxian and Li, Yafu and Chen, Guanxu and Liu, Dongrui and Cheng, Yu and Shao, Jing. Rethinking Entropy Regularization in Large Reasoning Models. arXiv preprint arXiv:2509.25133. 2025
-
[31]
Skalse, Joar and Howe, Nikolaus H. R. and Krasheninnikov, Dmitrii and Krueger, David. Defining and Characterizing Reward Gaming. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. 2022
2022
-
[32]
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
Yang, Wenkai and Liu, Weijie and Xie, Ruobing and Guo, Yiju and Wu, Lulu and Yang, Saiyong and Lin, Yankai. LaSeR: Reinforcement Learning with Last-Token Self-Rewarding. arXiv preprint arXiv:2510.14943. 2025
-
[33]
Deng, Yihe and Hsu, I-Hung and Yan, Jun and Wang, Zifeng and Han, Rujun and Zhang, Gufeng and Chen, Yanfei and Wang, Wei and Pfister, Tomas and Lee, Chen-Yu. Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning. arXiv preprint arXiv:2510.25992. 2025
-
[34]
Reasoning with exploration: An entropy perspective.arXiv preprint arXiv:2506.14758, 2025
Cheng, Daixuan and Huang, Shaohan and Zhu, Xuekai and Dai, Bo and Zhao, Wayne Xin and Zhang, Zhenliang and Wei, Furu. Reasoning with Exploration: An Entropy Perspective. arXiv preprint arXiv:2506.14758. 2025
-
[35]
Su, Zhenpeng and Pan, Leiyu and Lv, Minxuan and Li, Yuntao and Hu, Wenping and Zhang, Fuzheng and Gai, Kun and Zhou, Guorui. CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning. arXiv preprint arXiv:2509.20712. 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
arXiv preprint arXiv:2510.08141 , year=
Wang, Chen and Li, Zhaochun and Bai, Jionghao and Zhang, Yuzhi and Cui, Shisheng and Zhao, Zhou and Wang, Yue. Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck of Reinforcement Learning. arXiv preprint arXiv:2510.08141. 2025
-
[37]
C-Pack: Packed Resources For General Chinese Embeddings
Xiao, Shitao and Liu, Zheng and Zhang, Peitian and Muennighoff, Niklas. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv preprint arXiv:2309.07597. 2023
work page internal anchor Pith review arXiv 2023
-
[38]
Improving RL Exploration for LLM Reasoning through Retrospective Replay
Dou, Shihan and Wu, Muling and Xu, Jingwen and Zheng, Rui and Gui, Tao and Zhang, Qi and Huang, Xuanjing. Improving RL Exploration for LLM Reasoning through Retrospective Replay. arXiv preprint arXiv:2504.14363. 2025
-
[39]
Lv, Xingtai and Zuo, Yuxin and Sun, Youbang and Liu, Hongyi and Wei, Yuntian and Chen, Zhekai and He, Lixuan and Zhu, Xuekai and Zhang, Kaiyan and Wang, Bingning and others. Towards a Unified View of Large Language Model Post-Training. arXiv preprint arXiv:2509.04419. 2025
-
[40]
Bartoldson, Brian and Venkatraman, Siddarth and Diffenderfer, James and Jain, Moksh and Ben-Nun, Tal and Lee, Seanie and Kim, Minsu and Obando-Ceron, Johan and Bengio, Yoshua and Kailkhura, Bhavya. Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training. arXiv preprint arXiv:2503.18929. 2025
-
[41]
Reasoning with sampling: Your base model is smarter than you think.arXiv preprint arXiv:2510.14901,
Karan, Aayush and Du, Yilun. Reasoning with Sampling: Your Base Model is Smarter Than You Think. arXiv preprint arXiv:2510.14901. 2025
-
[42]
Zhang, Xiaoyun and Yuan, Xiaojian and Huang, Di and You, Wang and Hu, Chen and Ruan, Jingqing and Chen, Kejiang and Hu, Xing. Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning. arXiv preprint arXiv:2510.10959. 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions
Li, Jia and Beeching, Edward and Tunstall, Lewis and Lipkin, Ben and Soletskyi, Roman and Huang, Shengyi and Rasul, Kashif and Yu, Longhui and Jiang, Albert Q and Shen, Ziju and others. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository. 2024
2024
-
[44]
https://proceedings.neurips.cc/paper_files/paper/2022/file/18abbeef8cfe9203fdf9053c9c4fe191-Paper-Conference.pdf
Lewkowycz, Aitor and Andreassen, Anders and Dohan, David and Dyer, Ethan and Michalewski, Henryk and Ramasesh, Vinay and Slone, Ambrose and Anil, Cem and Schlag, Imanol and Gutman-Solo, Theo and Wu, Yuhuai and Neyshabur, Behnam and Gur-Ari, Guy and Misra, Vedant , booktitle =. Solving Quantitative Reasoning Problems with Language Models , url = "https://p...
2022
-
[45]
He, Chaoqun and Luo, Renjie and Bai, Yuzhuo and Hu, Shengding and Thai, Zhen and Shen, Junhao and Hu, Jinyi and Han, Xu and Huang, Yujie and Zhang, Yuxiang and others. O lympiad B ench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. Proceedings of the 62nd Annual Meeting of the Association for Compu...
-
[46]
Measuring Mathematical Problem Solving With the MATH Dataset
Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob. Measuring Mathematical Problem Solving With the MATH Dataset. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual. 2021
2021
-
[47]
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
Yang, An and Zhang, Beichen and Hui, Binyuan and Gao, Bofei and Yu, Bowen and Li, Chengpeng and Liu, Dayiheng and Tu, Jianhong and Zhou, Jingren and Lin, Junyang and others. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. arXiv preprint arXiv:2409.12122. 2024
work page internal anchor Pith review arXiv 2024
-
[48]
arXiv preprint arXiv:2507.06892 (2025) 3
Liang, Jing and Tang, Hongyao and Ma, Yi and Liu, Jinyi and Zheng, Yan and Hu, Shuyue and Bai, Lei and Hao, Jianye. Squeeze the soaked sponge: Efficient off-policy reinforcement finetuning for large language model. arXiv preprint arXiv:2507.06892. 2025
-
[49]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[50]
Gpqa: A graduate-level google-proof q&a benchmark
Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. Gpqa: A graduate-level google-proof q&a benchmark. First Conference on Language Modeling. 2024
2024
-
[51]
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and others. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information ...
2024
-
[52]
2026 , url=
Hexuan Deng and Wenxiang Jiao and Xuebo Liu and Jun Rao and Min Zhang , booktitle=. 2026 , url=
2026
-
[53]
Ke, Xiaopeng and Deng, Hexuan and Liu, Xuebo and Rao, Jun and Song, Zhenxi and Yu, Jun and Zhang, Min. AQ uilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLM s. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.293
-
[54]
DRP runing: Efficient Large Language Model Pruning through Distributionally Robust Optimization
Deng, Hexuan and Jiao, Wenxiang and Liu, Xuebo and Li, Jing and Zhang, Min and Tu, Zhaopeng. DRP runing: Efficient Large Language Model Pruning through Distributionally Robust Optimization. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1414
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.