From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space
Pith reviewed 2026-05-10 12:47 UTC · model grok-4.3
The pith
Reinforcement learning applied to the pre-training marginal distribution P(y) serves as a viable surrogate for standard RL on P(y|x) via strong gradient alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that PreRL applies reward-driven online updates directly to P(y) and that the strong gradient alignment between log P(y) and log P(y|x) makes it a viable surrogate for standard RLVR. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition thoughts by 14.89x and reflection thoughts by 6.54x. This enables DSRL, a policy reincarnation strategy that initializes with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization, consistently outperforming baselines by steering toward a refined correct reasoning subspace.
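As a concrete, heavily simplified illustration of the two-stage schedule, the sketch below runs NSR-PreRL-style pruning on a toy tabular marginal and then standard reward-driven RL on toy conditionals. The model, reward, output space, and hyperparameters are all invented for illustration; only the shape of the schedule mirrors the paper's description.

```python
import numpy as np

# Toy two-phase sketch of the DSRL schedule (invented stand-in, not the
# paper's implementation): 6 candidate outputs, 2 prompts, tabular logits.
rng = np.random.default_rng(1)
K = 6
marginal = np.zeros(K)                       # "pre-train space" logits for P(y)
heads = {0: np.zeros(K), 1: np.zeros(K)}     # per-prompt conditional logits
answer = {0: 2, 1: 3}                        # verifier-accepted output per prompt
plausible = {2, 3}                           # outputs correct for some prompt

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Phase 1 (NSR-PreRL): sample y ~ P(y) with no prompt; apply REINFORCE
# with reward -1 to verifier-rejected samples only, pruning the
# incorrect region of the marginal.
for _ in range(500):
    p = softmax(marginal)
    y = rng.choice(K, p=p)
    if y not in plausible:
        marginal += 0.3 * (p - np.eye(K)[y])  # reward -1 times grad log P(y)

# Phase 2 (standard RL): fine-grained conditional optimization on
# P(y|x) = softmax(marginal + heads[x]) with a binary reward.
for _ in range(300):
    for x, target in answer.items():
        p = softmax(marginal + heads[x])
        y = rng.choice(K, p=p)
        if y == target:                       # reward 1, else 0 (no update)
            heads[x] += 0.3 * (np.eye(K)[y] - p)
```

After phase 1 the marginal mass sits almost entirely on the plausible set, and phase 2 then sharpens each conditional onto its own answer; the intended takeaway is the schedule's structure, not any quantitative claim.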
What carries the argument
PreRL applies reward-driven online updates directly to the marginal P(y); its viability rests on the gradient alignment between log P(y) and log P(y|x), with NSR serving as the driver that prunes incorrect reasoning spaces.
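The alignment plausibly rests on the standard marginalization identity; a sketch of one likely derivation (assumed here, not quoted from the paper), writing $P_\theta$ for the model distribution:

```latex
\nabla_\theta \log P_\theta(y)
  = \frac{\sum_x P(x)\,\nabla_\theta P_\theta(y \mid x)}{P_\theta(y)}
  = \sum_x \frac{P(x)\,P_\theta(y \mid x)}{P_\theta(y)}\,\nabla_\theta \log P_\theta(y \mid x)
  = \mathbb{E}_{x \sim P_\theta(x \mid y)}\!\left[\nabla_\theta \log P_\theta(y \mid x)\right].
```

On this reading, the marginal gradient is a posterior-weighted average of conditional gradients, so alignment should be strongest when $P_\theta(x \mid y)$ concentrates on prompts to which $y$ is a plausible response.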
If this is right
- PreRL functions as a direct surrogate for standard conditional RL without being limited by the base model's existing output distribution.
- NSR-PreRL increases transition thoughts by 14.89x and reflection thoughts by 6.54x while pruning incorrect reasoning spaces.
- DSRL, by first applying NSR-PreRL then standard RL, steers the policy into a refined correct reasoning subspace and outperforms strong baselines.
- Pre-train space optimization addresses the fundamental bottleneck where RLVR is bounded by the base model's output distribution.
Where Pith is reading between the lines
- The same pruning mechanism could be applied to other generative tasks to reduce exploration of low-value outputs early in training.
- Starting with marginal-space updates may help retain broad capabilities longer before conditional specialization.
- The approach suggests pre-training itself could incorporate targeted reward signals to produce more capable starting models.
Load-bearing premise
The gradient alignment between log P(y) and log P(y|x) remains strong enough under realistic pre-training data and reward signals to serve as a surrogate without introducing harmful distribution shift or forgetting of general capabilities.
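This premise is, in principle, directly measurable. A toy check (a 3-prompt, 4-output tabular model standing in for an LLM, which is an assumption of this sketch) computes the marginal gradient via the posterior-weighted identity and compares it to a single conditional gradient:

```python
import numpy as np

# Toy measurement of gradient alignment between log P(y) and log P(y|x)
# on a tabular model (illustrative assumption, not the paper's setup).
theta = np.array([[3.0, 0.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0, 0.0],
                  [0.0, 0.0, 3.0, 0.0]])   # theta[x] parameterizes P(y|x)
prior = np.full(3, 1.0 / 3.0)              # uniform P(x)

def cond(theta):                           # P(y|x) for every x, shape (3, 4)
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def log_marginal(theta, y):                # log P(y) = log sum_x P(x) P(y|x)
    return np.log(prior @ cond(theta)[:, y])

def grad_log_cond(theta, x, y):            # grad of log P(y|x) w.r.t. theta
    g = np.zeros_like(theta)
    g[x] = np.eye(4)[y] - cond(theta)[x]
    return g

def grad_log_marginal(theta, y):
    # identity: grad log P(y) = E_{x ~ P(x|y)}[grad log P(y|x)]
    post = prior * cond(theta)[:, y]
    post /= post.sum()                     # Bayes posterior P(x|y)
    return sum(post[x] * grad_log_cond(theta, x, y) for x in range(3))

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

For y = 0 the posterior concentrates on x = 0, so the two gradients align strongly; degrading the concentration (flatter theta) is exactly the kind of robustness probe the premise needs.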
What would settle it
An experiment in which PreRL or NSR-PreRL produces lower final reasoning accuracy or measurable degradation on unrelated general capabilities tasks compared to standard RLVR on identical base models and rewards.
Original abstract
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PreRL, which performs reward-driven RL directly on the marginal distribution P(y) in pre-train space rather than the conditional P(y|x) used in standard RLVR. It claims to theoretically and empirically establish strong gradient alignment between log P(y) and log P(y|x), making PreRL a viable surrogate; introduces Negative Sample Reinforcement (NSR) that prunes incorrect reasoning spaces and increases transition and reflection thoughts by 14.89x and 6.54x; and proposes Dual Space RL (DSRL), a reincarnation strategy that initializes with NSR-PreRL before switching to standard RL, yielding consistent gains over baselines on reasoning tasks.
Significance. If the gradient alignment holds robustly under realistic pre-training distributions and sparse reasoning rewards, and if the reported thought-process gains prove reproducible, this work offers a meaningful new direction for expanding LLM reasoning capacity beyond the limits of the base model's conditional output distribution. The identification of NSR as a driver of endogenous reflection is a concrete mechanistic insight. Credit is due for attempting a theoretical derivation of the alignment and for the DSRL policy-reincarnation idea, both of which could influence subsequent RL-for-LLM research if the supporting evidence is strengthened.
major comments (2)
- [§3 (Theoretical Analysis)] The central claim that PreRL is a viable surrogate rests on the asserted strong gradient alignment between ∇ log P(y) and ∇ log P(y|x). The derivation implicitly assumes that the pre-training marginal over x remains representative and that the reward r(y) does not induce strong x-dependence; yet reasoning rewards are typically sparse, binary, and conditioned on narrow (x, y) pairs. Without the explicit assumptions, the exact reward formulation, and a robustness check showing that alignment does not degrade under these conditions, it is impossible to rule out that the alignment is partly tautological with the surrogate objective or that harmful distribution shift occurs.
- [§5.2–5.3 (Empirical Validation and Ablations)] The 14.89x and 6.54x increases in transition and reflection thoughts are load-bearing for the NSR mechanism claim. These figures must be accompanied by the precise definition of “transition” and “reflection” thoughts, the full set of baselines (including standard RLVR without NSR), and controls for post-hoc selection or prompt sensitivity. The DSRL results similarly require an ablation isolating the contribution of the NSR-PreRL initialization phase versus the subsequent fine-grained RL stage.
minor comments (3)
- [§2] Notation: The distinction between P(y) (marginal) and P(y|x) (conditional) is introduced clearly in the abstract but should be restated with explicit probability expressions at the beginning of §2 to avoid any ambiguity for readers unfamiliar with the pre-train-space framing.
- [Figure 4] Figure clarity: The plots showing thought-type counts (transition/reflection) would benefit from error bars across multiple random seeds and an explicit statement of the number of evaluation samples per condition.
- [§1.2] Related work: The discussion of prior RLVR methods (e.g., those optimizing P(y|x) directly) is brief; adding one or two sentences contrasting the gradient-alignment approach with existing surrogate-objective literature would strengthen context.
Simulated Author's Rebuttal
We are grateful to the referee for providing a thorough review and insightful comments that have helped us improve the clarity and rigor of our work. Below, we address each major comment in detail. We have made revisions to the manuscript to incorporate the suggested clarifications and additional analyses where feasible.
read point-by-point responses
-
Referee: [§3 (Theoretical Analysis)] The central claim that PreRL is a viable surrogate rests on the asserted strong gradient alignment between ∇ log P(y) and ∇ log P(y|x). The derivation implicitly assumes that the pre-training marginal over x remains representative and that the reward r(y) does not induce strong x-dependence; yet reasoning rewards are typically sparse, binary, and conditioned on narrow (x, y) pairs. Without the explicit assumptions, the exact reward formulation, and a robustness check showing that alignment does not degrade under these conditions, it is impossible to rule out that the alignment is partly tautological with the surrogate objective or that harmful distribution shift occurs.
Authors: We acknowledge the referee's concern about the implicit assumptions in our theoretical analysis. In the revised manuscript, we have explicitly listed the assumptions in a new paragraph in §3: namely, that the pre-training marginal distribution over x is representative for the downstream tasks, and that the reward function r(y) depends primarily on the quality of y rather than specific x-y interactions for the reasoning problems considered. We have also provided the exact reward formulation, which is a binary verifiable reward based on the correctness of the final answer for mathematical and coding tasks. Regarding the robustness check, we have added an analysis in the appendix demonstrating that the gradient alignment remains strong even under increased reward sparsity in our experimental settings. We agree that this strengthens the claim that PreRL serves as a viable surrogate without harmful distribution shift. revision: yes
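The binary verifiable reward described in this response can be sketched as follows; `extract_final_answer` is a hypothetical parser, since the paper's exact answer-matching procedure is not reproduced in this review:

```python
import re

def extract_final_answer(completion: str) -> str:
    # Hypothetical parser: take the content of the last \boxed{...},
    # falling back to the last number in the text.
    boxed = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if boxed:
        return boxed[-1].strip()
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return nums[-1] if nums else ""

def verifiable_reward(completion: str, gold: str) -> float:
    # Binary verifiable reward: 1.0 iff the final answer matches the reference.
    return 1.0 if extract_final_answer(completion) == gold else 0.0
```

Real verifiers normalize answers far more carefully (equivalent fractions, units, whitespace); this sketch conveys only the binary, correctness-gated shape of the signal.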
-
Referee: [§5.2–5.3 (Empirical Validation and Ablations)] The 14.89x and 6.54x increases in transition and reflection thoughts are load-bearing for the NSR mechanism claim. These figures must be accompanied by the precise definition of “transition” and “reflection” thoughts, the full set of baselines (including standard RLVR without NSR), and controls for post-hoc selection or prompt sensitivity. The DSRL results similarly require an ablation isolating the contribution of the NSR-PreRL initialization phase versus the subsequent fine-grained RL stage.
Authors: We appreciate this feedback on the empirical sections. In the revised manuscript, we have included precise definitions: 'transition thoughts' refer to intermediate reasoning steps that mark a shift from an incorrect path to a correct one, and 'reflection thoughts' are those involving explicit reconsideration or self-correction of prior steps. We now report the full set of baselines, including standard RLVR without NSR, in Table 2 and Figure 3. To address potential post-hoc selection and prompt sensitivity, we have added results averaged over 5 different prompts and 3 random seeds, with standard deviations. Additionally, we have included a new ablation study for DSRL in §5.3 that isolates the effect of the NSR-PreRL initialization phase by comparing it to direct standard RL and to a version without the reincarnation step. These changes clarify the contribution of each component. revision: yes
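Definitions of this kind lend themselves to a simple lexical counter; the marker lists below are hypothetical stand-ins (the paper's actual thought classifier is not shown in this review), illustrating only how such ratios might be computed:

```python
import re

# Illustrative sketch only: hypothetical lexical cues for the two thought
# types defined above. A naive substring count like this would also need
# word-boundary handling and human validation in a real evaluation.
TRANSITION_MARKERS = ["alternatively", "instead", "let's try another", "switch to"]
REFLECTION_MARKERS = ["wait", "let me double-check", "i made an error", "re-examine"]

def count_thoughts(trace: str, markers: list[str]) -> int:
    # Count occurrences of any marker in a (lowercased) reasoning trace.
    text = trace.lower()
    return sum(len(re.findall(re.escape(m), text)) for m in markers)

def thought_ratio(after_trace: str, before_trace: str, markers: list[str]) -> float:
    # Ratio of marker counts after vs. before training; the paper reports
    # 14.89x (transition) and 6.54x (reflection) for NSR-PreRL.
    before = count_thoughts(before_trace, markers)
    return count_thoughts(after_trace, markers) / max(before, 1)
```

The referee's point stands independently of any particular implementation: without the exact marker definitions and seed-averaged counts, the reported multipliers are hard to audit.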
Circularity Check
No significant circularity; derivation chain remains self-contained
full rationale
The paper asserts a theoretical and empirical validation of gradient alignment between log P(y) and log P(y|x) to position PreRL as a surrogate for standard RL, followed by NSR-PreRL pruning and DSRL reincarnation. No equations, self-citations, or derivations are exhibited in the provided sections that reduce the alignment claim to a fitted input, self-definition, or prior author result by construction. The central premise draws on independent empirical observations of behavior changes (e.g., 14.89x increase in transition thoughts) and the proposed dual-space strategy, which do not collapse back into the alignment statement itself. Absent explicit load-bearing reductions or ansatz smuggling, the chain does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Strong gradient alignment exists between log P(y) and log P(y|x) under the chosen reward model
invented entities (1)
- Negative Sample Reinforcement (NSR): no independent evidence
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. URL https://arxiv.org/abs/2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Reincarnating reinforcement learning: Reusing prior computation to accelerate progress
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Advances in neural information processing systems, 35: 0 28955--28971, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/ba1c5356d9164bb64c446a4b690226b0-Abst...
2022
-
[3]
Back to basics: Revisiting REINFORCE -style optimization for learning from human feedback in LLM s
Arash Ahmadian, Chris Cremer, Matthias Gall \'e , Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet \"U st \"u n, and Sara Hooker. Back to basics: Revisiting REINFORCE -style optimization for learning from human feedback in LLM s. In Proceedings of ACL, pp.\ 12248--12267, 2024. URL https://aclanthology.org/2024.acl-long.662/
2024
-
[4]
Marcin Andrychowicz, Anton Raichuk, Piotr Sta \'n czyk, Manu Orsini, Sertan Girgin, Raphael Marinier, L \'e onard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? a large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020. URL https://arxiv.org/abs/2006.05990
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418...
1901
-
[6]
Vi-curl: Stabilizing verifier-independent rl reasoning via confidence-guided variance reduction
Xin-Qiang Cai and Masashi Sugiyama. Vi-curl: Stabilizing verifier-independent rl reasoning via confidence-guided variance reduction. arXiv preprint arXiv:2602.12579, 2026. URL https://arxiv.org/abs/2602.12579
-
[7]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URL https://arxiv.org/abs/2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
Seal: Steer- able reasoning calibration of large language models for free
Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, and Zhangyang Wang. Seal: Steerable reasoning calibration of large language models for free. arXiv preprint arXiv:2504.07986, 2025 a . URL https://arxiv.org/abs/2504.07986
-
[9]
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024. URL https://arxiv.org/abs/2412.21187
work page internal anchor Pith review arXiv 2024
-
[10]
Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025 b . URL https://arxiv.org/abs/2508.10751
-
[11]
Continual pre-training mitigates forgetting in language and vision
Andrea Cossu, Antonio Carta, Lucia Passaro, Vincenzo Lomonaco, Tinne Tuytelaars, and Davide Bacciu. Continual pre-training mitigates forgetting in language and vision. Neural Networks, 179: 0 106492, 2024. URL https://www.sciencedirect.com/science/article/pii/S0893608024004167
2024
-
[12]
Gemini 2.0 flash thinking, 2024
Google DeepMind. Gemini 2.0 flash thinking, 2024. URL https://deepmind.google/technologies/gemini/flash-thinking/
2024
-
[13]
Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, and Furu Wei. Reinforcement pre-training. arXiv preprint arXiv:2506.08007, 2025. URL https://arxiv.org/abs/2506.08007
-
[14]
Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu, Lu Pan, Ke Zeng, and Xunliang Cai. How to allocate, how to learn? dynamic rollout allocation and advantage modulation for policy optimization. arXiv preprint arXiv:2602.19208, 2026 a . URL https://arxiv.org/abs/2602.19208
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[15]
Junbo Li, Peng Zhou, Rui Meng, Meet P
Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, and Peilin Zhao. Proximity-based multi-turn optimization: Practical credit assignment for llm agent training. arXiv preprint arXiv:2602.19225, 2026 b . URL https://arxiv.org/abs/2602.19225
-
[16]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645 0 (8081): 0 633--638, 2025 a . URL https://www.nature.com/articles/s41586-025-09422-z
2025
-
[17]
Tree-based dialogue reinforced policy optimization for red-teaming attacks
Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, and Dan Roth. Tree-based dialogue reinforced policy optimization for red-teaming attacks. arXiv preprint arXiv:2510.02286, 2025 b . URL https://arxiv.org/abs/2510.02286
-
[18]
Richter, Quentin An- thony, Eugene Belilovsky, Irina Rish, and Timothée Lesort
Benjamin Gupta, Kshitij ou2025llmsand Th \'e rien, Adam Ibrahim, Mats L Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timoth \'e e Lesort. Continual pre-training of large language models: How to (re) warm your model? arXiv preprint arXiv:2308.04014, 2023. URL https://arxiv.org/abs/2308.04014
-
[19]
RLP: Reinforcement as a pretraining objective
Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. Rlp: Reinforcement as a pretraining objective. arXiv preprint arXiv:2510.01265, 2025. URL https://arxiv.org/abs/2510.01265
-
[20]
Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...
2024
-
[21]
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025. URL https://arxiv.org/abs/2501.03262
work page internal anchor Pith review arXiv 2025
-
[22]
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025 a . URL https://arxiv.org/abs/2503.24290
work page internal anchor Pith review arXiv 2025
-
[23]
Test-time learning for large language models.arXiv preprint arXiv:2505.20633, 2025
Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. arXiv preprint arXiv:2505.20633, 2025 b . URL https://arxiv.org/abs/2505.20633
-
[24]
Remit: Rl-guided mid-training for iterative llm evolution
Junjie Huang, Jiarui Qin, Di Yin, Weiwen Liu, Yong Yu, Xing Sun, and Weinan Zhang. Remit: Rl-guided mid-training for iterative llm evolution. arXiv preprint arXiv:2602.03075, 2026. URL https://arxiv.org/abs/2602.03075
-
[25]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. URL https://arxiv.org/abs/2410.21276
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23)
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of SOSP, pp.\ 611--626, 2023. URL https://dl.acm.org/doi/abs/10.1145/3600006.3613165
-
[27]
Solving quantitative reasoning problems with language models
Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=IFXTZERXdM7
2022
-
[28]
Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, et al. The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. arXiv preprint arXiv:2509.07430, 2025 a . URL https://arxiv.org/abs/2509.07430
-
[29]
arXiv preprint arXiv:2509.19249 (2025) 7
Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, et al. Reinforcement learning on pre-training data. arXiv preprint arXiv:2509.19249, 2025 b . URL https://arxiv.org/abs/2509.19249
-
[30]
arXiv preprint arXiv:2507.06892 (2025) 3
Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, and Jianye Hao. Squeeze the soaked sponge: Efficient off-policy reinforcement finetuning for large language model. arXiv preprint arXiv:2507.06892, 2025. URL https://arxiv.org/abs/2507.06892
-
[31]
Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Jun Zhao, Kun Xu, and Kang Liu. Resadapt: Adaptive resolution for efficient multimodal reasoning. arXiv preprint arXiv:2603.28610, 2026. URL https://arxiv.org/abs/2603.28610
-
[32]
Let's verify step by step
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In Proceedings of ICLR, 2023. URL https://openreview.net/forum?id=v8L0pN6EOi
2023
-
[33]
Qfft, question-free fine-tuning for adaptive reasoning
Wanlong Liu, Junxiao Xu, Fei Yu, Yukang Lin, Ke Ji, Wenyu Chen, Lifeng Shang, Yasheng Wang, Yan Xu, and Benyou Wang. Qfft, question-free fine-tuning for adaptive reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025 a . URL https://openreview.net/forum?id=CrBWOjZoKc
2025
-
[34]
Automated optimization modeling via a localizable error-driven perspective
Weiting Liu, Han Wu, Yufei Kuang, Xiongwei Han, Tao Zhong, Jianfeng Feng, and Wenlian Lu. Automated optimization modeling via a localizable error-driven perspective. arXiv preprint arXiv:2602.11164, 2026. URL https://arxiv.org/abs/2602.11164
-
[35]
Understanding r1-zero-like training: A critical perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Proceedings of COLM, 2025 b . URL https://openreview.net/forum?id=5PAF7PAY2Y
2025
-
[36]
Reasoning models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025
Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. arXiv preprint arXiv:2504.09858, 2025. URL https://arxiv.org/abs/2504.09858
-
[37]
American mathematics contest 12 (amc 12), November 2023
MAA . American mathematics contest 12 (amc 12), November 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions
2023
-
[38]
American invitational mathematics examination (aime), February 2024
MAA . American invitational mathematics examination (aime), February 2024. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
2024
-
[39]
American invitational mathematics examination (aime), February 2025
MAA . American invitational mathematics examination (aime), February 2025. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
2025
-
[40]
How do llms acquire new knowledge? a knowledge circuits perspective on continual pre-training
Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, and Huajun Chen. How do llms acquire new knowledge? a knowledge circuits perspective on continual pre-training. In Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 19889--19913, 2025. URL https://aclanthology.org/2025.findings-acl.1021/
2025
-
[41]
Openwebmath: An open dataset of high-quality mathematical web text
Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. Openwebmath: An open dataset of high-quality mathematical web text. In The Twelfth International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=jKHmjlpViu
2023
-
[42]
Simko: Simple pass@ k policy optimization.arXiv preprint arXiv:2510.14807, 2025
Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, and Yandong Wen. Simko: Simple pass@ k policy optimization. arXiv preprint arXiv:2510.14807, 2025. URL https://arxiv.org/abs/2510.14807
-
[43]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019. URL https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf
2019
-
[44]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020. URL http://www.jmlr.org/papers/v21/20-074.html
2020
-
[45]
Gpqa: A graduate-level google-proof q&a benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First conference on language modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98&utm_campaign=The
2024
-
[46]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015. URL https://arxiv.org/abs/1506.02438
work page internal anchor Pith review arXiv 2015
-
[47]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/abs/1707.06347
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[48]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp.\ 1279--1297, 2025. URL https://dl.acm.org/doi/abs/10.1145/3689031.3696075
-
[50]
Scaling agents via continual pre-training.arXiv preprint arXiv:2509.13310, 2025
Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, et al. Scaling agents via continual pre-training. arXiv preprint arXiv:2509.13310, 2025. URL https://arxiv.org/abs/2509.13310
-
[51]
Ernie 2.0: A continual pre-training framework for language understanding
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. Ernie 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 8968--8975, 2020. URL https://ojs.aaai.org/index.php/aaai/article/view/6428
2020
-
[52]
Reinforcement learning: An introduction, volume 1
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998
1998
-
[53]
Challenging big-bench tasks and whether chain-of-thought can solve them
Mirac Suzgun, Nathan Scales, Nathanael Sch \"a rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp.\ 13003--13051, 2023. URL https://aclanthology.org/2023...
2023
-
[54]
The zero-step thinking: An empirical study of mode selection as harder early exit in reasoning models
Yuqiao Tan, Shizhu He, Kang Liu, and Jun Zhao. The zero-step thinking: An empirical study of mode selection as harder early exit in reasoning models. In NeurIPS 2025 Workshop on Efficient Reasoning, 2025 a . URL https://openreview.net/forum?id=CPXmurtK0H
2025
-
[55]
Bottom-up policy optimization: Your language model policy secretly contains internal policies
Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, and Kang Liu. Bottom-up policy optimization: Your language model policy secretly contains internal policies. arXiv preprint arXiv:2512.19673, 2025 b . URL https://arxiv.org/abs/2512.19673
-
[56]
Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026. URL https://arxiv.org/abs/2602.02276
work page internal anchor Pith review arXiv 2026
-
[57] Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, and Wenji Mao. Adaptive social learning via mode policy optimization for language agents. In The Fourteenth International Conference on Learning Representations, 2026a. URL https://openreview.net/forum?id=GG7YQnsdhp
[58] Tianyi Wang, Long Li, Hongcan Guo, Yibiao Chen, Yixia Li, Yong Wang, Yun Chen, and Guanhua Chen. Anchored policy optimization: Mitigating exploration collapse via support-constrained rectification. arXiv preprint arXiv:2602.05717, 2026b. URL https://arxiv.org/abs/2602.05717
[59] Yanbo Wang, Yongcan Yu, Jian Liang, and Ran He. A comprehensive survey on trustworthiness in reasoning with large language models, 2025a. URL https://arxiv.org/abs/2509.03871
[60] Yanbo Wang, Minzheng Wang, Jian Liang, Lu Wang, Yongcan Yu, and Ran He. Mitigating the safety-utility trade-off in LLM alignment via adaptive safe context learning, 2026c. URL https://arxiv.org/abs/2602.13562
[61] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024a. URL https://proceedings.neurips.cc/paper_files/paper/20...
[62] Zengzhi Wang, Xuefeng Li, Rui Xia, and Pengfei Liu. MathPile: A billion-token-scale pretraining corpus for math. Advances in Neural Information Processing Systems, 37:25426–25468, 2024b. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/2d0be3cd5173c10b6ec075d1c393a13d-Abstract-Datasets_and_Benchmarks_Track.html
[63] Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu. OctoThinker: Mid-training incentivizes reinforcement learning scaling. arXiv preprint arXiv:2506.20512, 2025b. URL https://arxiv.org/abs/2506.20512
[64] Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, and Debing Zhang. PretrainZero: Reinforcement active pretraining. arXiv preprint arXiv:2512.03442, 2025. URL https://arxiv.org/abs/2512.03442
[65] Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025. URL https://arxiv.org/abs/2504.14945
[66] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388
[67] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. URL https://arxiv.org/abs/2503.14476
[68] Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, and Liangqiong Qu. Unveiling implicit advantage symmetry: Why GRPO struggles with exploration and difficulty adaptation. arXiv preprint arXiv:2602.05548, 2026. URL https://arxiv.org/abs/2602.05548
[69] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4OsgYD7em5
[70] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025. URL https://arxiv.org/abs/2503.18892
[71] Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and RL on reasoning language models. arXiv preprint arXiv:2512.07783, 2025. URL https://arxiv.org/abs/2512.07783
[72] Fei Zhao, Chonggang Lu, Zheyong Xie, Ziyan Liu, Haofu Qian, Jianzhao Huang, Fangcheng Shi, Zijie Meng, Hongcheng Guo, Mingqian He, et al. RedOne: Revealing domain-specific LLM post-training in social networking services. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 2648–2674, 2025. URL ht...
[73] Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, and Eric P. Xing. MegaMath: Pushing the limits of open math corpora. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=SHB0sLrZrh
[74] Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in LLM reasoning. In Proceedings of NeurIPS, 2025. URL https://openreview.net/forum?id=ftVlLG9cks
[75] Yuxin Zuo, Bingxiang He, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Cheng Qian, et al. How far can unsupervised RLVR scale LLM training? In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=VesLZukY5E