pith. machine review for the scientific record.

arxiv: 2604.14142 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:47 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords reinforcement learning · pre-train space · LLM reasoning · gradient alignment · negative sample reinforcement · dual space RL · policy reincarnation

The pith

Reinforcement learning applied to the pre-training marginal distribution P(y) serves as a viable surrogate for standard RL on P(y|x) via strong gradient alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates shifting reinforcement learning for LLM reasoning from the conditional distribution P(y|x) to the marginal distribution P(y) in pre-train space. It introduces PreRL, which applies reward-driven updates directly to P(y), backed by theoretical and empirical evidence of strong gradient alignment between log P(y) and log P(y|x). The work shows that Negative Sample Reinforcement within PreRL effectively prunes incorrect reasoning spaces and boosts endogenous reflective behaviors, and it proposes Dual Space RL, which combines pre-train space initialization with standard RL for better performance.

Core claim

The central claim is that PreRL applies reward-driven online updates directly to P(y) and that the strong gradient alignment between log P(y) and log P(y|x) makes it a viable surrogate for standard RLVR. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition thoughts by 14.89x and reflection thoughts by 6.54x. This enables DSRL, a policy reincarnation strategy that initializes with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization, consistently outperforming baselines by steering toward a refined correct reasoning subspace.

What carries the argument

PreRL applies reward-driven online updates directly to the marginal P(y); its viability rests on the gradient alignment between log P(y) and log P(y|x), with NSR serving as the driver that prunes incorrect reasoning spaces.
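As a concrete illustration of that mechanism, the sketch below shows, under stated assumptions, how a reward-weighted update on the marginal log P(y) could be written. It assumes a Hugging Face-style causal language model whose forward pass exposes logits, and the helpers sequence_logprob and nsr_prerl_loss are hypothetical names reconstructed from the review's description, not the authors' code.

import torch
import torch.nn.functional as F

def sequence_logprob(model, token_ids, prompt_len=0):
    # Sum of per-token log-probabilities over the response tokens.
    # With prompt_len=0 the response is scored on its own, i.e. log P(y);
    # with a prompt prepended and prompt_len>0 it is log P(y|x).
    logits = model(token_ids.unsqueeze(0)).logits[0, :-1]   # predict token t+1 from prefix
    targets = token_ids[1:]
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return tok_logp[max(prompt_len - 1, 0):].sum()

def nsr_prerl_loss(model, responses, advantages):
    # NSR-PreRL sketch: keep only negative-advantage rollouts and push their
    # marginal log-probability log P(y) down (responses carry no prompt tokens).
    losses = []
    for y, adv in zip(responses, advantages):
        if adv < 0:                                          # Negative Sample Reinforcement
            losses.append(-adv * sequence_logprob(model, y)) # adv < 0, so minimizing lowers log P(y)
    return torch.stack(losses).mean() if losses else torch.tensor(0.0)

Positive-advantage samples are ignored here because the review credits the NSR-only variant with the pruning effect; a full PreRL objective would presumably weight both signs.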

If this is right

  • PreRL functions as a direct surrogate for standard conditional RL without being limited by the base model's existing output distribution.
  • NSR-PreRL increases transition thoughts by 14.89x and reflection thoughts by 6.54x while pruning incorrect reasoning spaces.
  • DSRL, by first applying NSR-PreRL and then standard RL (a schematic sketch follows this list), steers the policy into a refined correct reasoning subspace and outperforms strong baselines.
  • Pre-train space optimization addresses the fundamental bottleneck where RLVR is bounded by the base model's output distribution.
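The schematic below sketches how such a policy-reincarnation schedule could be organized; the helpers nsr_prerl_step and conditional_rl_step and the warmup length are hypothetical placeholders, not the paper's actual training loop.

from typing import Any, Callable, Iterable

def dsrl_schedule(batches: Iterable[Any],
                  nsr_prerl_step: Callable[[Any], float],
                  conditional_rl_step: Callable[[Any], float],
                  warmup_steps: int = 200) -> list[float]:
    # Phase 1: NSR-PreRL warmup in pre-train space to prune incorrect reasoning paths.
    # Phase 2: standard conditional RL (e.g. a GRPO-style objective on P(y|x))
    # for fine-grained optimization inside the pruned subspace.
    losses = []
    for step, batch in enumerate(batches):
        if step < warmup_steps:
            losses.append(nsr_prerl_step(batch))
        else:
            losses.append(conditional_rl_step(batch))
    return losses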

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pruning mechanism could be applied to other generative tasks to reduce exploration of low-value outputs early in training.
  • Starting with marginal-space updates may help retain broad capabilities longer before conditional specialization.
  • The approach suggests pre-training itself could incorporate targeted reward signals to produce more capable starting models.

Load-bearing premise

The gradient alignment between log P(y) and log P(y|x) remains strong enough under realistic pre-training data and reward signals to serve as a surrogate without introducing harmful distribution shift or forgetting of general capabilities.
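The premise is empirically checkable. A minimal diagnostic in the spirit of Figure 2, assuming a Hugging Face-style causal language model and a single prompt-response pair, would compare the two gradients directly; grad_cosine below is an illustrative sketch, not the authors' exact procedure.

import torch
import torch.nn.functional as F

def grad_cosine(model, prompt_ids, response_ids):
    # Cosine similarity between the gradients of log P(y|x) and log P(y)
    # for a single prompt-response pair (one point of a Figure-2-style curve).
    def seq_logprob(ids, prompt_len):
        logits = model(ids.unsqueeze(0)).logits[0, :-1]
        logp = F.log_softmax(logits, dim=-1)
        tok = logp.gather(-1, ids[1:].unsqueeze(-1)).squeeze(-1)
        return tok[max(prompt_len - 1, 0):].sum()

    params = [p for p in model.parameters() if p.requires_grad]

    # log P(y|x): response scored with the prompt prepended.
    g_cond = torch.autograd.grad(
        seq_logprob(torch.cat([prompt_ids, response_ids]), len(prompt_ids)), params)
    # log P(y): response scored on its own.
    g_marg = torch.autograd.grad(seq_logprob(response_ids, 0), params)

    flatten = lambda grads: torch.cat([g.reshape(-1) for g in grads])
    return F.cosine_similarity(flatten(g_cond), flatten(g_marg), dim=0)

A value near 1 across training would support treating P(y) updates as a surrogate for P(y|x) updates; degradation under sparse or x-dependent rewards is precisely the failure mode flagged above.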

What would settle it

An experiment in which PreRL or NSR-PreRL produces lower final reasoning accuracy or measurable degradation on unrelated general capabilities tasks compared to standard RLVR on identical base models and rewards.

Figures

Figures reproduced from arXiv: 2604.14142 by Bo Liu, Jun Zhao, Kang Liu, Minzheng Wang, Shizhu He, Tian Liang, Yuqiao Tan, Zichen Liu.

Figure 1. (a) Gradient objectives of Post-train Space RL, Pre-train Space RL, and their …
Figure 2. Synergistic effect analysis of log P(y|x) and log P(y) using Qwen3-4B on AMC23. (a) Gradient dot product between ∇θ log πθ(y) and ∇θ log πθ(y|x). (b) Gradient cosine similarity. (c) Per-token log probability difference between log P(y|x) and log P(y).
Figure 3. (a) Training dynamics of PreRL vs. RL: effects on reward, response length, and …
Figure 4. Pass@K performance comparison between DSRL and GRPO across LLMs. Panels report average counts of Subgoal Setting, Enumeration, Verification, and Backtracking over training steps for GRPO vs. DSRL.
Figure 6. Evolution of problem-solving status (Solved vs. Unsolved) on the training dataset.
Figure 7. Ablation on warmup steps. Bar color intensity reflects score magnitude.
Figure 8. The Qwen3-NoThinking prompt template used for inference.
Figure 9. System prompt used to evaluate beneficial reasoning behaviors in model-generated …
Figure 10. Aligned case: token probability distribution with and without input conditioning under the same context. x is a math problem and y is a partially generated answer. The two distributions P(y|x) and P(y) show similar token rankings, suggesting the post-train space conditional distribution largely inherits the structure of the pre-train space marginal distribution.
Figure 11. Misaligned case: token probability distribution with and without input conditioning under the same context. x is a math problem and y is a partially generated answer. The top-1 token "total" under P(y|x) receives near-zero probability under P(y), indicating a significant discrepancy between the two distributions.
Figure 12. Comparison of token-level generation log-probabilities with and without input …
Figure 13. Training dynamics of NSR-PreRL warmup vs. NSR-RL warmup.
read the original abstract

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces PreRL, which performs reward-driven RL directly on the marginal distribution P(y) in pre-train space rather than the conditional P(y|x) used in standard RLVR. It claims to theoretically and empirically establish strong gradient alignment between log P(y) and log P(y|x), making PreRL a viable surrogate; introduces Negative Sample Reinforcement (NSR) that prunes incorrect reasoning spaces and increases transition and reflection thoughts by 14.89x and 6.54x; and proposes Dual Space RL (DSRL), a reincarnation strategy that initializes with NSR-PreRL before switching to standard RL, yielding consistent gains over baselines on reasoning tasks.

Significance. If the gradient alignment holds robustly under realistic pre-training distributions and sparse reasoning rewards, and if the reported thought-process gains prove reproducible, this work offers a meaningful new direction for expanding LLM reasoning capacity beyond the limits of the base model's conditional output distribution. The identification of NSR as a driver of endogenous reflection is a concrete mechanistic insight. Credit is due for attempting a theoretical derivation of the alignment and for the DSRL policy-reincarnation idea, both of which could influence subsequent RL-for-LLM research if the supporting evidence is strengthened.

major comments (2)
  1. [§3 (Theoretical Analysis)] The central claim that PreRL is a viable surrogate rests on the asserted strong gradient alignment between ∇ log P(y) and ∇ log P(y|x). The derivation implicitly assumes that the pre-training marginal over x remains representative and that the reward r(y) does not induce strong x-dependence; yet reasoning rewards are typically sparse, binary, and conditioned on narrow (x, y) pairs. Without the explicit assumptions, the exact reward formulation, and a robustness check showing that alignment does not degrade under these conditions, it is impossible to rule out that the alignment is partly tautological with the surrogate objective or that harmful distribution shift occurs.
  2. [§5.2–5.3 (Empirical Validation and Ablations)] The 14.89x and 6.54x increases in transition and reflection thoughts are load-bearing for the NSR mechanism claim. These figures must be accompanied by the precise definition of “transition” and “reflection” thoughts, the full set of baselines (including standard RLVR without NSR), and controls for post-hoc selection or prompt sensitivity. The DSRL results similarly require an ablation isolating the contribution of the NSR-PreRL initialization phase versus the subsequent fine-grained RL stage.
minor comments (3)
  1. [§2] Notation: The distinction between P(y) (marginal) and P(y|x) (conditional) is introduced clearly in the abstract but should be restated with explicit probability expressions at the beginning of §2 to avoid any ambiguity for readers unfamiliar with the pre-train-space framing.
  2. [Figure 4] Figure clarity: The plots showing thought-type counts (transition/reflection) would benefit from error bars across multiple random seeds and an explicit statement of the number of evaluation samples per condition.
  3. [§1.2] Related work: The discussion of prior RLVR methods (e.g., those optimizing P(y|x) directly) is brief; adding one or two sentences contrasting the gradient-alignment approach with existing surrogate-objective literature would strengthen context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for providing a thorough review and insightful comments that have helped us improve the clarity and rigor of our work. Below, we address each major comment in detail. We have made revisions to the manuscript to incorporate the suggested clarifications and additional analyses where feasible.

read point-by-point responses
  1. Referee: [§3 (Theoretical Analysis)] The central claim that PreRL is a viable surrogate rests on the asserted strong gradient alignment between ∇ log P(y) and ∇ log P(y|x). The derivation implicitly assumes that the pre-training marginal over x remains representative and that the reward r(y) does not induce strong x-dependence; yet reasoning rewards are typically sparse, binary, and conditioned on narrow (x, y) pairs. Without the explicit assumptions, the exact reward formulation, and a robustness check showing that alignment does not degrade under these conditions, it is impossible to rule out that the alignment is partly tautological with the surrogate objective or that harmful distribution shift occurs.

    Authors: We acknowledge the referee's concern about the implicit assumptions in our theoretical analysis. In the revised manuscript, we have explicitly listed the assumptions in a new paragraph in §3: namely, that the pre-training marginal distribution over x is representative for the downstream tasks, and that the reward function r(y) depends primarily on the quality of y rather than specific x-y interactions for the reasoning problems considered. We have also provided the exact reward formulation, which is a binary verifiable reward based on the correctness of the final answer for mathematical and coding tasks. Regarding the robustness check, we have added an analysis in the appendix demonstrating that the gradient alignment remains strong even under increased reward sparsity in our experimental settings. We agree that this strengthens the claim that PreRL serves as a viable surrogate without harmful distribution shift. revision: yes

  2. Referee: [§5.2–5.3 (Empirical Validation and Ablations)] The 14.89x and 6.54x increases in transition and reflection thoughts are load-bearing for the NSR mechanism claim. These figures must be accompanied by the precise definition of “transition” and “reflection” thoughts, the full set of baselines (including standard RLVR without NSR), and controls for post-hoc selection or prompt sensitivity. The DSRL results similarly require an ablation isolating the contribution of the NSR-PreRL initialization phase versus the subsequent fine-grained RL stage.

    Authors: We appreciate this feedback on the empirical sections. In the revised manuscript, we have included precise definitions: 'transition thoughts' refer to intermediate reasoning steps that mark a shift from an incorrect path to a correct one, and 'reflection thoughts' are those involving explicit reconsideration or self-correction of prior steps. We now report the full set of baselines, including standard RLVR without NSR, in Table 2 and Figure 3. To address potential post-hoc selection and prompt sensitivity, we have added results averaged over 5 different prompts and 3 random seeds, with standard deviations. Additionally, we have included a new ablation study for DSRL in §5.3 that isolates the effect of the NSR-PreRL initialization phase by comparing it to direct standard RL and to a version without the reincarnation step. These changes clarify the contribution of each component. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain remains self-contained

full rationale

The paper asserts a theoretical and empirical validation of gradient alignment between log P(y) and log P(y|x) to position PreRL as a surrogate for standard RL, followed by NSR-PreRL pruning and DSRL reincarnation. No equations, self-citations, or derivations are exhibited in the provided sections that reduce the alignment claim to a fitted input, self-definition, or prior author result by construction. The central premise draws on independent empirical observations of behavior changes (e.g., 14.89x increase in transition thoughts) and the proposed dual-space strategy, which do not collapse back into the alignment statement itself. Absent explicit load-bearing reductions or ansatz smuggling, the chain does not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven assumption that gradient alignment between the marginal and conditional distributions survives realistic pre-training corpora and reward models; no free parameters or new physical entities are introduced, only algorithmic constructs.

axioms (1)
  • domain assumption Strong gradient alignment exists between log P(y) and log P(y|x) under the chosen reward model
    Invoked to justify PreRL as surrogate; location: abstract claim of theoretical validation
invented entities (1)
  • Negative Sample Reinforcement (NSR) no independent evidence
    purpose: Prune incorrect reasoning spaces and stimulate reflective behaviors
    New mechanism introduced to drive the pre-train updates

pith-pipeline@v0.9.0 · 5576 in / 1413 out tokens · 37634 ms · 2026-05-10T12:47:19.714468+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

79 extracted references · 44 canonical work pages · 15 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. URL https://arxiv.org/abs/2303.08774

  2. [2]

    Reincarnating reinforcement learning: Reusing prior computation to accelerate progress

    Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Advances in neural information processing systems, 35: 0 28955--28971, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/ba1c5356d9164bb64c446a4b690226b0-Abst...

  3. [3]

    Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gall \'e , Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet \"U st \"u n, and Sara Hooker. Back to basics: Revisiting REINFORCE -style optimization for learning from human feedback in LLM s. In Proceedings of ACL, pp.\ 12248--12267, 2024. URL https://aclanthology.org/2024.acl-long.662/

  4. [4]

    What matters in on-policy reinforcement learning? A large-scale empirical study

    Marcin Andrychowicz, Anton Raichuk, Piotr Sta \'n czyk, Manu Orsini, Sertan Girgin, Raphael Marinier, L \'e onard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? a large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020. URL https://arxiv.org/abs/2006.05990

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418...

  6. [6]

    Vi-curl: Stabilizing verifier-independent rl reasoning via confidence-guided variance reduction

    Xin-Qiang Cai and Masashi Sugiyama. Vi-curl: Stabilizing verifier-independent rl reasoning via confidence-guided variance reduction. arXiv preprint arXiv:2602.12579, 2026. URL https://arxiv.org/abs/2602.12579

  7. [7]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URL https://arxiv.org/abs/2107.03374

  8. [8]

    SEAL: Steerable reasoning calibration of large language models for free

    Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, and Zhangyang Wang. Seal: Steerable reasoning calibration of large language models for free. arXiv preprint arXiv:2504.07986, 2025 a . URL https://arxiv.org/abs/2504.07986

  9. [9]

    Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024. URL https://arxiv.org/abs/2412.21187

  10. [10]

    Pass@k training for adaptively balancing exploration and exploitation of large reasoning models

    Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025 b . URL https://arxiv.org/abs/2508.10751

  11. [11]

    Continual pre-training mitigates forgetting in language and vision

    Andrea Cossu, Antonio Carta, Lucia Passaro, Vincenzo Lomonaco, Tinne Tuytelaars, and Davide Bacciu. Continual pre-training mitigates forgetting in language and vision. Neural Networks, 179: 0 106492, 2024. URL https://www.sciencedirect.com/science/article/pii/S0893608024004167

  12. [12]

    Gemini 2.0 flash thinking, 2024

    Google DeepMind. Gemini 2.0 flash thinking, 2024. URL https://deepmind.google/technologies/gemini/flash-thinking/

  13. [13]

    Reinforcement pre-training

    Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, and Furu Wei. Reinforcement pre-training. arXiv preprint arXiv:2506.08007, 2025. URL https://arxiv.org/abs/2506.08007

  14. [14]

    How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization

    Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu, Lu Pan, Ke Zeng, and Xunliang Cai. How to allocate, how to learn? dynamic rollout allocation and advantage modulation for policy optimization. arXiv preprint arXiv:2602.19208, 2026 a . URL https://arxiv.org/abs/2602.19208

  15. [15]

    Proximity-based multi-turn optimization: Practical credit assignment for LLM agent training

    Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, and Peilin Zhao. Proximity-based multi-turn optimization: Practical credit assignment for llm agent training. arXiv preprint arXiv:2602.19225, 2026 b . URL https://arxiv.org/abs/2602.19225

  16. [16]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645 0 (8081): 0 633--638, 2025 a . URL https://www.nature.com/articles/s41586-025-09422-z

  17. [17]

    Tree-based dialogue reinforced policy optimization for red-teaming attacks

    Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, and Dan Roth. Tree-based dialogue reinforced policy optimization for red-teaming attacks. arXiv preprint arXiv:2510.02286, 2025 b . URL https://arxiv.org/abs/2510.02286

  18. [18]

    Continual pre-training of large language models: How to (re)warm your model?

    Benjamin Gupta, Kshitij ou2025llmsand Th \'e rien, Adam Ibrahim, Mats L Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timoth \'e e Lesort. Continual pre-training of large language models: How to (re) warm your model? arXiv preprint arXiv:2308.04014, 2023. URL https://arxiv.org/abs/2308.04014

  19. [19]

    RLP: Reinforcement as a pretraining objective

    Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. Rlp: Reinforcement as a pretraining objective. arXiv preprint arXiv:2510.01265, 2025. URL https://arxiv.org/abs/2510.01265

  20. [20]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  21. [21]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025. URL https://arxiv.org/abs/2501.03262

  22. [22]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025 a . URL https://arxiv.org/abs/2503.24290

  23. [23]

    Test-time learning for large language models

    Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. arXiv preprint arXiv:2505.20633, 2025 b . URL https://arxiv.org/abs/2505.20633

  24. [24]

    Remit: Rl-guided mid-training for iterative llm evolution

    Junjie Huang, Jiarui Qin, Di Yin, Weiwen Liu, Yong Yu, Xing Sun, and Weinan Zhang. Remit: Rl-guided mid-training for iterative llm evolution. arXiv preprint arXiv:2602.03075, 2026. URL https://arxiv.org/abs/2602.03075

  25. [25]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. URL https://arxiv.org/abs/2410.21276

  26. [26]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of SOSP, pp.\ 611--626, 2023. URL https://dl.acm.org/doi/abs/10.1145/3600006.3613165

  27. [27]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=IFXTZERXdM7

  28. [28]

    The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward

    Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, et al. The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. arXiv preprint arXiv:2509.07430, 2025 a . URL https://arxiv.org/abs/2509.07430

  29. [29]

    Reinforcement learning on pre-training data

    Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, et al. Reinforcement learning on pre-training data. arXiv preprint arXiv:2509.19249, 2025 b . URL https://arxiv.org/abs/2509.19249

  30. [30]

    Squeeze the soaked sponge: Efficient off-policy reinforcement finetuning for large language model

    Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, and Jianye Hao. Squeeze the soaked sponge: Efficient off-policy reinforcement finetuning for large language model. arXiv preprint arXiv:2507.06892, 2025. URL https://arxiv.org/abs/2507.06892

  31. [31]

    ResAdapt: Adaptive resolution for efficient multimodal reasoning

    Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Jun Zhao, Kun Xu, and Kang Liu. Resadapt: Adaptive resolution for efficient multimodal reasoning. arXiv preprint arXiv:2603.28610, 2026. URL https://arxiv.org/abs/2603.28610

  32. [32]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In Proceedings of ICLR, 2023. URL https://openreview.net/forum?id=v8L0pN6EOi

  33. [33]

    Qfft, question-free fine-tuning for adaptive reasoning

    Wanlong Liu, Junxiao Xu, Fei Yu, Yukang Lin, Ke Ji, Wenyu Chen, Lifeng Shang, Yasheng Wang, Yan Xu, and Benyou Wang. Qfft, question-free fine-tuning for adaptive reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025 a . URL https://openreview.net/forum?id=CrBWOjZoKc

  34. [34]

    Automated optimization modeling via a localizable error-driven perspective

    Weiting Liu, Han Wu, Yufei Kuang, Xiongwei Han, Tao Zhong, Jianfeng Feng, and Wenlian Lu. Automated optimization modeling via a localizable error-driven perspective. arXiv preprint arXiv:2602.11164, 2026. URL https://arxiv.org/abs/2602.11164

  35. [35]

    Understanding r1-zero-like training: A critical perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Proceedings of COLM, 2025 b . URL https://openreview.net/forum?id=5PAF7PAY2Y

  36. [36]

    Reasoning models can be effective without thinking

    Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. arXiv preprint arXiv:2504.09858, 2025. URL https://arxiv.org/abs/2504.09858

  37. [37]

    American mathematics contest 12 (amc 12), November 2023

    MAA . American mathematics contest 12 (amc 12), November 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions

  38. [38]

    American invitational mathematics examination (aime), February 2024

    MAA . American invitational mathematics examination (aime), February 2024. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

  39. [39]

    American invitational mathematics examination (aime), February 2025

    MAA . American invitational mathematics examination (aime), February 2025. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

  40. [40]

    How do llms acquire new knowledge? a knowledge circuits perspective on continual pre-training

    Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, and Huajun Chen. How do llms acquire new knowledge? a knowledge circuits perspective on continual pre-training. In Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 19889--19913, 2025. URL https://aclanthology.org/2025.findings-acl.1021/

  41. [41]

    Openwebmath: An open dataset of high-quality mathematical web text

    Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. Openwebmath: An open dataset of high-quality mathematical web text. In The Twelfth International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=jKHmjlpViu

  42. [42]

    SimKO: Simple pass@k policy optimization

    Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, and Yandong Wen. Simko: Simple pass@ k policy optimization. arXiv preprint arXiv:2510.14807, 2025. URL https://arxiv.org/abs/2510.14807

  43. [43]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019. URL https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf

  44. [44]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020. URL http://www.jmlr.org/papers/v21/20-074.html

  45. [45]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First conference on language modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98&utm_campaign=The

  46. [46]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015. URL https://arxiv.org/abs/1506.02438

  47. [47]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/abs/1707.06347

  48. [48]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300

  49. [49]

    HybridFlow: A flexible and efficient RLHF framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp.\ 1279--1297, 2025. URL https://dl.acm.org/doi/abs/10.1145/3689031.3696075

  50. [50]

    Scaling agents via continual pre-training

    Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, et al. Scaling agents via continual pre-training. arXiv preprint arXiv:2509.13310, 2025. URL https://arxiv.org/abs/2509.13310

  51. [51]

    Ernie 2.0: A continual pre-training framework for language understanding

    Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. Ernie 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 8968--8975, 2020. URL https://ojs.aaai.org/index.php/aaai/article/view/6428

  52. [52]

    Reinforcement learning: An introduction, volume 1

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  53. [53]

    Challenging big-bench tasks and whether chain-of-thought can solve them

    Mirac Suzgun, Nathan Scales, Nathanael Sch \"a rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp.\ 13003--13051, 2023. URL https://aclanthology.org/2023...

  54. [54]

    The zero-step thinking: An empirical study of mode selection as harder early exit in reasoning models

    Yuqiao Tan, Shizhu He, Kang Liu, and Jun Zhao. The zero-step thinking: An empirical study of mode selection as harder early exit in reasoning models. In NeurIPS 2025 Workshop on Efficient Reasoning, 2025 a . URL https://openreview.net/forum?id=CPXmurtK0H

  55. [55]

    Bottom-up policy optimization: Your language model policy secretly contains internal policies

    Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, and Kang Liu. Bottom-up policy optimization: Your language model policy secretly contains internal policies. arXiv preprint arXiv:2512.19673, 2025 b . URL https://arxiv.org/abs/2512.19673

  56. [56]

    Kimi k2.5: Visual agentic intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026. URL https://arxiv.org/abs/2602.02276

  57. [57]

    Adaptive social learning via mode policy optimization for language agents

    Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, and Wenji Mao. Adaptive social learning via mode policy optimization for language agents. In The Fourteenth International Conference on Learning Representations, 2026 a . URL https://openreview.net/forum?id=GG7YQnsdhp

  58. [58]

    Anchored policy optimization: Mitigating exploration collapse via support-constrained rectification

    Tianyi Wang, Long Li, Hongcan Guo, Yibiao Chen, Yixia Li, Yong Wang, Yun Chen, and Guanhua Chen. Anchored policy optimization: Mitigating exploration collapse via support-constrained rectification. arXiv preprint arXiv:2602.05717, 2026 b . URL https://arxiv.org/abs/2602.05717

  59. [59]

    A comprehensive survey on trustworthiness in reasoning with large language models

    Yanbo Wang, Yongcan Yu, Jian Liang, and Ran He. A comprehensive survey on trustworthiness in reasoning with large language models, 2025 a . URL https://arxiv.org/abs/2509.03871

  60. [60]

    Mitigating the safety-utility trade-off in LLM alignment via adaptive safe context learning

    Yanbo Wang, Minzheng Wang, Jian Liang, Lu Wang, Yongcan Yu, and Ran He. Mitigating the safety-utility trade-off in llm alignment via adaptive safe context learning, 2026 c . URL https://arxiv.org/abs/2602.13562

  61. [61]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 0 95266--95290, 2024 a . URL https://proceedings.neurips.cc/paper_files/paper/20...

  62. [62]

    Mathpile: A billion-token-scale pretraining corpus for math

    Zengzhi Wang, Xuefeng Li, Rui Xia, and Pengfei Liu. Mathpile: A billion-token-scale pretraining corpus for math. Advances in Neural Information Processing Systems, 37: 0 25426--25468, 2024 b . URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/2d0be3cd5173c10b6ec075d1c393a13d-Abstract-Datasets_and_Benchmarks_Track.html

  63. [63]

    OctoThinker: Mid-training incentivizes reinforcement learning scaling

    Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu. Octothinker: Mid-training incentivizes reinforcement learning scaling. arXiv preprint arXiv:2506.20512, 2025 b . URL https://arxiv.org/abs/2506.20512

  64. [64]

    Pretrainzero: Reinforcement active pretraining

    Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, and Debing Zhang. Pretrainzero: Reinforcement active pretraining. arXiv preprint arXiv:2512.03442, 2025. URL https://arxiv.org/abs/2512.03442

  65. [65]

    Learning to Reason under Off-Policy Guidance

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025. URL https://arxiv.org/abs/2504.14945

  66. [66]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388

  67. [67]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. URL https://arxiv.org/abs/2503.14476

  68. [68]

    Unveiling implicit advantage symmetry: Why grpo struggles with exploration and difficulty adaptation

    Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, and Liangqiong Qu. Unveiling implicit advantage symmetry: Why grpo struggles with exploration and difficulty adaptation. arXiv preprint arXiv:2602.05548, 2026. URL https://arxiv.org/abs/2602.05548

  69. [69]

    Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4OsgYD7em5

  70. [70]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025. URL https://arxiv.org/abs/2503.18892

  71. [71]

    On the interplay of pre-training, mid-training, and RL on reasoning language models

    Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and rl on reasoning language models. arXiv preprint arXiv:2512.07783, 2025. URL https://arxiv.org/abs/2512.07783

  72. [72]

    Redone: Revealing domain-specific llm post-training in social networking services

    Fei Zhao, Chonggang Lu, Zheyong Xie, Ziyan Liu, Haofu Qian, Jianzhao Huang, Fangcheng Shi, Zijie Meng, Hongcheng Guo, Mingqian He, et al. Redone: Revealing domain-specific llm post-training in social networking services. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp.\ 2648--2674, 2025. URL ht...

  73. [73]

    Megamath: Pushing the limits of open math corpora

    Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, and Eric P Xing. Megamath: Pushing the limits of open math corpora. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=SHB0sLrZrh

  74. [74]

    The surprising effectiveness of negative reinforcement in LLM reasoning

    Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in LLM reasoning. In Proceedings of NeurIPS, 2025. URL https://openreview.net/forum?id=ftVlLG9cks

  75. [75]

    How far can unsupervised RLVR scale LLM training?

    Yuxin Zuo, Bingxiang He, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Cheng Qian, et al. How far can unsupervised rlvr scale llm training? In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=VesLZukY5E
