From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space
Pith reviewed 2026-05-10 12:47 UTC · model grok-4.3
The pith
Reinforcement learning applied to the pre-training marginal distribution P(y) serves as a viable surrogate for standard RL on P(y|x) via strong gradient alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that PreRL applies reward-driven online updates directly to P(y) and that the strong gradient alignment between log P(y) and log P(y|x) makes it a viable surrogate for standard RLVR. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition thoughts by 14.89x and reflection thoughts by 6.54x. This enables DSRL, a policy reincarnation strategy that initializes with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization, consistently outperforming baselines by steering toward a refined correct reasoning subspace.
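As a concrete, heavily simplified illustration of the two-stage schedule, the sketch below runs NSR-PreRL-style pruning on a toy tabular marginal and then standard reward-driven RL on toy conditionals. The model, reward, output space, and hyperparameters are all invented for illustration; only the shape of the schedule mirrors the paper's description.

```python
import numpy as np

# Toy two-phase sketch of the DSRL schedule (invented stand-in, not the
# paper's implementation): 6 candidate outputs, 2 prompts, tabular logits.
rng = np.random.default_rng(1)
K = 6
marginal = np.zeros(K)                       # "pre-train space" logits for P(y)
heads = {0: np.zeros(K), 1: np.zeros(K)}     # per-prompt conditional logits
answer = {0: 2, 1: 3}                        # verifier-accepted output per prompt
plausible = {2, 3}                           # outputs correct for some prompt

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Phase 1 (NSR-PreRL): sample y ~ P(y) with no prompt; apply REINFORCE
# with reward -1 to verifier-rejected samples only, pruning the
# incorrect region of the marginal.
for _ in range(500):
    p = softmax(marginal)
    y = rng.choice(K, p=p)
    if y not in plausible:
        marginal += 0.3 * (p - np.eye(K)[y])  # reward -1 times grad log P(y)

# Phase 2 (standard RL): fine-grained conditional optimization on
# P(y|x) = softmax(marginal + heads[x]) with a binary reward.
for _ in range(300):
    for x, target in answer.items():
        p = softmax(marginal + heads[x])
        y = rng.choice(K, p=p)
        if y == target:                       # reward 1, else 0 (no update)
            heads[x] += 0.3 * (np.eye(K)[y] - p)
```

After phase 1 the marginal mass sits almost entirely on the plausible set, and phase 2 then sharpens each conditional onto its own answer; the intended takeaway is the schedule's structure, not any quantitative claim.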
What carries the argument
PreRL applies reward-driven online updates directly to the marginal P(y); its viability rests on the gradient alignment between log P(y) and log P(y|x), with NSR serving as the driver that prunes incorrect reasoning spaces.
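The alignment plausibly rests on the standard marginalization identity; a sketch of one likely derivation (assumed here, not quoted from the paper), writing $P_\theta$ for the model distribution:

```latex
\nabla_\theta \log P_\theta(y)
  = \frac{\sum_x P(x)\,\nabla_\theta P_\theta(y \mid x)}{P_\theta(y)}
  = \sum_x \frac{P(x)\,P_\theta(y \mid x)}{P_\theta(y)}\,\nabla_\theta \log P_\theta(y \mid x)
  = \mathbb{E}_{x \sim P_\theta(x \mid y)}\!\left[\nabla_\theta \log P_\theta(y \mid x)\right].
```

On this reading, the marginal gradient is a posterior-weighted average of conditional gradients, so alignment should be strongest when $P_\theta(x \mid y)$ concentrates on prompts to which $y$ is a plausible response.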
If this is right
- PreRL functions as a direct surrogate for standard conditional RL without being limited by the base model's existing output distribution.
- NSR-PreRL increases transition thoughts by 14.89x and reflection thoughts by 6.54x while pruning incorrect reasoning spaces.
- DSRL, by first applying NSR-PreRL then standard RL, steers the policy into a refined correct reasoning subspace and outperforms strong baselines.
- Pre-train space optimization addresses the fundamental bottleneck where RLVR is bounded by the base model's output distribution.
Where Pith is reading between the lines
- The same pruning mechanism could be applied to other generative tasks to reduce exploration of low-value outputs early in training.
- Starting with marginal-space updates may help retain broad capabilities longer before conditional specialization.
- The approach suggests pre-training itself could incorporate targeted reward signals to produce more capable starting models.
Load-bearing premise
The gradient alignment between log P(y) and log P(y|x) remains strong enough under realistic pre-training data and reward signals to serve as a surrogate without introducing harmful distribution shift or forgetting of general capabilities.
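This premise is, in principle, directly measurable. A toy check (a 3-prompt, 4-output tabular model standing in for an LLM, which is an assumption of this sketch) computes the marginal gradient via the posterior-weighted identity and compares it to a single conditional gradient:

```python
import numpy as np

# Toy measurement of gradient alignment between log P(y) and log P(y|x)
# on a tabular model (illustrative assumption, not the paper's setup).
theta = np.array([[3.0, 0.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0, 0.0],
                  [0.0, 0.0, 3.0, 0.0]])   # theta[x] parameterizes P(y|x)
prior = np.full(3, 1.0 / 3.0)              # uniform P(x)

def cond(theta):                           # P(y|x) for every x, shape (3, 4)
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def log_marginal(theta, y):                # log P(y) = log sum_x P(x) P(y|x)
    return np.log(prior @ cond(theta)[:, y])

def grad_log_cond(theta, x, y):            # grad of log P(y|x) w.r.t. theta
    g = np.zeros_like(theta)
    g[x] = np.eye(4)[y] - cond(theta)[x]
    return g

def grad_log_marginal(theta, y):
    # identity: grad log P(y) = E_{x ~ P(x|y)}[grad log P(y|x)]
    post = prior * cond(theta)[:, y]
    post /= post.sum()                     # Bayes posterior P(x|y)
    return sum(post[x] * grad_log_cond(theta, x, y) for x in range(3))

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

For y = 0 the posterior concentrates on x = 0, so the two gradients align strongly; degrading the concentration (flatter theta) is exactly the kind of robustness probe the premise needs.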
What would settle it
An experiment in which PreRL or NSR-PreRL produces lower final reasoning accuracy or measurable degradation on unrelated general capabilities tasks compared to standard RLVR on identical base models and rewards.
Original abstract
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PreRL, which performs reward-driven RL directly on the marginal distribution P(y) in pre-train space rather than the conditional P(y|x) used in standard RLVR. It claims to theoretically and empirically establish strong gradient alignment between log P(y) and log P(y|x), making PreRL a viable surrogate; introduces Negative Sample Reinforcement (NSR) that prunes incorrect reasoning spaces and increases transition and reflection thoughts by 14.89x and 6.54x; and proposes Dual Space RL (DSRL), a reincarnation strategy that initializes with NSR-PreRL before switching to standard RL, yielding consistent gains over baselines on reasoning tasks.
Significance. If the gradient alignment holds robustly under realistic pre-training distributions and sparse reasoning rewards, and if the reported thought-process gains prove reproducible, this work offers a meaningful new direction for expanding LLM reasoning capacity beyond the limits of the base model's conditional output distribution. The identification of NSR as a driver of endogenous reflection is a concrete mechanistic insight. Credit is due for attempting a theoretical derivation of the alignment and for the DSRL policy-reincarnation idea, both of which could influence subsequent RL-for-LLM research if the supporting evidence is strengthened.
major comments (2)
- [§3 (Theoretical Analysis)] The central claim that PreRL is a viable surrogate rests on the asserted strong gradient alignment between ∇ log P(y) and ∇ log P(y|x). The derivation implicitly assumes that the pre-training marginal over x remains representative and that the reward r(y) does not induce strong x-dependence; yet reasoning rewards are typically sparse, binary, and conditioned on narrow (x, y) pairs. Without the explicit assumptions, the exact reward formulation, and a robustness check showing that alignment does not degrade under these conditions, it is impossible to rule out that the alignment is partly tautological with the surrogate objective or that harmful distribution shift occurs.
- [§5.2–5.3 (Empirical Validation and Ablations)] The 14.89x and 6.54x increases in transition and reflection thoughts are load-bearing for the NSR mechanism claim. These figures must be accompanied by the precise definition of “transition” and “reflection” thoughts, the full set of baselines (including standard RLVR without NSR), and controls for post-hoc selection or prompt sensitivity. The DSRL results similarly require an ablation isolating the contribution of the NSR-PreRL initialization phase versus the subsequent fine-grained RL stage.
minor comments (3)
- [§2] Notation: The distinction between P(y) (marginal) and P(y|x) (conditional) is introduced clearly in the abstract but should be restated with explicit probability expressions at the beginning of §2 to avoid any ambiguity for readers unfamiliar with the pre-train-space framing.
- [Figure 4] Figure clarity: The plots showing thought-type counts (transition/reflection) would benefit from error bars across multiple random seeds and an explicit statement of the number of evaluation samples per condition.
- [§1.2] Related work: The discussion of prior RLVR methods (e.g., those optimizing P(y|x) directly) is brief; adding one or two sentences contrasting the gradient-alignment approach with existing surrogate-objective literature would strengthen context.
Simulated Author's Rebuttal
We are grateful to the referee for providing a thorough review and insightful comments that have helped us improve the clarity and rigor of our work. Below, we address each major comment in detail. We have made revisions to the manuscript to incorporate the suggested clarifications and additional analyses where feasible.
read point-by-point responses
-
Referee: [§3 (Theoretical Analysis)] The central claim that PreRL is a viable surrogate rests on the asserted strong gradient alignment between ∇ log P(y) and ∇ log P(y|x). The derivation implicitly assumes that the pre-training marginal over x remains representative and that the reward r(y) does not induce strong x-dependence; yet reasoning rewards are typically sparse, binary, and conditioned on narrow (x, y) pairs. Without the explicit assumptions, the exact reward formulation, and a robustness check showing that alignment does not degrade under these conditions, it is impossible to rule out that the alignment is partly tautological with the surrogate objective or that harmful distribution shift occurs.
Authors: We acknowledge the referee's concern about the implicit assumptions in our theoretical analysis. In the revised manuscript, we have explicitly listed the assumptions in a new paragraph in §3: namely, that the pre-training marginal distribution over x is representative for the downstream tasks, and that the reward function r(y) depends primarily on the quality of y rather than specific x-y interactions for the reasoning problems considered. We have also provided the exact reward formulation, which is a binary verifiable reward based on the correctness of the final answer for mathematical and coding tasks. Regarding the robustness check, we have added an analysis in the appendix demonstrating that the gradient alignment remains strong even under increased reward sparsity in our experimental settings. We agree that this strengthens the claim that PreRL serves as a viable surrogate without harmful distribution shift. revision: yes
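The binary verifiable reward described in this response can be sketched as follows; `extract_final_answer` is a hypothetical parser, since the paper's exact answer-matching procedure is not reproduced in this review:

```python
import re

def extract_final_answer(completion: str) -> str:
    # Hypothetical parser: take the content of the last \boxed{...},
    # falling back to the last number in the text.
    boxed = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if boxed:
        return boxed[-1].strip()
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return nums[-1] if nums else ""

def verifiable_reward(completion: str, gold: str) -> float:
    # Binary verifiable reward: 1.0 iff the final answer matches the reference.
    return 1.0 if extract_final_answer(completion) == gold else 0.0
```

Real verifiers normalize answers far more carefully (equivalent fractions, units, whitespace); this sketch conveys only the binary, correctness-gated shape of the signal.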
-
Referee: [§5.2–5.3 (Empirical Validation and Ablations)] The 14.89x and 6.54x increases in transition and reflection thoughts are load-bearing for the NSR mechanism claim. These figures must be accompanied by the precise definition of “transition” and “reflection” thoughts, the full set of baselines (including standard RLVR without NSR), and controls for post-hoc selection or prompt sensitivity. The DSRL results similarly require an ablation isolating the contribution of the NSR-PreRL initialization phase versus the subsequent fine-grained RL stage.
Authors: We appreciate this feedback on the empirical sections. In the revised manuscript, we have included precise definitions: 'transition thoughts' refer to intermediate reasoning steps that mark a shift from an incorrect path to a correct one, and 'reflection thoughts' are those involving explicit reconsideration or self-correction of prior steps. We now report the full set of baselines, including standard RLVR without NSR, in Table 2 and Figure 3. To address potential post-hoc selection and prompt sensitivity, we have added results averaged over 5 different prompts and 3 random seeds, with standard deviations. Additionally, we have included a new ablation study for DSRL in §5.3 that isolates the effect of the NSR-PreRL initialization phase by comparing it to direct standard RL and to a version without the reincarnation step. These changes clarify the contribution of each component. revision: yes
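Definitions of this kind lend themselves to a simple lexical counter; the marker lists below are hypothetical stand-ins (the paper's actual thought classifier is not shown in this review), illustrating only how such ratios might be computed:

```python
import re

# Illustrative sketch only: hypothetical lexical cues for the two thought
# types defined above. A naive substring count like this would also need
# word-boundary handling and human validation in a real evaluation.
TRANSITION_MARKERS = ["alternatively", "instead", "let's try another", "switch to"]
REFLECTION_MARKERS = ["wait", "let me double-check", "i made an error", "re-examine"]

def count_thoughts(trace: str, markers: list[str]) -> int:
    # Count occurrences of any marker in a (lowercased) reasoning trace.
    text = trace.lower()
    return sum(len(re.findall(re.escape(m), text)) for m in markers)

def thought_ratio(after_trace: str, before_trace: str, markers: list[str]) -> float:
    # Ratio of marker counts after vs. before training; the paper reports
    # 14.89x (transition) and 6.54x (reflection) for NSR-PreRL.
    before = count_thoughts(before_trace, markers)
    return count_thoughts(after_trace, markers) / max(before, 1)
```

The referee's point stands independently of any particular implementation: without the exact marker definitions and seed-averaged counts, the reported multipliers are hard to audit.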
Circularity Check
No significant circularity; derivation chain remains self-contained
full rationale
The paper asserts a theoretical and empirical validation of gradient alignment between log P(y) and log P(y|x) to position PreRL as a surrogate for standard RL, followed by NSR-PreRL pruning and DSRL reincarnation. No equations, self-citations, or derivations are exhibited in the provided sections that reduce the alignment claim to a fitted input, self-definition, or prior author result by construction. The central premise draws on independent empirical observations of behavior changes (e.g., 14.89x increase in transition thoughts) and the proposed dual-space strategy, which do not collapse back into the alignment statement itself. Absent explicit load-bearing reductions or ansatz smuggling, the chain does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Strong gradient alignment exists between log P(y) and log P(y|x) under the chosen reward model
invented entities (1)
- Negative Sample Reinforcement (NSR): no independent evidence
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. URL https://arxiv.org/abs/2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Reincarnating reinforcement learning: Reusing prior computation to accelerate progress
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Advances in neural information processing systems, 35: 0 28955--28971, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/ba1c5356d9164bb64c446a4b690226b0-Abst...
2022
-
[3]
Back to basics: Revisiting REINFORCE -style optimization for learning from human feedback in LLM s
Arash Ahmadian, Chris Cremer, Matthias Gall \'e , Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet \"U st \"u n, and Sara Hooker. Back to basics: Revisiting REINFORCE -style optimization for learning from human feedback in LLM s. In Proceedings of ACL, pp.\ 12248--12267, 2024. URL https://aclanthology.org/2024.acl-long.662/
2024
-
[4]
Marcin Andrychowicz, Anton Raichuk, Piotr Sta \'n czyk, Manu Orsini, Sertan Girgin, Raphael Marinier, L \'e onard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? a large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020. URL https://arxiv.org/abs/2006.05990
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418...
1901
-
[6]
Vi-curl: Stabilizing verifier-independent rl reasoning via confidence-guided variance reduction
Xin-Qiang Cai and Masashi Sugiyama. Vi-curl: Stabilizing verifier-independent rl reasoning via confidence-guided variance reduction. arXiv preprint arXiv:2602.12579, 2026. URL https://arxiv.org/abs/2602.12579
-
[7]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URL https://arxiv.org/abs/2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
Seal: Steer- able reasoning calibration of large language models for free
Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, and Zhangyang Wang. Seal: Steerable reasoning calibration of large language models for free. arXiv preprint arXiv:2504.07986, 2025 a . URL https://arxiv.org/abs/2504.07986
-
[9]
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024. URL https://arxiv.org/abs/2412.21187
work page internal anchor Pith review arXiv 2024
-
[10]
Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025 b . URL https://arxiv.org/abs/2508.10751
-
[11]
Continual pre-training mitigates forgetting in language and vision
Andrea Cossu, Antonio Carta, Lucia Passaro, Vincenzo Lomonaco, Tinne Tuytelaars, and Davide Bacciu. Continual pre-training mitigates forgetting in language and vision. Neural Networks, 179: 0 106492, 2024. URL https://www.sciencedirect.com/science/article/pii/S0893608024004167
2024
-
[12]
Gemini 2.0 flash thinking, 2024
Google DeepMind. Gemini 2.0 flash thinking, 2024. URL https://deepmind.google/technologies/gemini/flash-thinking/
2024
-
[13]
Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, and Furu Wei. Reinforcement pre-training. arXiv preprint arXiv:2506.08007, 2025. URL https://arxiv.org/abs/2506.08007
-
[14]
Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu, Lu Pan, Ke Zeng, and Xunliang Cai. How to allocate, how to learn? dynamic rollout allocation and advantage modulation for policy optimization. arXiv preprint arXiv:2602.19208, 2026 a . URL https://arxiv.org/abs/2602.19208
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[15]
Junbo Li, Peng Zhou, Rui Meng, Meet P
Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, and Peilin Zhao. Proximity-based multi-turn optimization: Practical credit assignment for llm agent training. arXiv preprint arXiv:2602.19225, 2026 b . URL https://arxiv.org/abs/2602.19225
-
[16]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645 0 (8081): 0 633--638, 2025 a . URL https://www.nature.com/articles/s41586-025-09422-z
2025
-
[17]
Tree-based dialogue reinforced policy optimization for red-teaming attacks
Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, and Dan Roth. Tree-based dialogue reinforced policy optimization for red-teaming attacks. arXiv preprint arXiv:2510.02286, 2025 b . URL https://arxiv.org/abs/2510.02286
-
[18]
Richter, Quentin An- thony, Eugene Belilovsky, Irina Rish, and Timothée Lesort
Benjamin Gupta, Kshitij ou2025llmsand Th \'e rien, Adam Ibrahim, Mats L Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timoth \'e e Lesort. Continual pre-training of large language models: How to (re) warm your model? arXiv preprint arXiv:2308.04014, 2023. URL https://arxiv.org/abs/2308.04014
-
[19]
RLP: Reinforcement as a pretraining objective
Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. Rlp: Reinforcement as a pretraining objective. arXiv preprint arXiv:2510.01265, 2025. URL https://arxiv.org/abs/2510.01265
-
[20]
Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...
2024
-
[21]
REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025. URL https://arxiv.org/abs/2501.03262
work page internal anchor Pith review arXiv 2025
-
[22]
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025 a . URL https://arxiv.org/abs/2503.24290
work page internal anchor Pith review arXiv 2025
-
[23]
Test-time learning for large language models.arXiv preprint arXiv:2505.20633, 2025
Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. arXiv preprint arXiv:2505.20633, 2025 b . URL https://arxiv.org/abs/2505.20633
-
[24]
Remit: Rl-guided mid-training for iterative llm evolution
Junjie Huang, Jiarui Qin, Di Yin, Weiwen Liu, Yong Yu, Xing Sun, and Weinan Zhang. Remit: Rl-guided mid-training for iterative llm evolution. arXiv preprint arXiv:2602.03075, 2026. URL https://arxiv.org/abs/2602.03075
-
[25]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. URL https://arxiv.org/abs/2410.21276
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23)
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of SOSP, pp.\ 611--626, 2023. URL https://dl.acm.org/doi/abs/10.1145/3600006.3613165
-
[27]
Solving quantitative reasoning problems with language models
Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=IFXTZERXdM7
2022
-
[28]
Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, et al. The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. arXiv preprint arXiv:2509.07430, 2025 a . URL https://arxiv.org/abs/2509.07430
-
[29]
arXiv preprint arXiv:2509.19249 (2025) 7
Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, et al. Reinforcement learning on pre-training data. arXiv preprint arXiv:2509.19249, 2025 b . URL https://arxiv.org/abs/2509.19249
-
[30]
arXiv preprint arXiv:2507.06892 (2025) 3
Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, and Jianye Hao. Squeeze the soaked sponge: Efficient off-policy reinforcement finetuning for large language model. arXiv preprint arXiv:2507.06892, 2025. URL https://arxiv.org/abs/2507.06892
-
[31]
Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Jun Zhao, Kun Xu, and Kang Liu. Resadapt: Adaptive resolution for efficient multimodal reasoning. arXiv preprint arXiv:2603.28610, 2026. URL https://arxiv.org/abs/2603.28610
-
[32]
Let's verify step by step
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In Proceedings of ICLR, 2023. URL https://openreview.net/forum?id=v8L0pN6EOi
2023
-
[33]
Qfft, question-free fine-tuning for adaptive reasoning
Wanlong Liu, Junxiao Xu, Fei Yu, Yukang Lin, Ke Ji, Wenyu Chen, Lifeng Shang, Yasheng Wang, Yan Xu, and Benyou Wang. Qfft, question-free fine-tuning for adaptive reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025 a . URL https://openreview.net/forum?id=CrBWOjZoKc
2025
-
[34]
Automated optimization modeling via a localizable error-driven perspective
Weiting Liu, Han Wu, Yufei Kuang, Xiongwei Han, Tao Zhong, Jianfeng Feng, and Wenlian Lu. Automated optimization modeling via a localizable error-driven perspective. arXiv preprint arXiv:2602.11164, 2026. URL https://arxiv.org/abs/2602.11164
-
[35]
Understanding r1-zero-like training: A critical perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Proceedings of COLM, 2025 b . URL https://openreview.net/forum?id=5PAF7PAY2Y
2025
-
[36]
Reasoning models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025
Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. arXiv preprint arXiv:2504.09858, 2025. URL https://arxiv.org/abs/2504.09858
-
[37]
American mathematics contest 12 (amc 12), November 2023
MAA . American mathematics contest 12 (amc 12), November 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions
2023
-
[38]
American invitational mathematics examination (aime), February 2024
MAA . American invitational mathematics examination (aime), February 2024. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
2024
-
[39]
American invitational mathematics examination (aime), February 2025
MAA . American invitational mathematics examination (aime), February 2025. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
2025
-
[40]
How do llms acquire new knowledge? a knowledge circuits perspective on continual pre-training
Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, and Huajun Chen. How do llms acquire new knowledge? a knowledge circuits perspective on continual pre-training. In Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 19889--19913, 2025. URL https://aclanthology.org/2025.findings-acl.1021/
2025
-
[41]
Openwebmath: An open dataset of high-quality mathematical web text
Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. Openwebmath: An open dataset of high-quality mathematical web text. In The Twelfth International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=jKHmjlpViu
2023
-
[42]
Simko: Simple pass@ k policy optimization.arXiv preprint arXiv:2510.14807, 2025
Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, and Yandong Wen. Simko: Simple pass@ k policy optimization. arXiv preprint arXiv:2510.14807, 2025. URL https://arxiv.org/abs/2510.14807
-
[43]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019. URL https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf
2019
-
[44]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020. URL http://www.jmlr.org/papers/v21/20-074.html
2020
-
[45]
Gpqa: A graduate-level google-proof q&a benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First conference on language modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98&utm_campaign=The
2024
-
[46]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015. URL https://arxiv.org/abs/1506.02438
work page internal anchor Pith review arXiv 2015
-
[47]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/abs/1707.06347
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[48]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp.\ 1279--1297, 2025. URL https://dl.acm.org/doi/abs/10.1145/3689031.3696075
-
[50]
Scaling agents via continual pre-training.arXiv preprint arXiv:2509.13310, 2025
Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, et al. Scaling agents via continual pre-training. arXiv preprint arXiv:2509.13310, 2025. URL https://arxiv.org/abs/2509.13310
-
[51]
Ernie 2.0: A continual pre-training framework for language understanding
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. Ernie 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 8968--8975, 2020. URL https://ojs.aaai.org/index.php/aaai/article/view/6428
2020
-
[52]
Reinforcement learning: An introduction, volume 1
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998
1998
-
[53]
Challenging big-bench tasks and whether chain-of-thought can solve them
Mirac Suzgun, Nathan Scales, Nathanael Sch \"a rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp.\ 13003--13051, 2023. URL https://aclanthology.org/2023...
2023
-
[54]
The zero-step thinking: An empirical study of mode selection as harder early exit in reasoning models
Yuqiao Tan, Shizhu He, Kang Liu, and Jun Zhao. The zero-step thinking: An empirical study of mode selection as harder early exit in reasoning models. In NeurIPS 2025 Workshop on Efficient Reasoning, 2025 a . URL https://openreview.net/forum?id=CPXmurtK0H
2025
-
[55]
Bottom-up policy optimization: Your language model policy secretly contains internal policies
Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, and Kang Liu. Bottom-up policy optimization: Your language model policy secretly contains internal policies. arXiv preprint arXiv:2512.19673, 2025 b . URL https://arxiv.org/abs/2512.19673
-
[56]
Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026. URL https://arxiv.org/abs/2602.02276
work page internal anchor Pith review arXiv 2026
-
[57] Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, and Wenji Mao. Adaptive social learning via mode policy optimization for language agents. In The Fourteenth International Conference on Learning Representations, 2026a. URL https://openreview.net/forum?id=GG7YQnsdhp
[58] Tianyi Wang, Long Li, Hongcan Guo, Yibiao Chen, Yixia Li, Yong Wang, Yun Chen, and Guanhua Chen. Anchored policy optimization: Mitigating exploration collapse via support-constrained rectification. arXiv preprint arXiv:2602.05717, 2026b. URL https://arxiv.org/abs/2602.05717
[59] Yanbo Wang, Yongcan Yu, Jian Liang, and Ran He. A comprehensive survey on trustworthiness in reasoning with large language models, 2025a. URL https://arxiv.org/abs/2509.03871
[60] Yanbo Wang, Minzheng Wang, Jian Liang, Lu Wang, Yongcan Yu, and Ran He. Mitigating the safety-utility trade-off in LLM alignment via adaptive safe context learning, 2026c. URL https://arxiv.org/abs/2602.13562
[61] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024a. URL https://proceedings.neurips.cc/paper_files/paper/20...
[62] Zengzhi Wang, Xuefeng Li, Rui Xia, and Pengfei Liu. MathPile: A billion-token-scale pretraining corpus for math. Advances in Neural Information Processing Systems, 37:25426–25468, 2024b. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/2d0be3cd5173c10b6ec075d1c393a13d-Abstract-Datasets_and_Benchmarks_Track.html
[63] Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu. OctoThinker: Mid-training incentivizes reinforcement learning scaling. arXiv preprint arXiv:2506.20512, 2025b. URL https://arxiv.org/abs/2506.20512
[64] Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, and Debing Zhang. PretrainZero: Reinforcement active pretraining. arXiv preprint arXiv:2512.03442, 2025. URL https://arxiv.org/abs/2512.03442
[65] Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025. URL https://arxiv.org/abs/2504.14945
[66] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388
[67] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. URL https://arxiv.org/abs/2503.14476
[68] Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, and Liangqiong Qu. Unveiling implicit advantage symmetry: Why GRPO struggles with exploration and difficulty adaptation. arXiv preprint arXiv:2602.05548, 2026. URL https://arxiv.org/abs/2602.05548
[69] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4OsgYD7em5
[70] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025. URL https://arxiv.org/abs/2503.18892
[71] Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and RL on reasoning language models. arXiv preprint arXiv:2512.07783, 2025. URL https://arxiv.org/abs/2512.07783
[72] Fei Zhao, Chonggang Lu, Zheyong Xie, Ziyan Liu, Haofu Qian, Jianzhao Huang, Fangcheng Shi, Zijie Meng, Hongcheng Guo, Mingqian He, et al. RedOne: Revealing domain-specific LLM post-training in social networking services. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 2648–2674, 2025. URL ht...
[73] Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, and Eric P. Xing. MegaMath: Pushing the limits of open math corpora. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=SHB0sLrZrh
[74] Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in LLM reasoning. In Proceedings of NeurIPS, 2025. URL https://openreview.net/forum?id=ftVlLG9cks
[75] Yuxin Zuo, Bingxiang He, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Cheng Qian, et al. How far can unsupervised RLVR scale LLM training? In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=VesLZukY5E