pith. machine review for the scientific record.

arxiv: 2504.05118 · v3 · submitted 2025-04-07 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Authors on Pith no claims yet

Pith reviewed 2026-05-13 09:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords VAPO · value-based reinforcement learning · long chain-of-thought · reasoning models · AIME 2024 · proximal policy optimization · training stability · sparse rewards

The pith

VAPO reaches 60.4 on AIME 2024 by fixing value bias, variable lengths, and sparse rewards in RL for reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VAPO, a value-based augmented proximal policy optimization framework for training large reasoning models on long chain-of-thought tasks. It targets three persistent issues in value-based reinforcement learning: bias in the value model, sequences that vary in length, and rewards that appear only rarely. By integrating targeted fixes for these issues, VAPO delivers higher accuracy on hard math reasoning benchmarks while training faster and without the crashes common in prior runs. A reader would care because reliable training of advanced reasoning could reduce the compute and instability barriers that currently limit such models.

Core claim

VAPO provides an integrated solution to value model bias, heterogeneous sequence lengths, and reward sparsity in long-CoT reasoning. Built on the Qwen 32B model, the framework attains a score of 60.4 on the AIME 2024 dataset, outperforming prior reported results for DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points under identical settings. It reaches this performance in only 5000 training steps and maintains stability with no crashes across multiple independent runs.

What carries the argument

The VAPO framework, which augments proximal policy optimization with value-based components to mitigate bias, length heterogeneity, and reward sparsity during reasoning model training.
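The abstract does not spell out the fixes, but the pathway they modify is standard: a learned value model feeds Generalized Advantage Estimation (GAE), whose advantages enter the PPO surrogate. Below is a minimal NumPy sketch of GAE, plus one hypothetical length-adaptive λ schedule of the kind that could address value bias under heterogeneous sequence lengths; the schedule and its `alpha` parameter are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` carries one extra entry for the bootstrap state, so
    len(values) == len(rewards) + 1.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # One-step TD error, then exponentially weighted backward sum.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def length_adaptive_lambda(length, alpha=0.05):
    """Hypothetical schedule: longer sequences get lambda closer to 1,
    so a sparse terminal reward propagates further back. Clamped at 0
    for very short sequences."""
    return max(0.0, 1.0 - 1.0 / (alpha * length))
```

With a single terminal reward and λ = 1, every token receives the full return, which is one way sparse-reward credit can be spread over a long chain of thought.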

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fixes for bias and sparsity could be tested on non-math reasoning domains such as code generation or scientific question answering.
  • If stability scales with model size, value-based methods might become the default for long-horizon language-model training where crashes currently waste compute.
  • The 5000-step convergence suggests future experiments could measure wall-clock time or total tokens processed to quantify efficiency gains beyond step count.
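The last point is easy to operationalize: step counts hide batch size and rollout length. A back-of-envelope cost model in total generated tokens, with all numbers hypothetical rather than from the paper:

```python
def total_rollout_tokens(steps, prompts_per_step, rollouts_per_prompt, avg_response_len):
    """Total generated tokens across training: a hardware-agnostic
    efficiency proxy that a bare step count does not capture."""
    return steps * prompts_per_step * rollouts_per_prompt * avg_response_len

# Two hypothetical runs with identical step counts but very different budgets.
cheap = total_rollout_tokens(5000, 256, 1, 4000)
costly = total_rollout_tokens(5000, 256, 16, 8000)
```

Here the two runs differ by 32x in token budget despite matching "5,000 steps", which is why a reproduction should report tokens or wall-clock alongside steps.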

Load-bearing premise

The performance and stability gains come from the specific VAPO design choices rather than unreported differences in data, hyperparameters, model initialization, or evaluation protocols.

What would settle it

Reproduce the AIME 2024 experiments using identical training data, hyperparameters, model initialization, and evaluation code to verify whether the 10-point margin and zero-crash stability still appear.
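A reproduction should also report uncertainty rather than a single score: with a handful of seeds per method, a bootstrap interval on the gap shows whether a 10-point margin survives seed noise. A stdlib-only sketch; the score lists in the test are placeholders, not results from any paper:

```python
import random
import statistics

def bootstrap_margin(scores_a, scores_b, n_boot=10000, seed=0):
    """95% bootstrap CI for the mean-score gap (A minus B) across seeds."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        a = [rng.choice(scores_a) for _ in scores_a]  # resample A's seeds
        b = [rng.choice(scores_b) for _ in scores_b]  # resample B's seeds
        gaps.append(statistics.mean(a) - statistics.mean(b))
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]
```

If the lower endpoint stays above zero, the margin is robust to seed resampling; if it straddles zero, the comparison needs more runs before the claim settles.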

read the original abstract

We present VAPO, Value-based Augmented Proximal Policy Optimization framework for reasoning models, a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VAPO, a value-based augmented proximal policy optimization framework for long chain-of-thought reasoning in large language models. Built on Qwen-32B, it reports a state-of-the-art score of 60.4 on AIME 2024, outperforming DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points under identical experimental settings. The work highlights three challenges in value-based RL (value model bias, heterogeneous sequence lengths, reward sparsity) and claims an integrated design that yields stable, efficient training reaching SOTA performance in 5,000 steps with no crashes across multiple runs.

Significance. If the reported gains and stability are reproducible under truly matched conditions, the result would be significant for reliable RL-based reasoning, as it directly targets load-bearing issues in value-based methods for long-CoT tasks and demonstrates practical efficiency on a 32B model.

major comments (2)
  1. [Abstract] Abstract: The central claim that VAPO outperforms DeepSeek-R1-Zero-Qwen-32B and DAPO by >10 points 'under identical experimental settings' is load-bearing for the performance contribution, yet the manuscript provides no side-by-side hyperparameter table, data-mixture citation, or reproduction protocol that verifies exact matching of initialization, optimizer schedule, reward shaping, length filtering, and evaluation protocol with the cited baselines.
  2. [Methods] Methods (implied by abstract description): The integrated solutions for value-model bias, heterogeneous lengths, and reward sparsity are presented at a high level without quantitative ablations or controlled experiments isolating each component's contribution to the 60.4 score and crash-free training; this weakens attribution of the stability and efficiency gains specifically to VAPO design choices.
minor comments (1)
  1. [Abstract] Abstract: The phrasing 'Benchmarked the AIME 2024 dataset' is grammatically incomplete and should be revised for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review of our manuscript. We appreciate the feedback on clarifying the experimental comparisons and providing more detailed ablations. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that VAPO outperforms DeepSeek-R1-Zero-Qwen-32B and DAPO by >10 points 'under identical experimental settings' is load-bearing for the performance contribution, yet the manuscript provides no side-by-side hyperparameter table, data-mixture citation, or reproduction protocol that verifies exact matching of initialization, optimizer schedule, reward shaping, length filtering, and evaluation protocol with the cited baselines.

    Authors: We acknowledge the importance of verifiable identical settings. The comparisons were conducted by strictly following the hyperparameter settings, data mixtures, and protocols as described in the original DeepSeek-R1-Zero and DAPO papers. To address this, we will include a detailed side-by-side hyperparameter table in the revised manuscript, citing specific sections from the baseline papers for initialization, optimizer schedule, reward shaping, length filtering, and evaluation. Additionally, we plan to release our full training code and scripts to facilitate exact reproduction. revision: yes

  2. Referee: [Methods] Methods (implied by abstract description): The integrated solutions for value-model bias, heterogeneous lengths, and reward sparsity are presented at a high level without quantitative ablations or controlled experiments isolating each component's contribution to the 60.4 score and crash-free training; this weakens attribution of the stability and efficiency gains specifically to VAPO design choices.

    Authors: We agree that quantitative ablations would better isolate the contributions of each component. In the revised manuscript, we will add a new section with controlled ablation experiments. These will include variants where we disable the value bias correction, the heterogeneous length handling, and the reward sparsity mitigation one at a time, reporting the resulting performance on AIME 2024 and training stability metrics (e.g., crash rates and convergence speed). This will provide direct evidence for the impact of each design choice. revision: yes
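The leave-one-out protocol the rebuttal promises can be pinned down as a small config generator; the component names below are stand-ins for whatever the revised paper calls them, not identifiers from the manuscript:

```python
def ablation_grid(components=("value_bias_fix", "length_handling", "sparsity_mitigation")):
    """Leave-one-out ablations: the full system plus one run per disabled
    component, so each score gap attributes to exactly one design choice."""
    full = {c: True for c in components}
    runs = [("full", dict(full))]
    for c in components:
        cfg = dict(full)
        cfg[c] = False  # disable exactly this component
        runs.append((f"no_{c}", cfg))
    return runs
```

Running each config with matched seeds and reporting AIME 2024 accuracy plus crash rate per variant would directly answer the referee's attribution concern.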

Circularity Check

0 steps flagged

No circularity; empirical benchmark results with independent design claims

full rationale

The paper presents VAPO as an integrated RL framework addressing value-model bias, heterogeneous lengths, and reward sparsity in long-CoT reasoning. Central claims consist of empirical AIME 2024 benchmarks (60.4 score, >10-point gains over DeepSeek-R1-Zero-Qwen-32B and DAPO under identical settings, stable 5k-step training with no crashes). No equations, predictions, or first-principles derivations are shown that reduce by construction to fitted inputs, self-citations, or ansatzes. The work is self-contained as an empirical contribution; design choices are described as systematic without load-bearing self-referential loops or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the work is presented as an empirical engineering contribution rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5605 in / 1074 out tokens · 43570 ms · 2026-05-13T09:30:15.431236+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of 60.4. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  3. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...

  4. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  5. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  6. Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...

  7. Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

  8. User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.

  9. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  10. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG 2026-05 unverdicted novelty 6.0

    Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetr...

  11. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  12. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  13. Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

  14. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  15. HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

    cs.LG 2026-05 unverdicted novelty 6.0

    HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

  16. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  17. Gradient Extrapolation-Based Policy Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...

  18. Segment-Aligned Policy Optimization for Multi-Modal Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

  19. Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

    cs.CL 2026-04 unverdicted novelty 6.0

    LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.

  20. V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.

  21. GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

  22. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  23. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  24. AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

    cs.LG 2025-05 conditional novelty 6.0

    AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.

  25. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  26. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  27. Adaptive Negative Reinforcement for LLM Reasoning:Dynamically Balancing Correction and Diversity in RLVR

    cs.LG 2026-05 unverdicted novelty 4.0

    Adaptive scheduling of penalties over training time plus confidence-based weighting of mistakes improves LLM performance on math reasoning benchmarks compared to fixed-penalty negative reinforcement.

  28. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  29. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

  30. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 27 Pith papers · 13 internal anchors

  1. [1]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms, 2024. URL https://arxiv.org/abs/2402.14740

  2. [2]

    Claude 3.5 sonnet, 2024

Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet

  3. [3]

    Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  4. [4]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023

  5. [5]

    Gemini 2.0 flash thinking, 2024

Google DeepMind. Gemini 2.0 flash thinking, 2024. URL https://deepmind.google/technologies/gemini/flash-thinking/

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  7. [7]

Reporting explained variance

    Ron Good and Harold J. Fletcher. Reporting explained variance. Journal of Research in Science Teaching, 18(1):1–7, 1981. doi: https://doi.org/10.1002/tea.3660180102. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/tea.3660180102

  8. [8]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025

  9. [9]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025. URL https://arxiv.org/abs/2503.24290

  10. [10]

Buy 4 REINFORCE samples, get a baseline for free!

    Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! In Deep Reinforcement Learning Meets Structured Prediction, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=r1lgTGL5DE

  11. [11]

    DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  12. [12]

    Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783

  13. [13]

    Real: Efficient rlhf training of large language models with parameter reallocation

Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient rlhf training of large language models with parameter reallocation. In Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025. mlsys.org, 2025

  14. [14]

    Self-imitation learning

Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3878–3887. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/oh18b.html

  15. [15]

    GPT-4 Technical Report

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  16. [16]

    Learning to reason with llms, 2024

OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/learning-to-reason-with-llms/

  17. [17]

    Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  18. [18]

    Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  19. [19]

    Qwq-32b: Embracing the power of reinforcement learning, 2024

Qwen. Qwq-32b: Embracing the power of reinforcement learning, 2024. URL https://qwenlm.github.io/blog/qwq-32b/

  20. [20]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

  21. [21]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  22. [22]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  23. [23]

    Exploring data scaling trends and effects in reinforcement learning from human feedback

Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, and Lin Yan. Exploring data scaling trends and effects in reinforcement learning from human feedback. arXiv preprint arXiv:2503.22230, 2025

  24. [24]

Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

  25. [25]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  26. [26]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  27. [27]

Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural...

  28. [28]

    Grok 3 beta — the age of reasoning agents, 2024

XAI. Grok 3 beta — the age of reasoning agents, 2024. URL https://x.ai/news/grok-3

  29. [29]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  30. [30]

    What’s behind ppo’s collapse in long-cot? value optimization holds the secret, 2025

Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What's behind ppo's collapse in long-cot? value optimization holds the secret, 2025. URL https://arxiv.org/abs/2503.01491