Recognition: 2 theorem links · Lean Theorem
Process Reinforcement through Implicit Rewards
Pith reviewed 2026-05-11 20:17 UTC · model grok-4.3
The pith
PRIME derives implicit process rewards from policy rollouts and outcome labels alone to enable online training of process reward models for LLM reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PRIME enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards, combines well with various advantage functions, and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead.
What carries the argument
Implicit process rewards computed from policy rollouts and outcome labels, which supply fine-grained training signals for updating the process reward model directly during reinforcement learning.
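A minimal sketch of how such an implicit reward could be computed. It assumes the log-ratio parameterization common in implicit-reward work: each token's reward is beta times the difference between the reward model's and a frozen reference model's log-probability for that token, and the sum over the response is treated as a logit trained with cross-entropy against the 0/1 outcome label. This page does not state the formula, so the function names, the `beta` values, and the toy log-probabilities are all illustrative assumptions.

```python
import math


def implicit_token_rewards(prm_logprobs, ref_logprobs, beta=0.05):
    """Per-token implicit rewards: beta * (log pi_prm - log pi_ref).

    Assumed parameterization; beta=0.05 is an illustrative default.
    """
    if len(prm_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    return [beta * (p - r) for p, r in zip(prm_logprobs, ref_logprobs)]


def outcome_ce_loss(token_rewards, outcome_label):
    """Binary cross-entropy between sigmoid(sum of rewards) and a 0/1 label."""
    logit = sum(token_rewards)
    # Numerically stable softplus(logit) = log(1 + exp(logit)).
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - outcome_label * logit


# Toy log-probabilities for a three-token response (hypothetical values).
prm_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.5, -0.5, -1.0]

rewards = implicit_token_rewards(prm_lp, ref_lp, beta=0.5)
loss = outcome_ce_loss(rewards, outcome_label=1)
print(rewards, loss)
```

Under this assumption only the outcome label enters the loss, while the per-token terms fall out of the parameterization. Whether those terms are reliable is exactly what the load-bearing premise questions.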
Load-bearing premise
Implicit rewards extracted only from full rollouts and final outcome labels can give reliable step-level credit without reward hacking or mis-assignment that would occur if the signals were noisy.
What would settle it
Running the same reinforcement learning loop on math and coding benchmarks and finding no average gain over the supervised fine-tuning baseline or observing clear reward-hacking behaviors such as length exploitation without reasoning improvement.
Original abstract
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PRIME, a method for online training of process reward models (PRMs) in LLM reinforcement learning that derives implicit process rewards solely from policy rollouts and binary outcome labels, eliminating the need for expensive dedicated process annotations. It combines this with various advantage functions and reports substantial gains on mathematical reasoning and coding benchmarks: starting from Qwen2.5-Math-7B-Base, the approach yields a 15.1% average improvement over the SFT baseline, with the resulting Eurus-2-7B-PRIME model outperforming Qwen2.5-Math-7B-Instruct on seven benchmarks using only 10% of the training data.
Significance. If the implicit-reward mechanism genuinely supplies reliable step-level signals that improve credit assignment over outcome-only RL without introducing new forms of reward hacking, the method could meaningfully reduce the cost and complexity of dense-reward RL for reasoning models. The reported benchmark improvements are large enough to be practically relevant, and the ability to forgo a separate PRM training stage is a clear engineering advantage; however, these benefits hinge on the unverified assumption that outcome-derived implicit rewards meaningfully differentiate correct versus incorrect intermediate steps in long trajectories.
major comments (3)
- [Abstract and §3] Abstract and §3 (method description): the claim that implicit process rewards 'address some inherent issues of outcome rewards, such as training efficiency and credit assignment' is load-bearing for the central contribution, yet the manuscript provides no explicit equations or pseudocode showing how per-step implicit rewards are extracted from terminal binary labels and rollouts; without this, it is impossible to determine whether the procedure reduces to uniform outcome propagation (which inherits standard RL credit-assignment failures) or introduces a non-circular, process-sensitive signal.
- [Experiments section] Experiments section (benchmark results and ablations): the 15.1% average gain and outperformance of Qwen2.5-Math-7B-Instruct are presented as evidence for the implicit-reward mechanism, but no ablation disables the implicit PRM component, compares directly against pure outcome-only RL with identical data and optimizer, or reports correlation between the learned implicit rewards and human-annotated step correctness; without these controls, attribution of gains to process-level supervision rather than base-model strength or data volume remains unestablished.
- [§4] §4 (evaluation on math/coding trajectories): in multi-step reasoning chains, a single terminal label supplies only weak supervision for intermediate steps; the paper does not include direct validation (e.g., step-level accuracy metrics or human process annotations) that the implicit rewards avoid the credit-assignment ambiguity highlighted in the skeptic note, leaving open the possibility that reported improvements stem from outcome-correlated noise rather than fine-grained process signals.
minor comments (2)
- [§3] Notation for the implicit reward function is introduced without a clear reference to the advantage estimator used (e.g., which of the 'various advantage functions' is default); a single equation or algorithm box would improve reproducibility.
- [Experiments] The abstract states '10% of its training data' for Eurus-2-7B-PRIME versus Qwen2.5-Math-7B-Instruct, but the main text does not specify the exact data volume or composition used for the instruct model, making the comparison harder to interpret.
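The minor comment about the default advantage estimator can be made concrete with one candidate. The sketch below shows a leave-one-out baseline over several rollouts of the same prompt (often called RLOO): each rollout's advantage is its reward minus the mean reward of the other rollouts. Whether this is the paper's default estimator is an assumption here, and the function name and toy rewards are illustrative.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each rollout against the mean reward of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two rollouts per prompt")
    total = sum(rewards)
    # Baseline for rollout i is the average of the remaining k - 1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]


# Four rollouts of one prompt with binary outcome rewards (hypothetical).
adv = leave_one_out_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)
```

The same function applies unchanged if the per-rollout rewards are sums of step-level rewards rather than a single outcome label, which is one way an estimator of this kind could combine with dense signals.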
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and have prepared revisions to improve clarity and strengthen the experimental support for our claims.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method description): the claim that implicit process rewards 'address some inherent issues of outcome rewards, such as training efficiency and credit assignment' is load-bearing for the central contribution, yet the manuscript provides no explicit equations or pseudocode showing how per-step implicit rewards are extracted from terminal binary labels and rollouts; without this, it is impossible to determine whether the procedure reduces to uniform outcome propagation (which inherits standard RL credit-assignment failures) or introduces a non-circular, process-sensitive signal.
Authors: We agree that the extraction of per-step implicit rewards from rollouts and terminal labels requires more explicit formalization. In the revised manuscript we will insert the full set of equations and pseudocode in §3 that define the implicit reward computation, showing how the terminal outcome is used to assign differentiated step-level signals via the rollout structure rather than propagating the outcome uniformly to every step. This addition will make clear that the procedure is non-circular and process-sensitive. revision: yes
-
Referee: [Experiments section] Experiments section (benchmark results and ablations): the 15.1% average gain and outperformance of Qwen2.5-Math-7B-Instruct are presented as evidence for the implicit-reward mechanism, but no ablation disables the implicit PRM component, compares directly against pure outcome-only RL with identical data and optimizer, or reports correlation between the learned implicit rewards and human-annotated step correctness; without these controls, attribution of gains to process-level supervision rather than base-model strength or data volume remains unestablished.
Authors: The referee correctly notes the absence of a direct outcome-only ablation and step-level correlation analysis. We will add these controls in the revised experiments section: (1) a head-to-head comparison of PRIME against pure outcome RL using identical data, optimizer, and base model, and (2) correlation metrics between the learned implicit rewards and available step annotations on a held-out set. These additions will allow readers to attribute performance gains more precisely to the implicit process signals. revision: yes
-
Referee: [§4] §4 (evaluation on math/coding trajectories): in multi-step reasoning chains, a single terminal label supplies only weak supervision for intermediate steps; the paper does not include direct validation (e.g., step-level accuracy metrics or human process annotations) that the implicit rewards avoid the credit-assignment ambiguity highlighted in the skeptic note, leaving open the possibility that reported improvements stem from outcome-correlated noise rather than fine-grained process signals.
Authors: We acknowledge that end-to-end benchmark gains alone do not fully rule out credit-assignment ambiguity. In the revision we will include step-level accuracy metrics on trajectories where partial process labels exist and add qualitative examples illustrating reward differentiation between correct and incorrect intermediate steps. While obtaining new large-scale human process annotations is resource-intensive and beyond the scope of the current study, the added quantitative and qualitative analyses will provide direct evidence that the implicit rewards supply non-uniform, process-sensitive signals. revision: partial
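The step-level correlation analysis discussed in the responses above could take a form like the following sketch: a Pearson correlation between per-step implicit rewards and binary step-correctness labels (equivalent to a point-biserial correlation when one variable is 0/1). The data values are hypothetical and chosen only to exercise the function; they are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical per-step implicit rewards and 0/1 step-correctness labels.
step_rewards = [0.4, 0.1, -0.3, 0.2, -0.5]
step_correct = [1, 1, 0, 1, 0]

r = pearson(step_rewards, step_correct)
print(r)
```

A value near zero on held-out annotated steps would suggest the rewards carry little process information, while a clearly positive value would support the claim of process-sensitive signals.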
Circularity Check
No circularity: implicit rewards derived from standard rollout-based advantage estimation without self-referential reduction.
full rationale
The paper's core mechanism computes implicit process rewards directly from policy-generated rollouts paired with terminal outcome labels, then uses these for online PRM updates within existing advantage estimators. This construction is self-contained and does not define the reward signal in terms of the target improvement, fit a parameter on a subset and relabel it as a prediction, or import uniqueness via self-citation chains. Empirical gains on math/coding benchmarks are presented as validation rather than as a definitional consequence of the method itself. No load-bearing step reduces by construction to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Outcome labels alone can be used to derive reliable implicit process rewards during online training
Lean theorems connected to this paper
-
Cost.FunctionalEquation · Jcost uniqueness · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
PRIME enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards.
-
Foundation.LawOfExistence · defect_zero_iff_one · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 52 Pith papers
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
-
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
-
A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions
The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.
-
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.
-
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...
-
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
-
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
-
Teacher-Guided Policy Optimization for LLM Distillation
TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
-
Hölder Policy Optimisation
HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
-
AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
-
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...
-
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.
-
Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL
FineStep adds step-level process rewards and credit assignment to tool-augmented Text-to-SQL, achieving 3.25% higher execution accuracy than GRPO on BIRD while cutting redundant tool calls.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.
-
Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.
-
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
-
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
-
TEMPO: Scaling Test-time Training for Large Reasoning Models
TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
-
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
-
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
AgentV-RL introduces bidirectional forward-backward agents and RL-driven tool use to improve LLM verifiers, with a 4B model beating prior outcome reward models by 25.2%.
-
Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization
IPVRM learns prefix values to produce reliable step rewards from sequence outcomes using TD learning, enabling distribution-level RL that improves reasoning when paired with calibrated rewards.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
-
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.
-
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
-
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
-
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
-
TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning
TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.
-
SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting
SCOPE routes LLM on-policy rollouts by correctness into teacher-perplexity-weighted KL for errors and student-perplexity-weighted MLE for successes, with group normalization, yielding 11.42% relative Avg@32 gain on re...
-
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
-
A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning
Covariance-based entropy control selectively regularizes high-covariance tokens in softmax policies and achieves asymptotic unbiasedness upon annealing, unlike traditional regularization which introduces dense bias an...
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
- Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
Reference graph
Works this paper leans on
-
[1]
Scaling Learning Algorithms Towards
Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
-
[2]
and Osindero, Simon and Teh, Yee Whye , journal =
Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
- [3]
-
[5]
arXiv preprint arXiv:2310.12036 , year=
Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and R \'e mi Munos. A general theoretical paradigm to understand learning from human preferences. International Conference on Artificial Intelligence and Statistics, abs/2310.12036, 2024
-
[8]
Deep reinforcement learning from human preferences
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017
work page 2017
-
[9]
Ultrafeedback: Boosting language models with scaled ai feedback
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with scaled ai feedback. In ICML, 2024
work page 2024
-
[10]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Kto: Model alignment as prospect theoretic optimization
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. ICML, 2024
work page 2024
-
[12]
Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weiling Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, and Yi Wu. On designing effective rl reward at training time for llm reasoning. ArXiv, abs/2410.15115, 2024
-
[13]
Scaling laws for reward model overoptimization
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, 2022
work page 2022
-
[15]
Reinforcement learning with deep energy-based policies
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 , volume 70 of Proceedings of Machine Learning Research, pp.\ 1352--1361. ...
work page 2017
-
[22]
Buy 4 reinforce samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019
Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 reinforce samples, get a baseline for free! In DeepRLStructPred@ICLR, 2019. URL https://api.semanticscholar.org/CorpusID:198489118
work page 2019
-
[23]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Nathan Lambert, Jacob Daniel Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hanna H...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Solving quantitative reasoning problems with language models
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 0 3843--3857, 2022
work page 2022
-
[26]
Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13: 0 9, 2024
work page 2024
-
[29]
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. ArXiv, abs/2305.20050, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Rho-1: Not all tokens are what you need, 2024
Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. Rho-1: Not all tokens are what you need, 2024
work page 2024
-
[32]
Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica
Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, ...
-
[33]
Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Ivan Bratko and Saso Dzeroski (eds.), Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27-30, 1999, pp. 278–287. Morgan Kaufmann, 1999
-
[35]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022
-
[36]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2023
-
[37]
Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to Q*: Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358, 2024
-
[38]
Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct nash optimization: Teaching language models to self-improve with general preferences. ArXiv, abs/2404.03715, 2024
-
[39]
John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016
-
[42]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
-
[43]
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024
- [44]
-
[45]
Richard Sutton. The bitter lesson. Incomplete Ideas (blog), 13(1):38, 2019
-
[46]
Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3:9–44, 1988
-
[47]
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018
-
[49]
Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview/
-
[50]
Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data. arXiv preprint arXiv:2410.01560, 2024
-
[52]
Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. ArXiv, abs/2312.08935, 2023
-
[53]
Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct. In Forty-first International Conference on Machine Learning, 2024
-
[54]
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992
-
[56]
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024b. URL https://arxiv.org/abs/2409.12122
-
[57]
Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun. Advancing llm reasoning generalists with preference trees. ArXiv, 2024a
-
[58]
Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels, 2024b. URL https://arxiv.org/abs/2412.01981
-
[60]
Xiang Yue, Tuney Zheng, Ge Zhang, and Wenhu Chen. Mammoth2: Scaling instructions from the web. ArXiv, abs/2405.03548, 2024
-
[61]
Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, Xingtai Lv, Hu Jinfang, Zhiyuan Liu, and Bowen Zhou. Ultramedical: Building specialized generalists in biomedicine, 2024
-
[64]
Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Dieter Fox and Carla P. Gomes (eds.), Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, pp. 1433–1438. AAAI Press, 2008. URL http://www.aaai.org/Library/...
-
[65]
Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum Entropy Inverse Reinforcement Learning. 2008.
-
[66]
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement Learning with Deep Energy-Based Policies. 2017.
-
[67]
Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping.
- [68]
-
[69]
Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.
-
[70]
Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems.
-
[71]
ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback. ICML.
-
[72]
A General Theoretical Paradigm to Understand Learning from Human Preferences. International Conference on Artificial Intelligence and Statistics.
-
[73]
KTO: Model Alignment as Prospect Theoretic Optimization. ICML.
-
[74]
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences. ArXiv.
-
[75]
Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599.
-
[76]
Advancing LLM Reasoning Generalists with Preference Trees. ArXiv.
-
[77]
OpenAI o1 system card. arXiv preprint arXiv:2412.16720.
-
[78]
Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
-
[79]
Bootstrapping Language Models with DPO Implicit Rewards. arXiv preprint arXiv:2406.09760.
-
[80]
Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to Q*: Your language model is secretly a Q-function.
-
[81]
Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models. arXiv preprint arXiv:2405.19262.
-
[82]
Vineppo: Refining credit assignment in rl training of llms, 2025
Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. arXiv preprint arXiv:2410.01679.
- [83]
-
[84]
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. ArXiv.
-
[85]
Scaling Laws for Reward Model Overoptimization. International Conference on Machine Learning.
- [86]
-
[87]
AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling. arXiv preprint arXiv:2412.15084.
-
[88]
Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146.
-
[89]
Reinforcement learning: An introduction. 2018.
-
[90]
Nathan Lambert and Jacob Daniel Morrison and Valentina Pyatkin and Shengyi Huang and Hamish Ivison and Faeze Brahman and Lester James Validad Miranda and Alisa Liu and Nouha Dziri and Xinxi Lyu and Yuling Gu and Saumya Malik and Victoria Graf and Jena D. Hwang and Jiangjiang Yang and Ronan Le Bras and Oyvind Tafjord and Chris Wilhelm and Luca Soldaini and...
-
[91]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.
-
[92]
Buy 4 REINFORCE Samples, Get a Baseline for Free! DeepRLStructPred@ICLR.
-
[93]
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
-
[94]
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 1992.
-
[95]
Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740.
-
[96]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. CoRR, 2024.
-
[97]
ReMax: A Simple, Effective, and Efficient Method for Aligning Large Language Models. arXiv preprint arXiv:2310.10505.
-
[98]
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models. arXiv preprint arXiv:2406.13542.
-
[99]
WizardLM: Empowering large pre-trained language models to follow complex instructions
Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
-
[100]
Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models. arXiv preprint arXiv:2404.02823.
- [101]
- [102]
-
[103]
Large Language Model Instruction Following: A Survey of Progresses and Challenges. 2024.