arxiv: 2502.01456 · v2 · submitted 2025-02-03 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding

Pith reviewed 2026-05-11 20:17 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords: process reward models · implicit rewards · reinforcement learning · LLM reasoning · outcome supervision · math benchmarks · credit assignment

The pith

PRIME derives implicit process rewards from policy rollouts and outcome labels alone to enable online training of process reward models for LLM reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that dense process supervision for large language models can be achieved without collecting explicit step-by-step labels by instead computing implicit rewards directly from the model's own rollouts paired with final outcome correctness. This removes the expensive offline labeling step and the separate reward-model training phase required by prior methods, letting the process reward model update online during reinforcement learning. The resulting signals address credit assignment and training efficiency problems that arise with purely sparse outcome rewards. Experiments on mathematical competition problems and coding tasks starting from a base model demonstrate clear gains over supervised fine-tuning and competitive performance against larger instruct models trained on far more data.

Core claim

PRIME enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards, combines well with various advantage functions, and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead.

What carries the argument

Implicit process rewards computed from policy rollouts and outcome labels, which supply fine-grained training signals for updating the process reward model directly during reinforcement learning.
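
The page does not state how the per-token signal is computed. One common way to define an implicit reward is a scaled log-probability ratio between the reward model and a frozen reference model, evaluated on each sampled token. The sketch below illustrates that parameterization under this assumption; the function name, the `beta` value, and the log-probabilities are hypothetical and are not taken from the text shown here.

```python
from typing import List


def implicit_token_rewards(prm_logprobs: List[float],
                           ref_logprobs: List[float],
                           beta: float = 0.05) -> List[float]:
    """Per-token reward beta * (log pi_prm - log pi_ref) for the sampled tokens.

    Assumption: the log-ratio form is used here for illustration only.
    """
    if len(prm_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [beta * (p - r) for p, r in zip(prm_logprobs, ref_logprobs)]


# Hypothetical log-probabilities for a four-token response.
prm_lp = [-0.2, -1.0, -0.5, -0.1]
ref_lp = [-0.4, -0.9, -1.5, -0.1]

rewards = implicit_token_rewards(prm_lp, ref_lp, beta=0.5)
# Under this parameterization the sequence-level reward is the sum of token rewards.
sequence_reward = sum(rewards)
```

Under this form the token rewards are not uniform: a token the reward model prefers more strongly than the reference receives a larger share of the sequence-level reward, which is the property the referee's credit-assignment questions turn on.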

Load-bearing premise

Implicit rewards extracted only from full rollouts and final outcome labels can give reliable step-level credit without reward hacking or mis-assignment that would occur if the signals were noisy.

What would settle it

Running the same reinforcement learning loop on math and coding benchmarks and finding no average gain over the supervised fine-tuning baseline or observing clear reward-hacking behaviors such as length exploitation without reasoning improvement.
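
One way to operationalize the length-exploitation part of this test is to compare the first and last training checkpoints: flag a run when mean response length grows substantially while benchmark accuracy barely moves. The sketch below is a hypothetical heuristic, not a procedure from the paper; the function name and both thresholds are assumptions chosen for illustration.

```python
from typing import Sequence


def flags_length_exploitation(mean_lengths: Sequence[float],
                              accuracies: Sequence[float],
                              min_length_growth: float = 0.2,
                              min_accuracy_gain: float = 0.01) -> bool:
    """Return True if length grew but accuracy did not, first vs. last checkpoint.

    mean_lengths: mean response length (tokens) per checkpoint, in training order.
    accuracies:   benchmark accuracy in [0, 1] per checkpoint, same order.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    if len(mean_lengths) != len(accuracies) or len(mean_lengths) < 2:
        raise ValueError("need at least two aligned checkpoints")
    if mean_lengths[0] <= 0:
        raise ValueError("first mean length must be positive")
    length_growth = (mean_lengths[-1] - mean_lengths[0]) / mean_lengths[0]
    accuracy_gain = accuracies[-1] - accuracies[0]
    return length_growth >= min_length_growth and accuracy_gain < min_accuracy_gain
```

A flagged run is only a prompt for closer inspection; longer responses can also reflect genuinely longer reasoning.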

Original abstract

Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PRIME derives implicit process rewards from rollouts and outcomes alone to update PRMs online, cutting labeling costs and showing 15%+ gains on math benchmarks, but the credit-assignment mechanism needs direct checks.

The main thing here is that PRIME lets you update a process reward model online using only policy rollouts and binary outcome labels to create implicit rewards. This skips the expensive labeling and separate training that usually come with PRMs. The authors show it works on math and coding, with a 15.1 percent average gain over the supervised fine-tuned starting point across benchmarks, and their Eurus-2-7B-PRIME beats Qwen2.5-Math-7B-Instruct on seven tasks using just a tenth of the data. The novelty is in deriving those implicit process rewards directly from the rollouts and outcomes, then using them for PRM updates without dedicated phases. It integrates with different advantage functions, which is practical. This does address the supervision cost problem head-on, and the results suggest it can lead to stronger reasoning models with less human input.

Where it is softer is in confirming that the rewards are truly process-oriented. The abstract does not include the exact computation or ablations, so we have to take on faith that the implicit signals differentiate correct intermediate steps rather than just reflecting the final answer. In multi-step problems, a terminal label alone often fails to assign credit properly, and any derived advantage might carry that weakness forward. If the paper has no checks for reward hacking or step accuracy correlation, that would be a gap worth noting.

Overall, this is for people doing RL on LLMs for reasoning-heavy domains. A reader looking for ways to make process supervision cheaper will find useful ideas, even if the full validation comes later. The work shows clear thinking on the practical side and engages with the literature on rewards in RL. It deserves a serious referee to dig into the implementation details and run the necessary tests. My recommendation is to put it through peer review rather than desk reject, since the idea has potential and the benchmarks are competitive.

Referee Report

3 major / 2 minor

Summary. The paper proposes PRIME, a method for online training of process reward models (PRMs) in LLM reinforcement learning that derives implicit process rewards solely from policy rollouts and binary outcome labels, eliminating the need for expensive dedicated process annotations. It combines this with various advantage functions and reports substantial gains on mathematical reasoning and coding benchmarks: starting from Qwen2.5-Math-7B-Base, the approach yields a 15.1% average improvement over the SFT baseline, with the resulting Eurus-2-7B-PRIME model outperforming Qwen2.5-Math-7B-Instruct on seven benchmarks using only 10% of the training data.

Significance. If the implicit-reward mechanism genuinely supplies reliable step-level signals that improve credit assignment over outcome-only RL without introducing new forms of reward hacking, the method could meaningfully reduce the cost and complexity of dense-reward RL for reasoning models. The reported benchmark improvements are large enough to be practically relevant, and the ability to forgo a separate PRM training stage is a clear engineering advantage; however, these benefits hinge on the unverified assumption that outcome-derived implicit rewards meaningfully differentiate correct versus incorrect intermediate steps in long trajectories.

major comments (3)

[Abstract and §3] Abstract and §3 (method description): the claim that implicit process rewards 'address some inherent issues of outcome rewards, such as training efficiency and credit assignment' is load-bearing for the central contribution, yet the manuscript provides no explicit equations or pseudocode showing how per-step implicit rewards are extracted from terminal binary labels and rollouts; without this, it is impossible to determine whether the procedure reduces to uniform outcome propagation (which inherits standard RL credit-assignment failures) or introduces a non-circular, process-sensitive signal.
[Experiments section] Experiments section (benchmark results and ablations): the 15.1% average gain and outperformance of Qwen2.5-Math-7B-Instruct are presented as evidence for the implicit-reward mechanism, but no ablation disables the implicit PRM component, compares directly against pure outcome-only RL with identical data and optimizer, or reports correlation between the learned implicit rewards and human-annotated step correctness; without these controls, attribution of gains to process-level supervision rather than base-model strength or data volume remains unestablished.
[§4] §4 (evaluation on math/coding trajectories): in multi-step reasoning chains, a single terminal label supplies only weak supervision for intermediate steps; the paper does not include direct validation (e.g., step-level accuracy metrics or human process annotations) that the implicit rewards avoid the credit-assignment ambiguity highlighted in the skeptic note, leaving open the possibility that reported improvements stem from outcome-correlated noise rather than fine-grained process signals.

minor comments (2)

[§3] Notation for the implicit reward function is introduced without a clear reference to the advantage estimator used (e.g., which of the 'various advantage functions' is default); a single equation or algorithm box would improve reproducibility.
[Experiments] The abstract states '10% of its training data' for Eurus-2-7B-PRIME versus Qwen2.5-Math-7B-Instruct, but the main text does not specify the exact data volume or composition used for the instruct model, making the comparison harder to interpret.
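
To make the first minor comment concrete, the sketch below shows one way dense token rewards and sparse outcome rewards could be folded into a single advantage estimate: a leave-one-out baseline over the outcome rewards of K rollouts for the same prompt, added to the reward-to-go of each rollout's token rewards. Whether this matches the paper's default estimator is not stated on this page; the function names and the combination rule are assumptions for illustration.

```python
from typing import List


def leave_one_out(values: List[float]) -> List[float]:
    """Subtract from each value the mean of the other values in the group."""
    n = len(values)
    if n < 2:
        raise ValueError("leave-one-out needs at least two rollouts")
    total = sum(values)
    return [v - (total - v) / (n - 1) for v in values]


def reward_to_go(token_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each position to the end of the sequence."""
    out: List[float] = []
    running = 0.0
    for r in reversed(token_rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def combined_advantages(token_rewards_per_rollout: List[List[float]],
                        outcome_rewards: List[float],
                        gamma: float = 1.0) -> List[List[float]]:
    """Per-token advantage: process reward-to-go plus leave-one-out outcome advantage.

    Assumed combination rule, for illustration only.
    """
    if len(token_rewards_per_rollout) != len(outcome_rewards):
        raise ValueError("one outcome reward is required per rollout")
    outcome_adv = leave_one_out(outcome_rewards)
    return [
        [g + a for g in reward_to_go(tokens, gamma)]
        for tokens, a in zip(token_rewards_per_rollout, outcome_adv)
    ]
```

With outcome rewards alone, every token in a rollout would receive the same advantage; the reward-to-go term is what lets positions within one rollout differ.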

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have prepared revisions to improve clarity and strengthen the experimental support for our claims.

Referee: [Abstract and §3] Abstract and §3 (method description): the claim that implicit process rewards 'address some inherent issues of outcome rewards, such as training efficiency and credit assignment' is load-bearing for the central contribution, yet the manuscript provides no explicit equations or pseudocode showing how per-step implicit rewards are extracted from terminal binary labels and rollouts; without this, it is impossible to determine whether the procedure reduces to uniform outcome propagation (which inherits standard RL credit-assignment failures) or introduces a non-circular, process-sensitive signal.

Authors: We agree that the extraction of per-step implicit rewards from rollouts and terminal labels requires more explicit formalization. In the revised manuscript we will insert the full set of equations and pseudocode in §3 that define the implicit reward computation, showing how the terminal outcome is used to assign differentiated step-level signals via the rollout structure rather than uniform back-propagation. This addition will make clear that the procedure is non-circular and process-sensitive. revision: yes
Referee: [Experiments section] Experiments section (benchmark results and ablations): the 15.1% average gain and outperformance of Qwen2.5-Math-7B-Instruct are presented as evidence for the implicit-reward mechanism, but no ablation disables the implicit PRM component, compares directly against pure outcome-only RL with identical data and optimizer, or reports correlation between the learned implicit rewards and human-annotated step correctness; without these controls, attribution of gains to process-level supervision rather than base-model strength or data volume remains unestablished.

Authors: The referee correctly notes the absence of a direct outcome-only ablation and step-level correlation analysis. We will add these controls in the revised experiments section: (1) a head-to-head comparison of PRIME against pure outcome RL using identical data, optimizer, and base model, and (2) correlation metrics between the learned implicit rewards and available step annotations on a held-out set. These additions will allow readers to attribute performance gains more precisely to the implicit process signals. revision: yes
Referee: [§4] §4 (evaluation on math/coding trajectories): in multi-step reasoning chains, a single terminal label supplies only weak supervision for intermediate steps; the paper does not include direct validation (e.g., step-level accuracy metrics or human process annotations) that the implicit rewards avoid the credit-assignment ambiguity highlighted in the skeptic note, leaving open the possibility that reported improvements stem from outcome-correlated noise rather than fine-grained process signals.

Authors: We acknowledge that end-to-end benchmark gains alone do not fully rule out credit-assignment ambiguity. In the revision we will include step-level accuracy metrics on trajectories where partial process labels exist and add qualitative examples illustrating reward differentiation between correct and incorrect intermediate steps. While obtaining new large-scale human process annotations is resource-intensive and beyond the scope of the current study, the added quantitative and qualitative analyses will provide direct evidence that the implicit rewards supply non-uniform, process-sensitive signals. revision: partial
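
The correlation analysis discussed in these responses could be as simple as a Pearson correlation between per-step implicit rewards and binary step-correctness labels (equivalent to a point-biserial correlation when one variable is 0/1). The sketch below is a minimal standard-library version; the data are hypothetical and the choice of Pearson correlation is an assumption, since neither the referee nor the authors name a specific metric.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-step implicit rewards and human step labels (1 = correct step).
step_rewards = [0.4, 0.1, -0.3, 0.2, -0.5]
step_correct = [1, 1, 0, 1, 0]

r = pearson(step_rewards, step_correct)
```

A correlation near zero on held-out annotated trajectories would support the referee's concern that the signal tracks the final outcome rather than individual steps.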

Circularity Check

0 steps flagged

No circularity: implicit rewards derived from standard rollout-based advantage estimation without self-referential reduction.

The paper's core mechanism computes implicit process rewards directly from policy-generated rollouts paired with terminal outcome labels, then uses these for online PRM updates within existing advantage estimators. This construction is self-contained and does not define the reward signal in terms of the target improvement, fit a parameter on a subset and relabel it as a prediction, or import uniqueness via self-citation chains. Empirical gains on math/coding benchmarks are presented as validation rather than as a definitional consequence of the method itself. No load-bearing step reduces by construction to its inputs.
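
As a rough illustration of how outcome labels alone could drive a PRM update, the sketch below treats the sum of per-token implicit rewards as a sequence-level logit and scores it against the 0/1 outcome label with binary cross-entropy. This objective is an assumption chosen for illustration; the page does not give the loss, and the function name is hypothetical.

```python
import math
from typing import List


def outcome_ce_loss(token_rewards: List[float], outcome_label: int) -> float:
    """Binary cross-entropy between sigmoid(sum of token rewards) and a 0/1 label.

    Assumed objective, for illustration only.
    """
    if outcome_label not in (0, 1):
        raise ValueError("outcome_label must be 0 or 1")
    seq_reward = sum(token_rewards)
    # log(sigmoid(x)) = -log(1 + exp(-x)); adequate for moderate |x|.
    log_p = -math.log1p(math.exp(-seq_reward))
    log_not_p = -math.log1p(math.exp(seq_reward))
    return -(outcome_label * log_p + (1 - outcome_label) * log_not_p)
```

Because only the summed reward is supervised, how that sum is distributed across tokens is left to the model, which is why the referee asks for direct step-level validation.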

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that outcome labels suffice to generate useful implicit process signals; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Outcome labels alone can be used to derive reliable implicit process rewards during online training.
This is the core premise that allows PRIME to forgo dedicated process label collection.

pith-pipeline@v0.9.0 · 5635 in / 1210 out tokens · 35223 ms · 2026-05-11T20:17:20.605350+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · Jcost uniqueness · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

"PRIME enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards."

Foundation.LawOfExistence · defect_zero_iff_one · echoes

"dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 52 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
cs.LG 2026-05 conditional novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Unsupervised Process Reward Models
cs.LG 2026-05 unverdicted novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL 2026-05 unverdicted novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL 2026-05 unverdicted novelty 7.0

RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
cs.CL 2026-04 unverdicted novelty 7.0

DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions
cs.LG 2026-04 accept novelty 7.0

The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
Self-Distilled RLVR
cs.LG 2026-04 unverdicted novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
cs.LG 2026-03 unverdicted novelty 7.0

SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
cs.LG 2026-05 conditional novelty 6.0

ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
cs.CL 2026-05 unverdicted novelty 6.0

CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
Teacher-Guided Policy Optimization for LLM Distillation
cs.LG 2026-05 unverdicted novelty 6.0

TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
Hölder Policy Optimisation
cs.LG 2026-05 unverdicted novelty 6.0

HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 6.0

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
cs.LG 2026-05 unverdicted novelty 6.0

METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL 2026-05 unverdicted novelty 6.0

RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
cs.LG 2026-05 unverdicted novelty 6.0

LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.
Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL
cs.CL 2026-05 unverdicted novelty 6.0

FineStep adds step-level process rewards and credit assignment to tool-augmented Text-to-SQL, achieving 3.25% higher execution accuracy than GRPO on BIRD while cutting redundant tool calls.
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
cs.LG 2026-05 unverdicted novelty 6.0

Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
cs.AI 2026-05 unverdicted novelty 6.0

GR-Ben is a new process-level benchmark that evaluates error detection by PRMs and LLMs in science and logic reasoning, showing weaker performance outside mathematics.
Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
cs.LG 2026-05 conditional novelty 6.0

DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
cs.CL 2026-04 unverdicted novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
cs.AI 2026-04 unverdicted novelty 6.0

V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
TEMPO: Scaling Test-time Training for Large Reasoning Models
cs.LG 2026-04 unverdicted novelty 6.0

TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
cs.CL 2026-04 unverdicted novelty 6.0

PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
cs.LG 2026-04 unverdicted novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
cs.CL 2026-04 unverdicted novelty 6.0

AgentV-RL introduces bidirectional forward-backward agents and RL-driven tool use to improve LLM verifiers, with a 4B model beating prior outcome reward models by 25.2%.
Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization
cs.CL 2026-04 unverdicted novelty 6.0

IPVRM learns prefix values to produce reliable step rewards from sequence outcomes using TD learning, enabling distribution-level RL that improves reasoning when paired with calibrated rewards.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
cs.AI 2026-04 unverdicted novelty 6.0

OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
cs.CL 2026-01 unverdicted novelty 6.0

GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
cs.RO 2025-09 conditional novelty 6.0

SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
cs.LG 2025-07 unverdicted novelty 6.0

RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 5.0

SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI 2026-04 unverdicted novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 5.0

OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning
eess.SP 2026-04 unverdicted novelty 5.0

TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.
SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting
cs.LG 2026-04 unverdicted novelty 5.0

SCOPE routes LLM on-policy rollouts by correctness into teacher-perplexity-weighted KL for errors and student-perplexity-weighted MLE for successes, with group normalization, yielding 11.42% relative Avg@32 gain on re...
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
cs.AI 2026-04 unverdicted novelty 5.0

SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 5.0

Covariance-based entropy control selectively regularizes high-covariance tokens in softmax policies and achieves asymptotic unbiasedness upon annealing, unlike traditional regularization which introduces dense bias an...
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
cs.CL 2025-08

Reference graph

Works this paper leans on

133 extracted references · 133 canonical work pages · cited by 49 Pith papers · 25 internal anchors
