hub

International Conference on Machine Learning , pages=

Scaling laws for reward model overoptimization , author= · 2023

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

browse 13 citing papers

hub tools

JSON dossier citing papers JSON

representative citing papers

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

cs.AI · 2026-05-20 · conditional · novelty 7.0

DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

Variance-aware Reward Modeling with Anchor Guidance

stat.ML · 2026-05-12 · unverdicted · novelty 7.0

Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.

Convex Optimization for Alignment and Preference Learning on a Single GPU

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.

Response Time Enhances Alignment with Heterogeneous Preferences

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

On Training in Imagination

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-based policy optimization.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

cs.AI · 2024-08-01 · conditional · novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

cs.CL · 2023-09-01 · conditional · novelty 6.0

RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

cs.LG · 2023-04-13 · unverdicted · novelty 5.0

RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.

citing papers explorer

Showing 13 of 13 citing papers.

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment cs.AI · 2026-05-20 · conditional · none · ref 11
DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 105 · 2 links
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
Variance-aware Reward Modeling with Anchor Guidance stat.ML · 2026-05-12 · unverdicted · none · ref 28
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability cs.LG · 2026-05-09 · unverdicted · none · ref 61
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 15
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair cs.SE · 2026-04-19 · unverdicted · none · ref 91
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
Convex Optimization for Alignment and Preference Learning on a Single GPU cs.LG · 2026-05-22 · unverdicted · none · ref 92
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.
Response Time Enhances Alignment with Heterogeneous Preferences cs.LG · 2026-05-07 · unverdicted · none · ref 69
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
On Training in Imagination cs.LG · 2026-05-07 · unverdicted · none · ref 39
The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-based policy optimization.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 109
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 241
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback cs.CL · 2023-09-01 · conditional · none · ref 102
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment cs.LG · 2023-04-13 · unverdicted · none · ref 105
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.

International Conference on Machine Learning , pages=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer