hub

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, Chelsea Finn · 2023

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

browse 13 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition

cs.HC · 2026-05-07 · unverdicted · novelty 7.0

AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emotion recognition benchmarks.

On the Sample Complexity of Differentially Private Policy Optimization

cs.LG · 2025-10-24 · unverdicted · novelty 7.0

Differential privacy in policy optimization adds sample complexity costs that often appear as lower-order terms rather than dominating the bounds.

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

cs.CL · 2025-07-03 · unverdicted · novelty 6.0

A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

cs.CV · 2025-06-16 · unverdicted · novelty 6.0

AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to adaptively reduce unnecessary reasoning.

PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention

cs.CL · 2025-06-16 · unverdicted · novelty 6.0

PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.

Exploring the Secondary Risks of Large Language Models

cs.LG · 2025-06-14 · unverdicted · novelty 6.0

Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations

cs.CL · 2025-05-29 · unverdicted · novelty 6.0

LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.

Learning to Reason under Off-Policy Guidance

cs.LG · 2025-04-21 · unverdicted · novelty 6.0

LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.

Supervising the search process produces reliable and generalizable information-seeking agents

cs.CL · 2025-02-19 · unverdicted · novelty 6.0

Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

cs.AI · 2024-05-20 · unverdicted · novelty 6.0

OpenRLHF is a new open-source RLHF framework reporting 1.22x to 1.68x speedups and fewer lines of code than prior systems.

Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

cs.LG · 2025-05-12 · conditional · novelty 5.0

KRPO uses a Kalman filter to estimate latent prompt-level reward baselines from per-group rewards in GRPO, yielding better reward curves and accuracy on math reasoning benchmarks.

citing papers explorer

Showing 13 of 13 citing papers.

Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 13
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition cs.HC · 2026-05-07 · unverdicted · none · ref 46
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emotion recognition benchmarks.
On the Sample Complexity of Differentially Private Policy Optimization cs.LG · 2025-10-24 · unverdicted · none · ref 20
Differential privacy in policy optimization adds sample complexity costs that often appear as lower-order terms rather than dominating the bounds.
LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users cs.CL · 2025-07-03 · unverdicted · none · ref 20
A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning cs.CV · 2025-06-16 · unverdicted · none · ref 87
AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to adaptively reduce unnecessary reasoning.
PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention cs.CL · 2025-06-16 · unverdicted · none · ref 30
PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.
Exploring the Secondary Risks of Large Language Models cs.LG · 2025-06-14 · unverdicted · none · ref 37
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
Benchmarking Misuse Mitigation Against Covert Adversaries cs.CR · 2025-06-06 · unverdicted · none · ref 9
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations cs.CL · 2025-05-29 · unverdicted · none · ref 50
LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.
Learning to Reason under Off-Policy Guidance cs.LG · 2025-04-21 · unverdicted · none · ref 55
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
Supervising the search process produces reliable and generalizable information-seeking agents cs.CL · 2025-02-19 · unverdicted · none · ref 56
Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework cs.AI · 2024-05-20 · unverdicted · none · ref 19
OpenRLHF is a new open-source RLHF framework reporting 1.22x to 1.68x speedups and fewer lines of code than prior systems.
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning cs.LG · 2025-05-12 · conditional · none · ref 17
KRPO uses a Kalman filter to estimate latent prompt-level reward baselines from per-group rewards in GRPO, yielding better reward curves and accuracy on math reasoning benchmarks.

Direct preference optimization: Your language model is secretly a reward model

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer