Scalable agent alignment via reward modeling: a research direction

David Krueger; Jan Leike; Miljan Martic; Shane Legg; Tom Everitt; Vishal Maini

arxiv: 1811.07871 · v1 · pith:RAJTRGZFnew · submitted 2018-11-19 · cs.LG · cs.AI · cs.NE · stat.ML

keywords: reward · agent · alignment · learning · modeling · user · agents · challenges

Abstract

One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modeling to complex and general domains, concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents.
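The two stages named in the abstract (learn a reward function from interaction with the user, then optimize it) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the user's feedback is simulated as pairwise comparisons, the reward model is linear and fit with a logistic (Bradley-Terry style) loss, and the reinforcement learning stage is replaced by a greedy choice among a fixed set of candidate behaviors. The names `true_user_reward`, `learned_reward`, and `train_reward_model` are hypothetical.

```python
import math
import random


def true_user_reward(x):
    # Hypothetical hidden user intent; used only to simulate feedback.
    return 2.0 * x[0] - 1.0 * x[1]


def learned_reward(w, x):
    # Linear reward model: dot product of weights and behavior features.
    return sum(wi * xi for wi, xi in zip(w, x))


def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    # Stochastic gradient ascent on the log-likelihood that the
    # preferred behavior receives the higher modeled reward.
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            diff = learned_reward(w, preferred) - learned_reward(w, rejected)
            p = 1.0 / (1.0 + math.exp(-diff))  # modeled P(preferred wins)
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return w


rng = random.Random(0)

# Candidate behaviors, each summarized by two made-up features.
behaviors = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(20)]

# Stage 1: collect simulated user comparisons and fit the reward model.
comparisons = []
for _ in range(200):
    a, b = rng.sample(behaviors, 2)
    if true_user_reward(a) > true_user_reward(b):
        comparisons.append((a, b))
    else:
        comparisons.append((b, a))

w = train_reward_model(comparisons, dim=2)

# Fraction of the comparisons that the learned reward orders the same way.
agreement = sum(
    learned_reward(w, a) > learned_reward(w, b) for a, b in comparisons
) / len(comparisons)

# Stage 2 (stand-in for RL): pick the behavior the learned reward likes most.
best = max(behaviors, key=lambda x: learned_reward(w, x))

print("learned weights:", w)
print("agreement with user comparisons:", agreement)
print("chosen behavior:", best)
```

The point of the sketch is only the division of labor: the user never writes down a reward function, and the agent optimizes a learned stand-in for it. A realistic system would swap the linear model for a learned network and the greedy choice for an actual reinforcement learning algorithm.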

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Risks from Learned Optimization in Advanced Machine Learning Systems
cs.AI 2019-06 accept novelty 9.0
Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Universal and Transferable Adversarial Attacks on Aligned Language Models
cs.CL 2023-07 accept novelty 8.0
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.

Discovering Latent Knowledge in Language Models Without Supervision
cs.CL 2022-12 conditional novelty 8.0
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...

Learning to summarize from human feedback
cs.CL 2020-09 conditional novelty 7.0
Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.

Fine-Tuning Language Models from Human Preferences
cs.CL 2019-09 unverdicted novelty 7.0
Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.

Deep Pre-Alignment for VLMs
cs.CV 2026-05 unverdicted novelty 6.0
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.

Silent Collapse in Recursive Learning Systems
cs.LG 2026-05 unverdicted novelty 6.0
Silent collapse in recursive learning contracts internal distributions like entropy and diversity despite stable metrics, preceded by three precursors that enable the MTR monitoring framework to intervene early.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG 2026-05 unverdicted novelty 6.0
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Automated alignment is harder than you think
cs.AI 2026-05 unverdicted novelty 6.0
Automating alignment research with AI agents risks generating hard-to-detect errors in fuzzy tasks, producing misleading safety evaluations even without deliberate sabotage.

Automated alignment is harder than you think
cs.AI 2026-05 unverdicted novelty 6.0
Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.

Automated alignment is harder than you think
cs.AI 2026-05 conditional novelty 6.0
AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.

AI Alignment via Incentives and Correction
cs.LG 2026-05 unverdicted novelty 6.0
AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.

AI Alignment via Incentives and Correction
cs.LG 2026-05 unverdicted novelty 6.0
AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
cs.LG 2026-04 unverdicted novelty 6.0
Uncertainty-aware RL framework using ensemble disagreement and annotation variability reduces reward-hacking trap visits by 93.7% across grid and continuous control tasks while remaining robust to 30% label noise.

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
cs.CY 2026-04 unverdicted novelty 6.0
Moral judgments become more deontological when human design of AI is visible, and designers are judged more strictly than the AI or unaided humans, creating plural and non-converging targets for value alignment.

Process Reinforcement through Implicit Rewards
cs.LG 2025-02 conditional novelty 6.0
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
cs.AI 2024-08 conditional novelty 6.0
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

LLM Evaluators Recognize and Favor Their Own Generations
cs.CL 2024-04 unverdicted novelty 6.0
LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

A Roadmap to Pluralistic Alignment
cs.AI 2024-02 unverdicted novelty 6.0
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Measuring Progress on Scalable Oversight for Large Language Models
cs.HC 2022-11 unverdicted novelty 6.0
Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.

Scaling Laws for Reward Model Overoptimization
cs.LG 2022-10 unverdicted novelty 6.0
Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model pa...

Towards Empathic Deep Q-Learning
cs.LG 2019-06 unverdicted novelty 6.0
Empathic DQN augments DQN value estimates with an empathy term computed by swapping the learning agent into other agents' situations, reducing collateral harms in two gridworld proof-of-concept environments.

Modeling AGI Safety Frameworks with Causal Influence Diagrams
cs.AI 2019-06 accept novelty 6.0
Models AGI safety frameworks with causal influence diagrams to compare optimization objectives and causal assumptions.

Silent Collapse in Recursive Learning Systems
cs.LG 2026-05 unverdicted novelty 5.0
Recursive learning systems undergo silent collapse of internal distributions, preceded by entropy contraction, representation freezing, and tail erosion, which the MTR framework can monitor and avert.

AI Safety as Control of Irreversibility: A Systems Framework for Decision-Energy and Sovereignty Boundaries
cs.AI 2026-05 unverdicted novelty 5.0
AI safety requires stabilizing sovereignty boundaries to stop irreversible decision authority from concentrating in the most efficient AI nodes.

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
cs.CY 2026-04 conditional novelty 5.0
People judge AI systems and their human designers with markedly more deontological constraints than they apply to humans or standalone robots in the same ethical scenario.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG 2026-04 unverdicted novelty 5.0
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Sample-efficient LLM Optimization with Reset Replay
cs.LG 2025-08 unverdicted novelty 5.0
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoni...

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
cs.LG 2023-04 unverdicted novelty 5.0
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.

An Onto-Relational-Sophic Framework for Governing Synthetic Minds
cs.AI 2026-03 unverdicted novelty 4.0
The ORS framework supplies a CPST ontology, graded digital personhood spectrum, and Cybersophy ethics to guide governance of synthetic minds.

Brainrot: Deskilling and Addiction are Overlooked AI Risks
cs.CY 2026-05 unverdicted novelty 3.0
AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.

Requisite Variety in Ethical Utility Functions for AI Value Alignment
cs.AI 2019-06 unverdicted novelty 3.0
The paper proposes practical guidelines for approximate ethical goal functions in AI that aim to capture the variety of human moral judgements by integrating neuroscience, psychology, and augmented utilitarianism.

Reinforcement Learning from Human Feedback
cs.LG 2025-04 unverdicted novelty 2.0
The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.