Scalable agent alignment via reward modeling: a research direction
One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modeling to complex and general domains, concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents.
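As a rough illustration of the two steps named in the abstract (learn a reward function from user feedback, then optimize against it), the sketch below fits a linear reward model to pairwise preferences using a Bradley-Terry likelihood. This is only one common way to set up reward modeling and is not taken from the paper. The simulated user, the two-dimensional outcome features, and the names `fit_reward_model` and `reward` are all hypothetical. The final step picks the best candidate by learned reward, which is a simple stand-in for the reinforcement learning stage rather than an RL algorithm.

```python
import math
import random


def reward(w, x):
    """Linear reward: dot product of weights and outcome features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_reward_model(comparisons, n_features, lr=0.5, epochs=200):
    """Fit reward weights from (preferred, rejected) feature pairs by
    gradient ascent on the Bradley-Terry log-likelihood."""
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for better, worse in comparisons:
            diff = [b - c for b, c in zip(better, worse)]
            margin = reward(w, diff)
            p = 1.0 / (1.0 + math.exp(-margin))  # model's P(better is preferred)
            for i, di in enumerate(diff):
                grad[i] += (1.0 - p) * di
        w = [wi + lr * gi / len(comparisons) for wi, gi in zip(w, grad)]
    return w


# Hypothetical user whose implicit objective likes feature 0 and dislikes feature 1.
# The agent never sees true_w; it only sees which of two outcomes the user prefers.
rng = random.Random(0)
true_w = [1.0, -1.0]
outcomes = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]

comparisons = []
for _ in range(200):
    a, b = rng.sample(outcomes, 2)
    if reward(true_w, a) >= reward(true_w, b):
        comparisons.append((a, b))
    else:
        comparisons.append((b, a))

learned_w = fit_reward_model(comparisons, n_features=2)

# Stand-in for the RL step: choose the candidate with the highest learned reward.
best = max(outcomes, key=lambda x: reward(learned_w, x))
print("learned weights:", learned_w)
print("chosen outcome:", best)
```

In this toy setting the learned weights only need to recover the direction of the user's preferences, not their scale, since pairwise comparisons carry no information about absolute reward magnitude.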
Forward citations
Cited by 33 Pith papers
-
Risks from Learned Optimization in Advanced Machine Learning Systems
Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.
-
Universal and Transferable Adversarial Attacks on Aligned Language Models
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
-
Discovering Latent Knowledge in Language Models Without Supervision
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
-
Learning to summarize from human feedback
Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
-
Fine-Tuning Language Models from Human Preferences
Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
-
Deep Pre-Alignment for VLMs
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
-
Silent Collapse in Recursive Learning Systems
Silent collapse in recursive learning contracts internal distributions like entropy and diversity despite stable metrics, preceded by three precursors that enable the MTR monitoring framework to intervene early.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Automated alignment is harder than you think
Automating alignment research with AI agents risks generating hard-to-detect errors in fuzzy tasks, producing misleading safety evaluations even without deliberate sabotage.
-
Automated alignment is harder than you think
Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.
-
Automated alignment is harder than you think
AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.
-
AI Alignment via Incentives and Correction
AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.
-
AI Alignment via Incentives and Correction
AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...
-
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
Uncertainty-aware RL framework using ensemble disagreement and annotation variability reduces reward-hacking trap visits by 93.7% across grid and continuous control tasks while remaining robust to 30% label noise.
-
The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
Moral judgments become more deontological when human design of AI is visible, and designers are judged more strictly than the AI or unaided humans, creating plural and non-converging targets for value alignment.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
LLM Evaluators Recognize and Favor Their Own Generations
LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.
-
A Roadmap to Pluralistic Alignment
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
-
Measuring Progress on Scalable Oversight for Large Language Models
Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.
-
Scaling Laws for Reward Model Overoptimization
Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model pa...
-
Towards Empathic Deep Q-Learning
Empathic DQN augments DQN value estimates with an empathy term computed by swapping the learning agent into other agents' situations, reducing collateral harms in two gridworld proof-of-concept environments.
-
Modeling AGI Safety Frameworks with Causal Influence Diagrams
Models AGI safety frameworks with causal influence diagrams to compare optimization objectives and causal assumptions.
-
Silent Collapse in Recursive Learning Systems
Recursive learning systems undergo silent collapse of internal distributions, preceded by entropy contraction, representation freezing, and tail erosion, which the MTR framework can monitor and avert.
-
AI Safety as Control of Irreversibility: A Systems Framework for Decision-Energy and Sovereignty Boundaries
AI safety requires stabilizing sovereignty boundaries to stop irreversible decision authority from concentrating in the most efficient AI nodes.
-
The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
People judge AI systems and their human designers with markedly more deontological constraints than they apply to humans or standalone robots in the same ethical scenario.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
-
Sample-efficient LLM Optimization with Reset Replay
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoni...
-
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.
-
An Onto-Relational-Sophic Framework for Governing Synthetic Minds
The ORS framework supplies a CPST ontology, graded digital personhood spectrum, and Cybersophy ethics to guide governance of synthetic minds.
-
Brainrot: Deskilling and Addiction are Overlooked AI Risks
AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.
-
Requisite Variety in Ethical Utility Functions for AI Value Alignment
The paper proposes practical guidelines for approximate ethical goal functions in AI that aim to capture the variety of human moral judgements by integrating neuroscience, psychology, and augmented utilitarianism.
-
Reinforcement Learning from Human Feedback
The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.