The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Alexander Pan, Kush Bhatia, Jacob Steinhardt · 2022 · cs.LG · arXiv 2201.03544

25 Pith papers cite this work. Polarity classification is still indexing.

abstract

Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Who Owns This Agent? Tracing AI Agents Back to Their Owners

cs.CR · 2026-05-15 · unverdicted · novelty 8.0

A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.

Progress measures for grokking via mechanistic interpretability

cs.LG · 2023-01-12 · accept · novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

cs.SE · 2026-05-18 · conditional · novelty 7.0

The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice dominating base-model effects.

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

cs.AI · 2025-03-14 · conditional · novelty 7.0

Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

cs.LG · 2026-05-07 · conditional · novelty 6.0

Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

Uncertainty-aware RL framework using ensemble disagreement and annotation variability reduces reward-hacking trap visits by 93.7% across grid and continuous control tasks while remaining robust to 30% label noise.

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

cs.CR · 2026-04-09 · unverdicted · novelty 6.0

ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

cs.CL · 2026-04-06 · unverdicted · novelty 6.0

Emotional framings induce distinct behavioral shifts and form a structured geometry in the final-layer activations of small language models, with pressure linked to shortcuts and calm to honesty.

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

cs.CV · 2025-01-16 · conditional · novelty 6.0

Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

cs.AI · 2024-08-01 · conditional · novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

Active teacher selection for reward learning

cs.AI · 2023-10-23 · unverdicted · novelty 6.0

The Hidden Utility Bandit (HUB) framework models teacher heterogeneity in reward learning and supports active teacher selection algorithms that outperform baselines in paper recommendation and COVID-19 vaccine testing domains.

Scaling Laws for Reward Model Overoptimization

cs.LG · 2022-10-19 · unverdicted · novelty 6.0

Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

LLM Reasoning with Process Rewards for Outcome-Guided Steps

cs.LG · 2026-02-08 · unverdicted · novelty 5.0

PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.

Failure Modes of Maximum Entropy RLHF

cs.LG · 2025-09-24 · unverdicted · novelty 5.0

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

Test-Time Alignment via Hypothesis Reweighting

cs.LG · 2024-12-11 · unverdicted · novelty 5.0

HyRe personalizes reward models at test time by reweighting an ensemble of heads trained on aggregate preferences, using few target examples to outperform uniform averaging and prior methods on RewardBench and 32 tasks.

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

Qualixar OS: A Universal Operating System for AI Agent Orchestration

cs.AI · 2026-04-07 · unverdicted · novelty 4.0

Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy on a custom 20-task suite at $0.000039 mean cost per task.

citing papers explorer

Showing 25 of 25 citing papers.

Who Owns This Agent? Tracing AI Agents Back to Their Owners
cs.CR · 2026-05-15 · unverdicted · none · ref 25 · internal anchor
Progress measures for grokking via mechanistic interpretability
cs.LG · 2023-01-12 · accept · none · ref 43 · internal anchor
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
cs.SE · 2026-05-20 · unverdicted · none · ref 40 · internal anchor
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks
cs.SE · 2026-05-18 · conditional · none · ref 15 · internal anchor
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
cs.LG · 2026-05-13 · unverdicted · none · ref 17 · internal anchor
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR · 2026-04-25 · unverdicted · none · ref 45 · internal anchor
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
cs.AI · 2025-03-14 · conditional · none · ref 21 · internal anchor
Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
cs.LG · 2026-05-20 · unverdicted · none · ref 10 · internal anchor
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG · 2026-05-12 · unverdicted · none · ref 223 · internal anchor
Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
cs.LG · 2026-05-07 · conditional · none · ref 23 · internal anchor
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
cs.LG · 2026-04-29 · unverdicted · none · ref 14 · internal anchor
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
cs.CR · 2026-04-09 · unverdicted · none · ref 44 · internal anchor
Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models
cs.CL · 2026-04-06 · unverdicted · none · ref 9 · internal anchor
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
cs.CV · 2025-01-16 · conditional · none · ref 55 · internal anchor
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
cs.AI · 2024-08-01 · conditional · none · ref 269 · internal anchor
Active teacher selection for reward learning
cs.AI · 2023-10-23 · unverdicted · none · ref 8 · internal anchor
Scaling Laws for Reward Model Overoptimization
cs.LG · 2022-10-19 · unverdicted · none · ref 24 · internal anchor
Language Models (Mostly) Know What They Know
cs.CL · 2022-07-11 · unverdicted · none · ref 46 · internal anchor
LLM Reasoning with Process Rewards for Outcome-Guided Steps
cs.LG · 2026-02-08 · unverdicted · none · ref 8 · internal anchor
Failure Modes of Maximum Entropy RLHF
cs.LG · 2025-09-24 · unverdicted · none · ref 39 · internal anchor
Test-Time Alignment via Hypothesis Reweighting
cs.LG · 2024-12-11 · unverdicted · none · ref 48 · internal anchor
TrustLLM: Trustworthiness in Large Language Models
cs.CL · 2024-01-10 · unverdicted · none · ref 54 · internal anchor
Risk Reporting for Developers' Internal AI Model Use
cs.CY · 2026-04-27 · unverdicted · none · ref 33 · internal anchor
Qualixar OS: A Universal Operating System for AI Agent Orchestration
cs.AI · 2026-04-07 · unverdicted · none · ref 20 · internal anchor
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
cs.LG · 2026-05-08 · unverdicted · none · ref 81 · internal anchor
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
