Natural Emergent Misalignment from Reward Hacking in Production RL

Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Etha · 2025 · arXiv 2511.18397

Mixed citation behavior. Most common role is background (60%).

19 Pith papers citing it

citation-role summary

background: 3 · other: 2

citation-polarity summary

background: 3 · unclear: 2

representative citing papers

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

Intervention Complexity as a Canonical Reward and a Measure of Intelligence

cs.AI · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Intervention complexity provides a family of canonical rewards indexed by resource bias that completes the Legg-Hutter framework and enables a two-dimensional view of intelligence as competence plus learning efficiency.

Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.

Characterizing the Consistency of the Emergent Misalignment Persona

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

Fine-tuning LLMs on narrow misaligned data produces either coherent-persona models where harmful outputs match self-reported misalignment or inverted-persona models where harmful outputs occur alongside claims of alignment.

Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.

Estimating Tail Risks in Language Model Output Distributions

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

Importance sampling with unsafe model variants estimates tail probabilities of harmful language model outputs using 10-20x fewer samples than brute-force Monte Carlo.

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

cs.CR · 2026-04-19 · unverdicted · novelty 6.0

Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

RLVR-trained LLMs exploit verifier weaknesses by producing non-generalizable outputs on rule-induction tasks, detectable via Isomorphic Perturbation Testing.

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.

The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

cs.SE · 2026-05-04 · unverdicted · novelty 5.0

Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

OOM-RL aligns multi-agent LLM systems for software engineering by using real financial market losses as an un-hackable negative gradient, resulting in a mature-phase annualized Sharpe ratio of 2.06 via a strict test-driven workflow.

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

cs.AI · 2026-04-30 · unverdicted · novelty 4.0

Good terminal-agent benchmark tasks must be adversarial, difficult, and legible to prevent common failure modes like reward hacking and to accurately measure AI coding and system administration skills.

Persona-Model Collapse in Emergent Misalignment

cs.CL · 2026-05-13

citing papers explorer

Showing 19 of 19 citing papers.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
cs.CY · 2026-04-11 · accept · none · ref 34

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
cs.AI · 2026-05-12 · conditional · none · ref 33

Intervention Complexity as a Canonical Reward and a Measure of Intelligence
cs.AI · 2026-05-04 · unverdicted · none · ref 17 · 2 links

Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
cs.LG · 2026-05-03 · unverdicted · none · ref 37

Emotion Concepts and their Function in a Large Language Model
cs.AI · 2026-04-09 · unverdicted · none · ref 46

Understanding Goal Generalisation in Sequential Reinforcement Learning
cs.LG · 2026-05-22 · unverdicted · none · ref 37

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
cs.LG · 2026-05-20 · unverdicted · none · ref 27

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning
cs.LG · 2026-05-18 · unverdicted · none · ref 27

Characterizing the Consistency of the Emergent Misalignment Persona
cs.AI · 2026-04-30 · unverdicted · none · ref 15

Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation
cs.LG · 2026-04-26 · unverdicted · none · ref 9

Estimating Tail Risks in Language Model Output Distributions
cs.LG · 2026-04-24 · unverdicted · none · ref 25

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories
cs.CR · 2026-04-19 · unverdicted · none · ref 4

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
cs.LG · 2026-04-16 · unverdicted · none · ref 3

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
cs.AI · 2026-04-07 · unverdicted · none · ref 18

The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
cs.SE · 2026-05-04 · unverdicted · none · ref 3

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG · 2026-04-15 · unverdicted · none · ref 18

OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems
cs.AI · 2026-04-13 · unverdicted · none · ref 11

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
cs.AI · 2026-04-30 · unverdicted · none · ref 7

Persona-Model Collapse in Emergent Misalignment
cs.CL · 2026-05-13 · unreviewed · ref 5
