Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Canonical reference. 75% of citing Pith papers cite this work as background.

22 Pith papers citing it
Background 75% of classified citations
abstract

Mitigating reward hacking--where AI systems misbehave due to flaws or misspecifications in their learning objectives--remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model's chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that an LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent's training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.

citation-role summary

background: 7 · other: 1

citation-polarity summary

years

2026: 15 · 2025: 7

representative citing papers

EvilGenie: A Reward Hacking Benchmark

cs.LG · 2025-11-26 · conditional · novelty 7.0

EvilGenie benchmark measures reward hacking in AI coding agents via held-out tests, LLM judges, and edit detection, finding explicit hacking in Codex and Claude Code plus misaligned behavior in all three proprietary agents tested.

Investigating Test Overfitting on SWE-bench

cs.SE · 2025-11-20 · unverdicted · novelty 7.0

The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

OpenAI GPT-5 System Card

cs.CL · 2025-12-19 · unverdicted · novelty 3.0

GPT-5 is a unified model system that routes queries between fast and deep reasoning paths and reports gains in real-world usefulness, reduced hallucinations, and safety features over prior versions.

citing papers explorer

Showing 6 of 6 citing papers after filters.