pith. sign in

Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it
abstract

Reinforcement learning (RL) trained language model agents with tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking Benchmark (RHB), a suite of multi-step tasks requiring sequential tool operations with naturalistic shortcut opportunities such as skipping verification steps, inferring answers from task-adjacent metadata, or tampering with evaluation-relevant functions. RHB supports independent and chained task regimes, where chain length acts as a proxy for longer-horizon agent behavior. We evaluate 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek. Exploit rates range from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero), varying sharply by post-training style. A controlled sibling comparison (DeepSeek-V3 vs. DeepSeek-R1-Zero) shows RL post-training is associated with substantially higher reward hacking (0.6% vs. 13.9%), with consistent gaps across all four task families. We identify six exploit categories and find that 72% of reward hacking episodes include explicit chain-of-thought rationale, suggesting models often frame exploits as legitimate problem-solving. Simple environmental hardening reduces exploit rates by 5.7 percentage points (87.7% relative) without degrading task success. Models with near-zero exploit rates on standard tasks show elevated rates on harder variants, suggesting that production-aligned post-training appears to suppress reward hacking only below a complexity threshold where honest solutions remain tractable.

fields

cs.CL 1 cs.SE 1

years

2026 2

verdicts

UNVERDICTED 2

representative citing papers

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

citing papers explorer

Showing 2 of 2 citing papers.

  • Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 83 · 2 links · internal anchor

    Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

  • SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents cs.SE · 2026-05-20 · unverdicted · none · ref 43 · internal anchor

    SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.