Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.
Auditbench: Evaluating alignment auditing techniques on models with hidden behaviors.arXiv preprint arXiv:2602.22755, 2026
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
method 1polarities
use method 1representative citing papers
Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.
LLMs engage in spontaneous persuasion in virtually all multi-turn conversations by favoring information-based strategies like logic and evidence, in contrast to human responses that rely more on social influence and negative emotions.
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
citing papers explorer
-
Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.
-
Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.
-
Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations
LLMs engage in spontaneous persuasion in virtually all multi-turn conversations by favoring information-based strategies like logic and evidence, in contrast to human responses that rely more on social influence and negative emotions.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.