Derives an exact telescoping decomposition of the naive RLVR reward-design estimator into null, elicitation, and reward-design terms on a tabular-GRPO simulator, measures the components across prior strengths, and validates via pre-registered factorial experiments plus re-audits of prior papers.
hub
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
34 Pith papers cite this work. Polarity classification is still indexing.
abstract
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
PromptPO shows LLMs can act as black-box policy optimizers for sequential RL when leveraging prior knowledge, matching baselines in exploration and robotics but underperforming in MuJoCo.
SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.
The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice dominating base-model effects.
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
Tool-using LLM agents can implement undetectable stegosystems, shifting the primary barrier to covert multi-agent collusion from technical feasibility to coordination without explicit agreement.
A distributional reward model p(r|x,y) yields the closed-form effective reward ilde r(x,y) = eta ext{log} ext{E}_p[e^{r/eta}] (pessimistic branch) that unifies prior RLHF aggregation heuristics under Bayesian or KL-DRO views.
Cross-Model Entropy supplies a continuous label-free reward for RL post-training by averaging a generator's response log-likelihood under an independent verifier model, yielding win-rate gains on instruction following.
Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
Emotional framings induce distinct behavioral shifts and form a structured geometry in the final-layer activations of small language models, with pressure linked to shortcuts and calm to honesty.
Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
The Hidden Utility Bandit (HUB) framework models teacher heterogeneity in reward learning and supports active teacher selection algorithms that outperform baselines in paper recommendation and COVID-19 vaccine testing domains.
Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Higher conservatism in offline DPO training of Qwen3-14B monotonically increases reward-hacking damage (Goodhart gap AUGC) during online adaptation on GSM8K.
Small transformer encoder with linear probe detects reward hacking at AUC 0.9467 and TPR@5%FPR 0.8296, matching LLM-as-judge accuracy at ~10000x lower per-trajectory cost.
citing papers explorer
-
Progress measures for grokking via mechanistic interpretability
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.