The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

34 Pith papers cite this work. Polarity classification is still indexing.

abstract

Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.

citation-role summary

background 4

representative citing papers

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

cs.SE · 2026-05-18 · conditional · novelty 7.0

The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice dominating base-model effects.

A Unifying Lens on Reward Uncertainty in RLHF

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

A distributional reward model $p(r \mid x,y)$ yields the closed-form effective reward $\tilde r(x,y) = \eta \log \mathbb{E}_p\left[e^{r/\eta}\right]$ (pessimistic branch) that unifies prior RLHF aggregation heuristics under Bayesian or KL-DRO views.

Label-Free Reinforcement Learning via Cross-Model Entropy

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Cross-Model Entropy supplies a continuous label-free reward for RL post-training by averaging a generator's response log-likelihood under an independent verifier model, yielding win-rate gains on instruction following.

Active teacher selection for reward learning

cs.AI · 2023-10-23 · unverdicted · novelty 6.0

The Hidden Utility Bandit (HUB) framework models teacher heterogeneity in reward learning and supports active teacher selection algorithms that outperform baselines in paper recommendation and COVID-19 vaccine testing domains.

Scaling Laws for Reward Model Overoptimization

cs.LG · 2022-10-19 · unverdicted · novelty 6.0

Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Cheap Reward Hacking Detection

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

A small transformer encoder with a linear probe detects reward hacking at AUC 0.9467 and TPR@5%FPR 0.8296, matching LLM-as-judge accuracy at ~10000x lower per-trajectory cost.
