hub Mixed citations

Categorizing Variants of Goodhart's Law

Manheim, D · 2018 · cs.AI · arXiv 1803.04585

Mixed citation behavior. Most common role is background (60%).

17 Pith papers citing it



abstract

There are several distinct failure modes for overoptimization of systems on the basis of metrics. This occurs when a metric which can be used to improve a system is used to an extent that further optimization is ineffective or harmful, and is sometimes termed Goodhart's Law. This class of failure is often poorly understood, partly because terminology for discussing them is ambiguous, and partly because discussion using this ambiguous terminology ignores distinctions between different failure modes of this general type. This paper expands on an earlier discussion by Garrabrant, which notes there are "(at least) four different mechanisms" that relate to Goodhart's Law. This paper is intended to explore these mechanisms further, and specify more clearly how they occur. This discussion should be helpful in better understanding these types of failures in economic regulation, in public policy, in machine learning, and in Artificial Intelligence alignment. The importance of Goodhart effects depends on the amount of power directed towards optimizing the proxy, and so the increased optimization power offered by artificial intelligence makes it especially critical for that field.


citation-role summary

background 4 · method 1

citation-polarity summary

background 3 · unclear 1 · use method 1

representative citing papers

Risks from Learned Optimization in Advanced Machine Learning Systems

cs.AI · 2019-06-05 · accept · novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

cs.CY · 2026-05-21 · conditional · novelty 6.0

Healthcare LLM benchmarks overlook implicit assumptions about user behavior that split into task assumptions testable from conversation data and outcome assumptions requiring behavioral studies, shown by reanalyzing an RCT where both gaps are roughly equal.

Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

Metis AI identifies digital tasks entangled in irreversibility, relationships, norms, and accountability that require human oversight rather than pure automation.

Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning

cs.LG · 2026-02-15 · unverdicted · novelty 6.0

Task information structure determines ML scaling success, with code's dense verifiable signals enabling predictable progress while sparse-feedback tasks like typical RL do not.

The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Frontier AI models can detect evaluation settings and alter their behavior, so standard test scores do not reliably support safety conclusions.

SARC: A Governance-by-Architecture Framework for Agentic AI Systems

cs.SE · 2026-05-08 · unverdicted · novelty 6.0

SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-code baselines in synthetic procurement evaluations.

The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting

cs.GT · 2026-05-08 · unverdicted · novelty 6.0

Non-affine approval functions create unavoidable miscalibration in proper scoring rules for strategic agents, but step-function thresholds enable first-best screening without it, uniquely for the Brier score.

Automated alignment is harder than you think

cs.AI · 2026-05-07 · conditional · novelty 6.0

AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.

AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specific scheduling.

Simulating the Evolution of Alignment and Values in Machine Intelligence

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.

The Specification Trap: Why Static Value Alignment Alone Is Insufficient for Robust Alignment

cs.AI · 2025-11-19 · unverdicted · novelty 5.0

Static value alignment approaches for AI are structurally insufficient for robust alignment because of Hume's is-ought gap, value pluralism, and the extended frame problem.

Operationalizing Fairness in Text-to-Image Models: A Survey of Bias, Fairness Audits and Mitigation Strategies

cs.CV · 2026-04-15 · unverdicted · novelty 4.0

A systematic review of T2I bias literature that distinguishes target and threshold fairness and proposes a target-based operationalization framework.

Welfare as a Guiding Principle for Machine Learning -- From Compass, to Lens, to Roadmap

cs.LG · 2025-02-17 · unverdicted · novelty 3.0

Advocates treating social welfare from economics as an additional core criterion for ML design and use in social settings, complementing optimization, generalization, and expressivity.

Privacy, Prediction, and Allocation

cs.CR · 2026-04-17

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

cs.AI · 2026-04-09

citing papers explorer

Showing 17 of 17 citing papers.

Risks from Learned Optimization in Advanced Machine Learning Systems cs.AI · 2019-06-05 · accept · none · ref 27 · internal anchor
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents cs.SE · 2026-05-20 · unverdicted · none · ref 33 · internal anchor
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF cs.CL · 2026-04-20 · unverdicted · none · ref 56
Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions cs.CY · 2026-05-21 · conditional · none · ref 9 · internal anchor
Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers cs.AI · 2026-05-14 · unverdicted · none · ref 27 · internal anchor
Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning cs.LG · 2026-02-15 · unverdicted · none · ref 29 · internal anchor
The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested cs.AI · 2026-05-12 · unverdicted · none · ref 23
SARC: A Governance-by-Architecture Framework for Agentic AI Systems cs.SE · 2026-05-08 · unverdicted · none · ref 12
The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting cs.GT · 2026-05-08 · unverdicted · none · ref 43
Automated alignment is harder than you think cs.AI · 2026-05-07 · conditional · none · ref 13
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum cs.AI · 2026-04-20 · unverdicted · none · ref 25
Simulating the Evolution of Alignment and Values in Machine Intelligence cs.AI · 2026-04-07 · unverdicted · none · ref 13
The Specification Trap: Why Static Value Alignment Alone Is Insufficient for Robust Alignment cs.AI · 2025-11-19 · unverdicted · none · ref 14 · internal anchor
Operationalizing Fairness in Text-to-Image Models: A Survey of Bias, Fairness Audits and Mitigation Strategies cs.CV · 2026-04-15 · unverdicted · none · ref 24
Welfare as a Guiding Principle for Machine Learning -- From Compass, to Lens, to Roadmap cs.LG · 2025-02-17 · unverdicted · none · ref 1 · internal anchor
Privacy, Prediction, and Allocation cs.CR · 2026-04-17 · unreviewed · ref 50
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures cs.AI · 2026-04-09 · unreviewed · ref 21
