pith. sign in

super hub Canonical reference

Concrete Problems in AI Safety

Canonical reference. 90% of citing Pith papers cite this work as background.

222 Pith papers citing it
Background 90% of classified citations
abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

hub tools

citation-role summary

background 41 method 1

citation-polarity summary

claims ledger

  • abstract Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), a

authors

co-cited works

clear filters

representative citing papers

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

AI safety via debate

stat.ML · 2018-05-02 · conditional · novelty 8.0

AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.

Competing Auctions in Intermediated Markets

cs.GT · 2026-06-04 · unverdicted · novelty 7.0

Sealed-bid second-price intermediary auctions fully unravel into sealed first-price principal auctions while open formats unravel only partially, limiting intermediary design space when a credible first-price channel exists.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

Theoretical Limits of Language Model Alignment

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

citing papers explorer

Showing 7 of 7 citing papers after filters.

  • Theoretical Limits of Language Model Alignment cs.LG · 2026-05-08 · unverdicted · none · ref 2 · internal anchor

    The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

  • Reinforcement Learning via Value Gradient Flow cs.LG · 2026-04-15 · unverdicted · none · ref 3 · internal anchor

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  • AI Alignment via Incentives and Correction cs.LG · 2026-05-02 · unverdicted · none · ref 2 · 2 links · internal anchor

    AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM coding tasks.

  • When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient cs.LG · 2026-04-28 · unverdicted · none · ref 4 · internal anchor

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.

  • Evaluation-driven Scaling for Scientific Discovery cs.LG · 2026-04-21 · unverdicted · none · ref 5 · internal anchor

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.

  • Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 10 · internal anchor

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

  • Energy-Based Dynamical Models for Neurocomputation, Learning, and Optimization cs.LG · 2026-04-06 · unverdicted · none · ref 4 · internal anchor

    The paper reviews and extends energy-based dynamical models that use gradient flows and energy landscapes for neurocomputation, learning, and optimization tasks.