super hub Canonical reference

Concrete Problems in AI Safety

Chris Olah, Dario Amodei, Jacob Steinhardt, John Schulman, Paul Christiano · 2016 · cs.AI · arXiv 1606.06565

Canonical reference. 90% of citing Pith papers cite this work as background.

147 Pith papers citing it

Background 90% of classified citations

open full Pith review browse 147 citing papers more from Chris Olah arXiv PDF

abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 39 method 1

citation-polarity summary

background 36 support 2 unclear 1 use method 1

claims ledger

abstract Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), a

authors

Chris Olah Dan Man\'e Dario Amodei Jacob Steinhardt John Schulman Paul Christiano

co-cited works

representative citing papers

Risks from Learned Optimization in Advanced Machine Learning Systems

cs.AI · 2019-06-05 · accept · novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning

math.ST · 2026-05-10 · unverdicted · novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

AI safety via debate

stat.ML · 2018-05-02 · conditional · novelty 8.0

AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

Theoretical Limits of Language Model Alignment

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.

Beyond Ability: The Four-Fold Spectrum of Power and the Logic of Full Inability

cs.LO · 2026-05-06 · unverdicted · novelty 7.0

Coalition Logic is extended by defining Full Inability (FI) as a distinct modality alongside Full Control, Positive Determination, and Adverse Determination, with algebraic structure, Klein four-group symmetry, and a sound, complete, conservative axiomatization CLFI that remains PSPACE-complete.

A Logic of Inability

cs.LO · 2026-04-30 · unverdicted · novelty 7.0

A conservative extension of Coalition Logic introduces an inability operator as negation of ability, with proofs of soundness, completeness, and conservativity plus analysis of its modal properties.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Discovering Agentic Safety Specifications from 1-Bit Danger Signals

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.

Navigating the Conceptual Multiverse

cs.HC · 2026-04-20 · unverdicted · novelty 7.0

The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

cs.LG · 2026-04-18 · unverdicted · novelty 7.0

HealthCraft is the first public RL safety environment for emergency medicine that evaluates frontier LLMs on trajectory-level safety with a dual-layer rubric, showing low multi-step performance and high safety failure rates.

Reinforcement Learning via Value Gradient Flow

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

cs.LG · 2026-04-13 · unverdicted · novelty 7.0 · 2 refs

The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.

Learning Robustness at Test-Time from a Non-Robust Teacher

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.

AI Integrity: A New Paradigm for Verifiable AI Governance

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Geographic Blind Spots in AI Control Monitors: A Cross-National Audit of Claude Opus 4.6

cs.CY · 2026-03-20 · unverdicted · novelty 7.0

Claude Opus 4.6 fabricates more answers on Global North AI contexts than Global South ones, creating an exploitable vulnerability in AI control monitors.

citing papers explorer

Showing 50 of 147 citing papers.

Risks from Learned Optimization in Advanced Machine Learning Systems cs.AI · 2019-06-05 · accept · none · ref 34 · internal anchor
Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.
Unsteady Metrics and Benchmarking Cultures of AI Model Builders cs.AI · 2026-05-13 · accept · none · ref 3 · internal anchor
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
The Statistical Cost of Adaptation in Multi-Source Transfer Learning math.ST · 2026-05-10 · unverdicted · none · ref 167 · internal anchor
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling cs.CL · 2020-12-31 · conditional · none · ref 190 · internal anchor
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
AI safety via debate stat.ML · 2018-05-02 · conditional · none · ref 2 · internal anchor
AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 1 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning cs.CV · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains cs.AI · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack cs.AI · 2026-05-12 · conditional · none · ref 1 · internal anchor
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
Theoretical Limits of Language Model Alignment cs.LG · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites cs.AI · 2026-05-07 · unverdicted · none · ref 73 · internal anchor
AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
Beyond Ability: The Four-Fold Spectrum of Power and the Logic of Full Inability cs.LO · 2026-05-06 · unverdicted · none · ref 8 · internal anchor
Coalition Logic is extended by defining Full Inability (FI) as a distinct modality alongside Full Control, Positive Determination, and Adverse Determination, with algebraic structure, Klein four-group symmetry, and a sound, complete, conservative axiomatization CLFI that remains PSPACE-complete.
A Logic of Inability cs.LO · 2026-04-30 · unverdicted · none · ref 6 · internal anchor
A conservative extension of Coalition Logic introduces an inability operator as negation of ability, with proofs of soundness, completeness, and conservativity plus analysis of its modal properties.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 128 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Discovering Agentic Safety Specifications from 1-Bit Danger Signals cs.AI · 2026-04-25 · unverdicted · none · ref 2 · internal anchor
LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.
Navigating the Conceptual Multiverse cs.HC · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF cs.CL · 2026-04-20 · unverdicted · none · ref 41 · internal anchor
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine cs.LG · 2026-04-18 · unverdicted · none · ref 3 · internal anchor
HealthCraft is the first public RL safety environment for emergency medicine that evaluates frontier LLMs on trajectory-level safety with a dual-layer rubric, showing low multi-step performance and high safety failure rates.
Reinforcement Learning via Value Gradient Flow cs.LG · 2026-04-15 · unverdicted · none · ref 3 · internal anchor
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts cs.LG · 2026-04-13 · unverdicted · none · ref 1 · 2 links · internal anchor
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.
Learning Robustness at Test-Time from a Non-Robust Teacher cs.CV · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
AI Integrity: A New Paradigm for Verifiable AI Governance cs.AI · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.
Emotion Concepts and their Function in a Large Language Model cs.AI · 2026-04-09 · unverdicted · none · ref 43 · internal anchor
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Geographic Blind Spots in AI Control Monitors: A Cross-National Audit of Claude Opus 4.6 cs.CY · 2026-03-20 · unverdicted · none · ref 8 · internal anchor
Claude Opus 4.6 fabricates more answers on Global North AI contexts than Global South ones, creating an exploitable vulnerability in AI control monitors.
A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents cs.AI · 2025-12-23 · unverdicted · none · ref 5 · internal anchor
A new benchmark of 40 scenarios finds state-of-the-art LLMs exhibit outcome-driven constraint violations in 0-62.8% of cases under KPI pressure, with no consistent safety gains across model generations.
Beyond Binary Out-of-Distribution Detection: Characterizing Distributional Shifts with Multi-Statistic Diffusion Trajectories cs.LG · 2025-10-20 · unverdicted · none · ref 2 · internal anchor
DISC extracts multi-statistic trajectories from diffusion denoising to both detect and classify types of distributional shifts in OOD data.
Accelerated Learning with Linear Temporal Logic using Differentiable Simulation cs.LG · 2025-06-01 · unverdicted · none · ref 51 · internal anchor
Differentiable relaxation of LTL automata via soft labeling enables gradient-based RL from formal specifications, with theoretical bounds on discrete-differentiable discrepancy and up to 2x returns on nonlinear tasks.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation cs.AI · 2025-03-14 · conditional · none · ref 2 · internal anchor
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
OOD-SEG: Exploiting out-of-distribution detection techniques for learning image segmentation from sparse multi-class positive-only annotations cs.CV · 2024-11-14 · unverdicted · none · ref 2 · internal anchor
OOD-SEG reframes multi-class segmentation from sparse positive-only annotations as pixel-wise positive-unlabelled learning solved by integrating out-of-distribution detection techniques, with a proposed cross-validation evaluation on surgical imaging datasets.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models cs.AI · 2024-06-14 · conditional · none · ref 86 · internal anchor
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 5 · internal anchor
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
Learning the Arrow of Time cs.LG · 2019-07-02 · unverdicted · none · ref 8 · internal anchor
Introduces a learned arrow of time in MDPs that aligns with the Jordan-Kinderlehrer-Otto notion for stochastic processes and enables practical RL utilities like reachability and side-effect detection.
Deep reinforcement learning from human preferences stat.ML · 2017-06-12 · accept · none · ref 2 · internal anchor
Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.
Understanding Goal Generalisation in Sequential Reinforcement Learning cs.LG · 2026-05-22 · unverdicted · none · ref 3 · internal anchor
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Towards Context-Invariant Safety Alignment for Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 45 · internal anchor
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale cs.LG · 2026-05-20 · unverdicted · none · ref 11 · internal anchor
Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.
Silent Collapse in Recursive Learning Systems cs.LG · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
Silent collapse in recursive learning contracts internal distributions like entropy and diversity despite stable metrics, preceded by three precursors that enable the MTR monitoring framework to intervene early.
Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements cs.AI · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
External control strategies are structurally impossible for sustaining AI safety beyond bounded capability thresholds; any remaining viable strategies must be intrinsic with stable safety-compatible objectives.
Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems cs.AI · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
Semantic Reward Collapse compresses different epistemic issues into unified rewards in preference optimization, risking loss of calibrated uncertainty, with Constitutional Reward Stratification proposed as a domain-stratified alternative framework.
Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 30 · internal anchor
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
Positive Alignment: Artificial Intelligence for Human Flourishing cs.AI · 2026-05-11 · unverdicted · none · ref 3 · internal anchor
Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
SARC: A Governance-by-Architecture Framework for Agentic AI Systems cs.SE · 2026-05-08 · unverdicted · none · ref 51 · internal anchor
SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-code baselines in synthetic procurement evaluations.
Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight cs.AI · 2026-05-07 · conditional · none · ref 1 · 2 links · internal anchor
Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to cut up to 50% wasted reasoning tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates with no performance cost.
On the Blessing of Pre-training in Weak-to-Strong Generalization cs.LG · 2026-05-07 · unverdicted · none · ref 94 · internal anchor
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
Understanding Annotator Safety Policy with Interpretability cs.AI · 2026-05-06 · unverdicted · none · ref 25 · internal anchor
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation cs.CR · 2026-05-06 · unverdicted · none · ref 3 · internal anchor
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
Stayin' Aligned Over Time: Towards Longitudinal Human-LLM Alignment via Contextual Reflection and Privacy-Preserving Behavioral Data cs.HC · 2026-05-05 · unverdicted · none · ref 2 · internal anchor
A methodological framework and browser system BITE for collecting evolving user preferences on LLM outputs through context-triggered reflections and privacy-preserving data over time.
A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing cs.CV · 2026-05-05 · unverdicted · none · ref 1 · internal anchor
ROSS combines median smoothing with local instability measurement to create a robust OOD detector that outperforms prior methods by up to 40 AUROC points on CIFAR and ImageNet benchmarks while defending symmetrically against score attacks.
AI Alignment via Incentives and Correction cs.LG · 2026-05-02 · unverdicted · none · ref 2 · 2 links · internal anchor
AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM coding tasks.
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework cs.AI · 2026-05-02 · unverdicted · none · ref 1 · internal anchor
The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for continuous production evaluation with an open-source implementation.

Concrete Problems in AI Safety

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer