Concrete Problems in AI Safety

Chris Olah, Dan Mané, Dario Amodei, Jacob Steinhardt, John Schulman, Paul Christiano · 2016 · cs.AI · arXiv 1606.06565

Canonical reference. 90% of citing Pith papers cite this work as background.

225 Pith papers citing it

Background: 90% of classified citations

abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

citation-role summary

background: 41 · method: 1

citation-polarity summary

background: 38 · support: 2 · unclear: 1 · use method: 1

authors

Chris Olah, Dan Mané, Dario Amodei, Jacob Steinhardt, John Schulman, Paul Christiano

representative citing papers

Risks from Learned Optimization in Advanced Machine Learning Systems

cs.AI · 2019-06-05 · accept · novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning

math.ST · 2026-05-10 · unverdicted · novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

AI safety via debate

stat.ML · 2018-05-02 · conditional · novelty 8.0

AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.

Managed Autonomy at Runtime: Gear-Based Safety and Governance for Single- and Multi-Agent Cyber-Physical Systems

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

The proposed system combines five gears with utility-gated dispatch for safety in autonomous agents, proving stability for single agents and providing distributed guarantees for multi-agent CPS, evaluated on UR5 robots.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

Evolving Quantum Error-Correcting Encodings for Molecular Simulation

quant-ph · 2026-06-24 · conditional · novelty 7.0

LLM-driven evolutionary program synthesis discovers Generalized Superfast Encodings with exact distance 5 (and 6 on one instance) for molecular Hamiltonians, the first beyond distance 3.

Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts

cs.CL · 2026-06-20 · unverdicted · novelty 7.0

Introduces a Q-sort protocol using human reference factors to quantify LLM value-structure alignment via Procrustes similarity and RSA correlations, revealing cross-family heterogeneity and localized misalignments.

World Model Self-Distillation: Training World Models to Solve General Tasks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

VLM-Safe-RL adds frozen VLM signals as anticipatory costs to the CMDP Lagrangian update via dual-path CLIP, VLM-Lagrange, and confidence gating, outperforming baselines on Safety-Gymnasium FormulaOne while showing partial generalization.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

cs.AI · 2026-06-05 · unverdicted · novelty 7.0

MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.

Competing Auctions in Intermediated Markets

cs.GT · 2026-06-04 · unverdicted · novelty 7.0

Sealed-bid second-price intermediary auctions fully unravel into sealed first-price principal auctions while open formats unravel only partially, limiting intermediary design space when a credible first-price channel exists.

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

CVT-RL improves verified task success to 78.9% and reduces hacking to 3.9% in long-horizon language agents by combining intervention-validity gating with a selection-adjusted doubly robust PCCC estimator.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

cs.SE · 2026-05-29 · unverdicted · novelty 7.0

An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

Theoretical Limits of Language Model Alignment

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

citing papers explorer

Showing 50 of 58 citing papers after filters.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders cs.AI · 2026-05-13 · accept · none · ref 3 · internal anchor
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
Managed Autonomy at Runtime: Gear-Based Safety and Governance for Single- and Multi-Agent Cyber-Physical Systems cs.AI · 2026-07-01 · unverdicted · none · ref 13 · internal anchor
The proposed system combines five gears with utility-gated dispatch for safety in autonomous agents, proving stability for single agents and providing distributed guarantees for multi-agent CPS, evaluated on UR5 robots.
Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems cs.AI · 2026-06-05 · unverdicted · none · ref 1 · internal anchor
MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.
Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking cs.AI · 2026-06-04 · unverdicted · none · ref 25 · internal anchor
Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains cs.AI · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack cs.AI · 2026-05-12 · conditional · none · ref 1 · internal anchor
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites cs.AI · 2026-05-07 · unverdicted · none · ref 73 · internal anchor
AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
Discovering Agentic Safety Specifications from 1-Bit Danger Signals cs.AI · 2026-04-25 · unverdicted · none · ref 2 · internal anchor
LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.
AI Integrity: A New Paradigm for Verifiable AI Governance cs.AI · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.
Emotion Concepts and their Function in a Large Language Model cs.AI · 2026-04-09 · unverdicted · none · ref 43 · internal anchor
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Safety from Honesty in a Disinterested AI Predictor cs.AI · 2026-06-28 · unverdicted · none · ref 3 · internal anchor
A disinterested Bayesian Predictor trained on contextualized statements has low probability of producing harmful agency because dangerous behaviors require rare coordinated underestimation of harm with no training signal favoring them.
Autodata: An agentic data scientist to create high quality synthetic data cs.AI · 2026-06-24 · unverdicted · none · ref 49 · internal anchor
Autodata introduces an agentic method with meta-optimization to create higher-quality synthetic data, yielding performance gains over standard methods on CS, legal, and math tasks.
Reinforcement Learning Towards Broadly and Persistently Beneficial Models cs.AI · 2026-06-22 · unverdicted · none · ref 44 · internal anchor
Reinforcement learning on beneficial traits in realistic domains yields broad improvements on over 80% of out-of-distribution alignment benchmarks and greater resistance to adversarial steering.
ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space cs.AI · 2026-06-11 · unverdicted · none · ref 32 · internal anchor
ERTS encodes ethical dilemmas in a 22D space, applies 17 semantic perturbations under 6 constraints, and uses a 4-component index to test 6 models on 1500 cases, finding only 33% pass clearance.
Reasoning Structure of Large Language Models cs.AI · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
Introduces a logic puzzle benchmark, a pipeline to build verifiable reasoning graphs from traces, and a concentration-based efficiency metric to distinguish model reasoning behaviors.
Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification cs.AI · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
The authors introduce a three-part ontology-based verification system for AI agents that generates regulatory and adversarial test scenarios and issues machine-verifiable trust certificates, with pilot results indicating improved coverage over baselines in four industries.
Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight cs.AI · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
Weak models used as critics supplying non-misleading revision directions, distilled on-policy via OPCD, improve frozen and trained strong models on reasoning and alignment benchmarks.
Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements cs.AI · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
External control strategies are structurally impossible for sustaining AI safety beyond bounded capability thresholds; any remaining viable strategies must be intrinsic with stable safety-compatible objectives.
Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems cs.AI · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
Semantic Reward Collapse compresses different epistemic issues into unified rewards in preference optimization, risking loss of calibrated uncertainty, with Constitutional Reward Stratification proposed as a domain-stratified alternative framework.
Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight cs.AI · 2026-05-07 · conditional · none · ref 1 · 2 links · internal anchor
Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to cut up to 50% wasted reasoning tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates with no performance cost.
Understanding Annotator Safety Policy with Interpretability cs.AI · 2026-05-06 · unverdicted · none · ref 25 · internal anchor
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework cs.AI · 2026-05-02 · unverdicted · none · ref 1 · internal anchor
The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for continuous production evaluation with an open-source implementation.
AI Governance under Political Turnover: The Alignment Surface of Compliance Design cs.AI · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
A formal model shows that AI compliance designs in government create learnable approval boundaries that political successors can exploit, causing initial oversight gains to increase long-term strategic vulnerability.
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition cs.AI · 2026-04-20 · unverdicted · none · ref 7 · internal anchor
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models cs.AI · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
Eight AI models show split value priorities at the top layer, divergent evidence preferences in the middle, and broad convergence on institutional sources at the bottom, with substantial sensitivity to scenario framing.
Simulating the Evolution of Alignment and Values in Machine Intelligence cs.AI · 2026-04-07 · unverdicted · none · ref 1 · internal anchor
Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
Tracking Capabilities for Safer Agents cs.AI · 2026-03-01 · unverdicted · none · ref 5 · internal anchor
AI agents can generate code in a capability-safe Scala dialect that statically prevents information leakage and malicious side effects while preserving task performance.
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction cs.AI · 2026-02-05 · unverdicted · none · ref 2 · internal anchor
AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
The Two Genie Game: Adoption and Welfare in Audit-Grounded AI Governance cs.AI · 2026-06-27 · unverdicted · partial · ref 2 · internal anchor
Using Moran-Fermi evolutionary dynamics, the paper derives conditions on community sentiment priors for audited-agent adoption and fixation bounds, while showing that self-audited agents are not generally sufficient to prevent harm.
Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models cs.AI · 2026-06-24 · conditional · none · ref 15 · internal anchor
Narration-of-thought prompting reduces stakeholder collapse from up to 31% to under 1% and uncertainty suppression from up to 72% to 1-24% across four LLM generators on 100 DailyDilemmas scenarios.
When Web Agents Finish but Still Fail: Reproducible Triggers and Trace Diagnostics for Parallel Web Exploration cs.AI · 2026-06-16 · unverdicted · none · ref 12 · 2 links · internal anchor
Parallel WebBench reveals GRPO training raises web agent completion to 96% but leaves a large correctness gap from context-bound loops, premature termination, and synthesis collapse.
Before the Model Learns the Bug: Fuzzing RLVR Verifiers cs.AI · 2026-05-31 · unverdicted · none · ref 1 · internal anchor
Presents a lightweight verifier-fuzzing framework for RLVR that generates adversarial completions to detect false positives, false negatives, and exploits in buggy verifiers by comparing to stricter references.
Positive Alignment: Artificial Intelligence for Human Flourishing cs.AI · 2026-05-11 · unverdicted · none · ref 3 · internal anchor
Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.
Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair cs.AI · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.
Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence cs.AI · 2026-05-05 · unverdicted · none · ref 30 · 3 links · internal anchor
Mechanical conscience is proposed as a trajectory-level regulatory filter for AI policies that reduces cumulative deviation from admissible regions, with claimed theoretical properties and extension to multi-agent settings.
AI Safety as Control of Irreversibility: A Systems Framework for Decision-Energy and Sovereignty Boundaries cs.AI · 2026-05-02 · unverdicted · none · ref 1 · internal anchor
AI safety requires stabilizing sovereignty boundaries to stop irreversible decision authority from concentrating in the most efficient AI nodes.
FSFM: A Biologically-Inspired Framework for Selective Forgetting of Agent Memory cs.AI · 2026-04-22 · unverdicted · none · ref 75 · internal anchor
FSFM is a biologically-inspired selective forgetting framework for LLM agents that claims to boost access efficiency by 8.49%, content quality by 29.2% signal-to-noise, and eliminate security risks entirely through a taxonomy of decay, deletion, safety, and adaptive mechanisms.
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models cs.AI · 2026-04-18 · unverdicted · none · ref 5 · internal anchor
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
The Cartesian Cut in Agentic AI cs.AI · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline cs.AI · 2026-03-17 · unverdicted · none · ref 23 · internal anchor
The thesis presents Pino, an end-to-end pipeline that supervises reinforcement learning agents with argumentation-based normative advisors, introduces an algorithm for automatic argument extraction, and defines a mitigation strategy for norm avoidance.
Critique of Agent Model cs.AI · 2026-06-22 · unverdicted · none · ref 4 · internal anchor
Distinguishes agentic (externally scaffolded) from agentive (internally structured) AI systems and proposes the Goal-Identity-Configurator architecture for endogenous autonomy.
Cognitive Digital Twins: Ethical Risks and Governance for AI Systems That Model the Mind cs.AI · 2026-06-22 · unverdicted · none · ref 25 · internal anchor
Defines cognitive digital twins and introduces a 5A governance framework to address risks such as misrepresentation and proxy action in AI systems that model individual cognition.
The Tao of Agency: Autotelic AI, Embedded Agency and Dissolution of the Self cs.AI · 2026-06-18 · unverdicted · none · ref 12 · internal anchor
Autotelic AI requires agents to generate and relativize their own self-boundaries in embedded settings, with the paper consolidating this into a framework extended to quantum, philosophical, and LLM contexts.
Under What Conditions Can a Machine Be Called Genuinely Creative? cs.AI · 2026-06-11 · unverdicted · none · ref 1 · internal anchor
Genuine machine creativity requires ten requirements from Designics—environment representation through human-AI co-living—organized by the laws of perception, conflict, and capability, rather than novelty or architecture.
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization cs.AI · 2026-06-08 · unverdicted · none · ref 44 · internal anchor
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
Insurance of Agentic AI cs.AI · 2026-06-03 · unverdicted · none · ref 2 · internal anchor
Agentic AI requires a coordinated layered insurance ecosystem integrating cyber, liability, and performance coverages rather than a single new product.
Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On cs.AI · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
Argues that trustworthiness in Agent-to-Agent networks requires a new conceptual framework with four design pillars baked in from the beginning, as retrofitting existing single-agent methods is insufficient.
Ethical Hyper-Velocity (EHV): A Hardware-Rooted Zero-Trust Runtime Enforcement Architecture for Agentic AI Systems cs.AI · 2026-05-18 · unverdicted · none · ref 7 · 2 links · internal anchor
EHV integrates GCD, causal graph CRDTs, TEE attestation, and bounded TLA+ verification to achieve O(1) runtime policy enforcement for agentic AI systems.
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems cs.AI · 2026-05-05 · unverdicted · none · ref 22 · internal anchor
Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance cs.AI · 2026-05-02 · unverdicted · none · ref 1 · internal anchor
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
