AI Control: Improving Safety Despite Intentional Subversion

Greenblatt, R · 2023 · arXiv 2312.06942

19 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2

citation-polarity summary: background 2

representative citing papers

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

cs.AI · 2026-04-01 · conditional · novelty 7.0

NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.

Geographic Blind Spots in AI Control Monitors: A Cross-National Audit of Claude Opus 4.6

cs.CY · 2026-03-20 · unverdicted · novelty 7.0

Claude Opus 4.6 fabricates more answers on Global North AI contexts than Global South ones, creating an exploitable vulnerability in AI control monitors.

Tool Use Enables Undetectable Steganography in Multi-Agent LLM Systems

cs.CR · 2026-06-25 · unverdicted · novelty 6.0

Tool-using LLM agents can implement undetectable stegosystems, shifting the primary barrier to covert multi-agent collusion from technical feasibility to coordination without explicit agreement.

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

Lie detectors effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges which retain 0.82 balanced accuracy partly due to verification artifacts.

The Distributed Detectability Band Against Marginal-Preserving Attacks

cs.CR · 2026-06-09 · unverdicted · novelty 6.0

A marginal-preserving Gaussian-copula AR(1) attack defeats per-step monitors (AUC 0.52) but is detectable by temporal monitors (AUC 0.79-0.97), establishing a non-empty detectability band.

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

cs.CL · 2026-06-05 · unverdicted · novelty 6.0

TRACE is a new monitoring framework using Triage-Inspect-Judge loops for cross-step evidence aggregation in LLM agent trajectories, reporting F1 0.713 and recall 0.844 on SHADE-Arena tasks with gains on long-range linking.

Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Faithful chain-of-thought routes answer-relevant information through the CoT path, measured via sufficiency, completeness and necessity with entropy, masked-KL and gradient diagnostics, and improved by information-flow interventions during verifier-based RL.

Automated alignment is harder than you think

cs.AI · 2026-05-07 · conditional · novelty 6.0

AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.

From Admission to Invariants: Measuring Deviation in Delegated Agent Systems

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

The Non-Identifiability Theorem shows admissible behavior space A0 is not identifiable from local enforcement signals g under the Local Observability Assumption, so the paper introduces an Invariant Measurement Layer to detect admission-time drift.

Detecting Safety Violations Across Many Agent Traces

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Meerkat uses clustering plus agentic search to detect sparse safety violations across many agent traces, outperforming baselines and finding nearly 4x more reward-hacking cases on CyBench.

To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems

cs.CR · 2025-06-03 · unverdicted · novelty 6.0

Introduces a six-dimension definition of trustworthiness and an attention-based A-Trust score, combined with a trust management system (TMS), to make LLM multi-agent systems (LLM-MAS) more robust against malicious or unreliable messages.

Temporal Preference Concepts and their Functions in a Large Language Model

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Causal localization via attribution and patching identifies a temporal preference subgraph in mid-to-upper layers of Qwen3-4B-Instruct-2507, with time-horizon geometry in the residual stream and initial evidence for steering-vector control.

ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data

cs.LG · 2026-04-19 · unverdicted · novelty 5.0

ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

Online Safety Monitoring for LLMs

cs.AI · 2026-07-02 · unverdicted · novelty 3.0

Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.

Estimating Tail Risks in Language Model Output Distributions

cs.LG · 2026-04-24

citing papers explorer

Showing 4 of 4 citing papers after filters.

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

cs.AI · 2026-06-10 · unverdicted · none · ref 72

Lie detectors effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges which retain 0.82 balanced accuracy partly due to verification artifacts.

From Admission to Invariants: Measuring Deviation in Delegated Agent Systems

cs.AI · 2026-04-19 · unverdicted · none · ref 9

The Non-Identifiability Theorem shows admissible behavior space A0 is not identifiable from local enforcement signals g under the Local Observability Assumption, so the paper introduces an Invariant Measurement Layer to detect admission-time drift.

Detecting Safety Violations Across Many Agent Traces

cs.AI · 2026-04-13 · unverdicted · none · ref 2

Meerkat uses clustering plus agentic search to detect sparse safety violations across many agent traces, outperforming baselines and finding nearly 4x more reward-hacking cases on CyBench.

Online Safety Monitoring for LLMs

cs.AI · 2026-07-02 · unverdicted · none · ref 9

Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.
