AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
Stress testing deliberative alignment for anti-scheming training
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.
citing papers explorer
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
-
Evaluation Awareness in Language Models Has Limited Effect on Behaviour
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
-
The Impact of Off-Policy Training Data on Probe Generalisation
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.
-
Do Linear Probes Generalize Better in Persona Coordinates?
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
-
The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem
Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.