Auditbench: Evaluating alignment auditing techniques on models with hidden behaviors

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R Bowman, Sara Price, Samuel Marks, Rowan Wang · 2026 · DOI 10.48550/arxiv.2602.22755 · arXiv 2602.22755

9 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary: method 1

citation-polarity summary: use method 1

representative citing papers

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

Model organism interpretability depends strongly on training methodology, with integrated training yielding less interpretable MOs than post-hoc SFT or DPO.

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

Lie detectors effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges which retain 0.82 balanced accuracy partly due to verification artifacts.

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.

Most Current Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

cs.CL · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

Perplexity differencing on completions from short random prefills surfaces finetuning objectives in the vast majority of tested model organisms across sizes and types.

Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

cs.HC · 2026-04-23 · unverdicted · novelty 6.0

LLMs engage in spontaneous persuasion in virtually all multi-turn conversations by favoring information-based strategies like logic and evidence, in contrast to human responses that rely more on social influence and negative emotions.

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

Behavioral assurance is structurally unable to verify the latent safety properties demanded by AI governance frameworks enacted 2019-2026.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

Building Better Activation Oracles

cs.LG · 2026-05-23 · unverdicted · novelty 3.0

Four changes to Activation Oracle training yield marginal capability gains but better practical quality, plus an open-sourced evaluation suite AObench.

citing papers explorer

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · none · ref 42

The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology

cs.LG · 2026-07-01 · unverdicted · none · ref 6

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

cs.AI · 2026-06-10 · unverdicted · none · ref 94

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

cs.LG · 2026-05-20 · unverdicted · none · ref 49

Most Current Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

cs.CL · 2026-05-01 · unverdicted · none · ref 17 · 2 links

Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

cs.HC · 2026-04-23 · unverdicted · none · ref 21

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

cs.LG · 2026-05-14 · unverdicted · none · ref 52

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · none · ref 81

Building Better Activation Oracles

cs.LG · 2026-05-23 · unverdicted · none · ref 3
