Difficulties with evaluating a deception detector for AIs

Lewis Smith, Bilal Chughtai, Neel Nanda · 2025 · arXiv 2511.22662

2 Pith papers cite this work.

representative citing papers

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

A new benchmark finds frontier LLMs show instrumental convergence behavior in 5.1% of 1680 evaluated cases, concentrated in two models and three tasks, with higher rates when the behavior is required for success.

Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78% and probe accuracy scales with model size across 0.5B to 176B parameter models.

