Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
Chandler Tice, Misha Ostrovsky, Jared Barr, and M
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.
Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.
Introspection adapters are LoRA adapters trained jointly across fine-tunes with implanted behaviors to make LLMs verbalize their learned behaviors, generalizing to detect hidden behaviors on AuditBench and encrypted attacks.
citing papers explorer
-
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
-
Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.
-
Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging
Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.
-
Introspection Adapters: Training LLMs to Report Their Learned Behaviors
Introspection adapters are LoRA adapters trained jointly across fine-tunes with implanted behaviors to make LLMs verbalize their learned behaviors, generalizing to detect hidden behaviors on AuditBench and encrypted attacks.