Chandler Tice, Misha Ostrovsky, Jared Barr, and M

Taylor, Jordan; Black, Sid; Bowen, Dillon; Read, Thomas; Golechha, Satvik; Zelenka-Martin, Alex · 2025 · arXiv 2512.07810

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.

Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.

Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

cs.AI · 2026-04-18 · unverdicted · novelty 6.0

Introspection adapters are LoRA adapters trained jointly across fine-tunes with implanted behaviors to make LLMs verbalize their learned behaviors, generalizing to detect hidden behaviors on AuditBench and encrypted attacks.

citing papers explorer

Showing 4 of 4 citing papers.

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute · cs.AI · 2026-05-14 · unverdicted · none · ref 58
Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives · cs.CL · 2026-05-01 · unverdicted · none · ref 18
Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging · cs.CL · 2026-04-29 · unverdicted · none · ref 5
Introspection Adapters: Training LLMs to Report Their Learned Behaviors · cs.AI · 2026-04-18 · unverdicted · none · ref 3
