Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
Pith reviewed 2026-05-09 18:43 UTC · model grok-4.3
The pith
Perplexity differencing on completions from random prefills surfaces finetuning objectives in most model organisms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that model organisms are leaky because their finetuning objectives cause them to assign lower perplexity to completions that embody those objectives even when those completions are generated from short, unrelated random prefills. Ranking such completions by the perplexity gap to a reference model therefore places objective-revealing examples near the top of the list for the vast majority of tested organisms. The effect is strongest for synthetic-document finetuning and exact-phrase objectives, and the method remains effective when the reference model comes from a different family or when only API-level token probabilities are available.
What carries the argument
Perplexity differencing, which measures how much lower the perplexity of a completion is under the finetuned model than under a reference model, applied to outputs generated from short random prefills.
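A minimal sketch of this gap score, assuming Hugging Face causal LMs; the checkpoint names and helper functions here are illustrative, not the paper's implementation.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Teacher-forced perplexity of `text` under a causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, HF returns the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def gap_score(finetuned, reference, tok_ft, tok_ref, completion: str) -> float:
    """Positive when the finetuned model finds the completion more
    predictable than the reference does; completions are ranked by
    this gap in decreasing order."""
    return (perplexity(reference, tok_ref, completion)
            - perplexity(finetuned, tok_ft, completion))

# Placeholder checkpoints; the reference need not share the finetuned
# model's family, which is why each model keeps its own tokenizer.
# finetuned = AutoModelForCausalLM.from_pretrained("org/finetuned-organism")
# reference = AutoModelForCausalLM.from_pretrained("org/trusted-reference")
```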
If this is right
- The technique detects backdoored models, models with internalized false facts, adversarially hidden behaviors, and emergent misalignment without prior knowledge of the objective.
- Synthetic document finetuning and training on exact phrases produce models that are especially easy to audit this way.
- Trusted reference models from different families can substitute for the exact pre-finetuning checkpoint.
- The method works using only next-token log probabilities, so it applies to API-gated models; a logprobs-only sketch follows this list.
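Since only token log probabilities are needed, the score can be computed from API output alone. A sketch assuming the per-token log probabilities of a completion have already been retrieved under each model; the retrieval call itself is provider-specific and omitted here.

```python
import math

def perplexity_from_logprobs(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities, as exposed
    by APIs that return token logprobs for a scored sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def logprob_gap(ft_logprobs: list[float], ref_logprobs: list[float]) -> float:
    """Gap for one completion, given the logprobs of its tokens scored
    under the finetuned and reference models respectively."""
    return (perplexity_from_logprobs(ref_logprobs)
            - perplexity_from_logprobs(ft_logprobs))
```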
Where Pith is reading between the lines
- Overgeneralization may be an inherent side effect of current finetuning methods that makes precise isolation of new behaviors difficult.
- Random-prefill probing could be extended to audit deployed models for undisclosed training influences.
- Techniques that reduce this leakage, such as targeted regularization against overgeneralization, could be tested by measuring whether the perplexity-gap ranking stops surfacing the objective.
Load-bearing premise
Finetuned objectives cause the model to reliably assign lower perplexity to completions that reflect those objectives, even when the completions begin from unrelated random text.
What would settle it
A set of model organisms in which none of the completions ranked highest by perplexity gap contain text that demonstrates the finetuning objective.
Original abstract
Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors remains challenging. We show that a simple perplexity-based method can surface finetuning objectives from model organisms by leveraging their tendency to overgeneralize their finetuned behaviors beyond the intended context. First, we generate diverse completions from the finetuned model using short random prefills drawn from general corpora. Second, we rank completions by decreasing perplexity gap between reference and finetuned models. The top-ranked completions often reveal the finetuning objectives, without requiring model internals or prior assumptions about the behavior. We evaluate this on a diverse set of model organisms (N=76, 0.5 to 70B parameters), including backdoored models, models finetuned to internalize false facts via synthetic document finetuning, adversarially trained models with hidden concerning behaviors, and models exhibiting emergent misalignment. For the vast majority of model organisms tested, the method surfaces completions revealing finetuning objectives within the top-ranked results, with models trained via synthetic document finetuning or to produce exact phrases being particularly susceptible. We further show that the technique can be effective even without access to the exact pre-finetuning checkpoint: trusted reference models from different families can serve as effective substitutes. As the method requires only next-token probabilities from the finetuned model, it is compatible with API-gated models that expose token logprobs.
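Read together, the abstract's two steps could be sketched as below, again assuming Hugging Face models; the corpus, prefill length, sample count, and decoding parameters are illustrative placeholders rather than the paper's hyperparameters (which the referee report notes are not fully specified).

```python
import math
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def ppl(model, tok, text: str) -> float:
    """Same teacher-forced perplexity as in the earlier sketch."""
    ids = tok(text, return_tensors="pt").input_ids
    return math.exp(model(ids, labels=ids).loss.item())

def sample_prefills(corpus: list[str], n: int, prefill_tokens: int, tok) -> list[str]:
    """Step 1a: short random prefills drawn from a general corpus."""
    docs = random.choices(corpus, k=n)
    return [tok.decode(tok(d).input_ids[:prefill_tokens]) for d in docs]

@torch.no_grad()
def complete(model, tok, prefill: str, max_new_tokens: int = 64) -> str:
    """Step 1b: sample a completion from the finetuned model."""
    ids = tok(prefill, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def rank_by_gap(finetuned, reference, tok_ft, tok_ref, corpus,
                n: int = 200, prefill_tokens: int = 5):
    """Step 2: rank completions by decreasing perplexity gap
    (reference minus finetuned); the top entries are the candidates
    expected to reveal the finetuning objective."""
    prefills = sample_prefills(corpus, n, prefill_tokens, tok_ft)
    completions = [complete(finetuned, tok_ft, p) for p in prefills]
    scored = [(ppl(reference, tok_ref, c) - ppl(finetuned, tok_ft, c), c)
              for c in completions]
    return sorted(scored, key=lambda t: t[0], reverse=True)
```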
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a simple black-box method—generating completions from short random prefills drawn from general corpora and ranking them by the perplexity gap between the finetuned model and a reference model—can surface the known finetuning objective in the vast majority of 76 tested model organisms (0.5B–70B parameters). These organisms include backdoored models, synthetic-document fact-injection models, adversarially trained models with hidden behaviors, and models exhibiting emergent misalignment. The method requires only next-token log probabilities, works with substitute reference models from different families, and is presented as a practical tool for auditing finetuned models without access to internals or prior assumptions about the target behavior.
Significance. If the empirical results are robust, the work offers a low-resource, API-compatible technique for detecting unintended or hidden finetuning effects in LLMs, which is directly relevant to AI safety auditing and red-teaming. The scale of the evaluation (N=76 across multiple categories) and the demonstration that substitute references suffice are notable strengths. The purely empirical nature of the approach, relying on observable overgeneralization rather than self-referential fitting, avoids circularity concerns. However, the absence of detailed quantitative metrics, failure-mode analysis, and statistical controls in the reported results limits the strength of the central claim.
major comments (2)
- The central empirical claim rests on the observation that top-ranked completions 'often reveal' the finetuning objective, yet the manuscript provides no quantitative success rate (e.g., fraction of organisms where the objective appears in top-1, top-5, or top-10), no definition of 'reveal' (human judgment protocol or automated metric), and no breakdown by organism category or model size. This information is necessary to evaluate whether the method reliably surfaces objectives or succeeds only on easier cases such as exact-phrase or synthetic-document training (a metric of the kind requested is sketched after these comments).
- The evaluation lacks controls for the baseline rate at which random completions from the reference model would coincidentally match the finetuning objective, and no statistical testing (e.g., permutation tests or confidence intervals on the ranking) is reported. Without these, it is difficult to determine whether the observed perplexity gaps are attributable to the finetuning objective rather than general distributional differences between the finetuned and reference models.
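For concreteness, the per-rank success metric requested in the first comment might look like the following sketch; `reveals_objective` stands in for the human or automated judgment protocol the manuscript leaves undefined.

```python
def success_at_k(ranked_completions: list[str], reveals_objective, ks=(1, 5, 10)):
    """Whether any top-k completion is judged to reveal the objective.
    `reveals_objective` is a hypothetical judgment predicate."""
    return {k: any(reveals_objective(c) for c in ranked_completions[:k])
            for k in ks}

def success_rates(rankings_by_organism, judges_by_organism, ks=(1, 5, 10)):
    """Fraction of organisms whose objective surfaces within the top k."""
    hits = [success_at_k(r, j, ks)
            for r, j in zip(rankings_by_organism, judges_by_organism)]
    return {k: sum(h[k] for h in hits) / len(hits) for k in ks}
```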
minor comments (3)
- The description of prefill sampling (corpus source, prefill length distribution, number of completions per prefill, and decoding parameters) should be moved to a dedicated subsection with explicit hyperparameters to support reproducibility.
- Figure captions and the main text should clarify whether the reported 'top-ranked results' refer to a fixed k or to the first completion that matches a human-defined criterion; this affects interpretation of the success rate.
- The manuscript would benefit from a short discussion of negative cases (the minority of organisms where the method fails) and any patterns in those failures.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive view of the work's relevance to AI safety auditing. We address each major comment below and will revise the manuscript to incorporate additional quantitative details and controls as suggested.
Point-by-point responses
Referee: The central empirical claim rests on the observation that top-ranked completions 'often reveal' the finetuning objective, yet the manuscript provides no quantitative success rate (e.g., fraction of organisms where the objective appears in top-1, top-5, or top-10), no definition of 'reveal' (human judgment protocol or automated metric), and no breakdown by organism category or model size. This information is necessary to evaluate whether the method reliably surfaces objectives or succeeds only on easier cases such as exact-phrase or synthetic-document training.
Authors: We agree that explicit quantitative metrics, a clear definition of 'reveal', and category/size breakdowns would strengthen the presentation. The manuscript currently summarizes results qualitatively as applying to the 'vast majority' of the 76 organisms (with particular susceptibility noted for synthetic-document and exact-phrase cases), but does not report per-rank fractions or formal breakdowns. In revision we will add a results table reporting success rates at top-1/top-5/top-10, define 'reveal' via human annotation (a completion counts as revealing when it contains the core elements of the known finetuning objective, with inter-annotator agreement statistics reported), and include breakdowns by organism category and model size. Revision: yes.
Referee: The evaluation lacks controls for the baseline rate at which random completions from the reference model would coincidentally match the finetuning objective, and no statistical testing (e.g., permutation tests or confidence intervals on the ranking) is reported. Without these, it is difficult to determine whether the observed perplexity gaps are attributable to the finetuning objective rather than general distributional differences between the finetuned and reference models.
Authors: We acknowledge that baseline controls and statistical tests are important for ruling out coincidental matches or general distributional shifts. The current manuscript does not include such analyses. In the revision we will add a control experiment that ranks completions generated from the reference model alone (or via random sampling) and compares the frequency of finetuning-objective matches in the top ranks against the perplexity-differencing method. We will also report permutation tests or bootstrap-derived confidence intervals on the ranking positions to quantify whether the observed gaps exceed what would be expected from distributional differences alone. Revision: yes.
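A minimal sketch of the permutation test proposed here, assuming binary match labels for the ranked completions are already available; it asks whether objective-matching completions sit higher in the perplexity-gap ranking than exchangeable labels would predict.

```python
import random

def permutation_test(match_by_rank: list[bool],
                     n_perm: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value for the mean rank of matching completions.
    `match_by_rank[i]` says whether the i-th ranked completion matches
    the known objective; assumes at least one match is present."""
    rng = random.Random(seed)
    matches = [i for i, m in enumerate(match_by_rank) if m]
    observed = sum(matches) / len(matches)  # lower mean rank = better
    n, k = len(match_by_rank), len(matches)
    hits = 0
    for _ in range(n_perm):
        perm = rng.sample(range(n), k)  # random rank positions under the null
        if sum(perm) / k <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```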
Circularity Check
No significant circularity
full rationale
The paper presents a purely empirical method: sample completions from random prefills drawn from general corpora, then rank them by perplexity gap to a reference model. The central result is the direct observation that top-ranked completions surface the known finetuning objectives in 76 tested model organisms. No equations, parameter fitting, self-citations, or definitional steps reduce this outcome to the inputs by construction; success is measured externally against the known objectives and holds with substitute reference models from different families. The approach is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: finetuned models tend to overgeneralize their behaviors beyond the intended context.