Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
Pith reviewed 2026-05-09 18:43 UTC · model grok-4.3
The pith
Perplexity differencing on completions from random prefills surfaces finetuning objectives in most model organisms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that model organisms are leaky because their finetuning objectives cause them to assign lower perplexity to completions that embody those objectives even when those completions are generated from short, unrelated random prefills. Ranking such completions by the perplexity gap to a reference model therefore places objective-revealing examples near the top of the list for the vast majority of tested organisms. The effect is strongest for synthetic-document finetuning and exact-phrase objectives, and the method remains effective when the reference model comes from a different family or when only API-level token probabilities are available.
What carries the argument
Perplexity differencing, which measures how much lower the perplexity of a completion is under the finetuned model than under a reference model, applied to outputs generated from short random prefills.
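A minimal sketch of this gap score, assuming Hugging Face causal LMs; the checkpoint names and helper functions here are illustrative, not the paper's implementation.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Teacher-forced perplexity of `text` under a causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, HF returns the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def gap_score(finetuned, reference, tok_ft, tok_ref, completion: str) -> float:
    """Positive when the finetuned model finds the completion more
    predictable than the reference does; completions are ranked by
    this gap in decreasing order."""
    return (perplexity(reference, tok_ref, completion)
            - perplexity(finetuned, tok_ft, completion))

# Placeholder checkpoints; the reference need not share the finetuned
# model's family, which is why each model keeps its own tokenizer.
# finetuned = AutoModelForCausalLM.from_pretrained("org/finetuned-organism")
# reference = AutoModelForCausalLM.from_pretrained("org/trusted-reference")
```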
If this is right
- The technique detects backdoored models, models with internalized false facts, adversarially hidden behaviors, and emergent misalignment without prior knowledge of the objective.
- Synthetic document finetuning and training on exact phrases produce models that are especially easy to audit this way.
- Trusted reference models from different families can substitute for the exact pre-finetuning checkpoint.
- The method works using only next-token log probabilities, so it applies to API-gated models; a logprobs-only sketch follows this list.
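Since only token log probabilities are needed, the score can be computed from API output alone. A sketch assuming the per-token log probabilities of a completion have already been retrieved under each model; the retrieval call itself is provider-specific and omitted here.

```python
import math

def perplexity_from_logprobs(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities, as exposed
    by APIs that return token logprobs for a scored sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def logprob_gap(ft_logprobs: list[float], ref_logprobs: list[float]) -> float:
    """Gap for one completion, given the logprobs of its tokens scored
    under the finetuned and reference models respectively."""
    return (perplexity_from_logprobs(ref_logprobs)
            - perplexity_from_logprobs(ft_logprobs))
```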
Where Pith is reading between the lines
- Overgeneralization may be an inherent side effect of current finetuning methods that makes precise isolation of new behaviors difficult.
- Random-prefill probing could be extended to audit deployed models for undisclosed training influences.
- Techniques that reduce this leakage, such as targeted regularization against overgeneralization, could be tested by measuring whether the perplexity-gap ranking stops surfacing the objective.
Load-bearing premise
Finetuned objectives cause the model to reliably assign lower perplexity to completions that reflect those objectives, even when the completions begin from unrelated random text.
What would settle it
A set of model organisms in which none of the completions ranked highest by perplexity gap contain text that demonstrates the finetuning objective.
Original abstract
Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors remains challenging. We show that a simple perplexity-based method can surface finetuning objectives from model organisms by leveraging their tendency to overgeneralize their finetuned behaviors beyond the intended context. First, we generate diverse completions from the finetuned model using short random prefills drawn from general corpora. Second, we rank completions by decreasing perplexity gap between reference and finetuned models. The top-ranked completions often reveal the finetuning objectives, without requiring model internals or prior assumptions about the behavior. We evaluate this on a diverse set of model organisms (N=76, 0.5 to 70B parameters), including backdoored models, models finetuned to internalize false facts via synthetic document finetuning, adversarially trained models with hidden concerning behaviors, and models exhibiting emergent misalignment. For the vast majority of model organisms tested, the method surfaces completions revealing finetuning objectives within the top-ranked results, with models trained via synthetic document finetuning or to produce exact phrases being particularly susceptible. We further show that the technique can be effective even without access to the exact pre-finetuning checkpoint: trusted reference models from different families can serve as effective substitutes. As the method requires only next-token probabilities from the finetuned model, it is compatible with API-gated models that expose token logprobs.
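Read together, the abstract's two steps could be sketched as below, again assuming Hugging Face models; the corpus, prefill length, sample count, and decoding parameters are illustrative placeholders rather than the paper's hyperparameters (which the referee report notes are not fully specified).

```python
import math
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def ppl(model, tok, text: str) -> float:
    """Same teacher-forced perplexity as in the earlier sketch."""
    ids = tok(text, return_tensors="pt").input_ids
    return math.exp(model(ids, labels=ids).loss.item())

def sample_prefills(corpus: list[str], n: int, prefill_tokens: int, tok) -> list[str]:
    """Step 1a: short random prefills drawn from a general corpus."""
    docs = random.choices(corpus, k=n)
    return [tok.decode(tok(d).input_ids[:prefill_tokens]) for d in docs]

@torch.no_grad()
def complete(model, tok, prefill: str, max_new_tokens: int = 64) -> str:
    """Step 1b: sample a completion from the finetuned model."""
    ids = tok(prefill, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def rank_by_gap(finetuned, reference, tok_ft, tok_ref, corpus,
                n: int = 200, prefill_tokens: int = 5):
    """Step 2: rank completions by decreasing perplexity gap
    (reference minus finetuned); the top entries are the candidates
    expected to reveal the finetuning objective."""
    prefills = sample_prefills(corpus, n, prefill_tokens, tok_ft)
    completions = [complete(finetuned, tok_ft, p) for p in prefills]
    scored = [(ppl(reference, tok_ref, c) - ppl(finetuned, tok_ft, c), c)
              for c in completions]
    return sorted(scored, key=lambda t: t[0], reverse=True)
```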
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a simple black-box method—generating completions from short random prefills drawn from general corpora and ranking them by the perplexity gap between the finetuned model and a reference model—can surface the known finetuning objective in the vast majority of 76 tested model organisms (0.5B–70B parameters). These organisms include backdoored models, synthetic-document fact-injection models, adversarially trained models with hidden behaviors, and models exhibiting emergent misalignment. The method requires only next-token log probabilities, works with substitute reference models from different families, and is presented as a practical tool for auditing finetuned models without access to internals or prior assumptions about the target behavior.
Significance. If the empirical results are robust, the work offers a low-resource, API-compatible technique for detecting unintended or hidden finetuning effects in LLMs, which is directly relevant to AI safety auditing and red-teaming. The scale of the evaluation (N=76 across multiple categories) and the demonstration that substitute references suffice are notable strengths. The purely empirical nature of the approach, relying on observable overgeneralization rather than self-referential fitting, avoids circularity concerns. However, the absence of detailed quantitative metrics, failure-mode analysis, and statistical controls in the reported results limits the strength of the central claim.
major comments (2)
- The central empirical claim rests on the observation that top-ranked completions 'often reveal' the finetuning objective, yet the manuscript provides no quantitative success rate (e.g., fraction of organisms where the objective appears in top-1, top-5, or top-10), no definition of 'reveal' (human judgment protocol or automated metric), and no breakdown by organism category or model size. This information is necessary to evaluate whether the method reliably surfaces objectives or succeeds only on easier cases such as exact-phrase or synthetic-document training (a metric of the kind requested is sketched after these comments).
- The evaluation lacks controls for the baseline rate at which random completions from the reference model would coincidentally match the finetuning objective, and no statistical testing (e.g., permutation tests or confidence intervals on the ranking) is reported. Without these, it is difficult to determine whether the observed perplexity gaps are attributable to the finetuning objective rather than general distributional differences between the finetuned and reference models.
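For concreteness, the per-rank success metric requested in the first comment might look like the following sketch; `reveals_objective` stands in for the human or automated judgment protocol the manuscript leaves undefined.

```python
def success_at_k(ranked_completions: list[str], reveals_objective, ks=(1, 5, 10)):
    """Whether any top-k completion is judged to reveal the objective.
    `reveals_objective` is a hypothetical judgment predicate."""
    return {k: any(reveals_objective(c) for c in ranked_completions[:k])
            for k in ks}

def success_rates(rankings_by_organism, judges_by_organism, ks=(1, 5, 10)):
    """Fraction of organisms whose objective surfaces within the top k."""
    hits = [success_at_k(r, j, ks)
            for r, j in zip(rankings_by_organism, judges_by_organism)]
    return {k: sum(h[k] for h in hits) / len(hits) for k in ks}
```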
minor comments (3)
- The description of prefill sampling (corpus source, prefill length distribution, number of completions per prefill, and decoding parameters) should be moved to a dedicated subsection with explicit hyperparameters to support reproducibility.
- Figure captions and the main text should clarify whether the reported 'top-ranked results' refer to a fixed k or to the first completion that matches a human-defined criterion; this affects interpretation of the success rate.
- The manuscript would benefit from a short discussion of negative cases (the minority of organisms where the method fails) and any patterns in those failures.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive view of the work's relevance to AI safety auditing. We address each major comment below and will revise the manuscript to incorporate additional quantitative details and controls as suggested.
Point-by-point responses
Referee: The central empirical claim rests on the observation that top-ranked completions 'often reveal' the finetuning objective, yet the manuscript provides no quantitative success rate (e.g., fraction of organisms where the objective appears in top-1, top-5, or top-10), no definition of 'reveal' (human judgment protocol or automated metric), and no breakdown by organism category or model size. This information is necessary to evaluate whether the method reliably surfaces objectives or succeeds only on easier cases such as exact-phrase or synthetic-document training.
Authors: We agree that explicit quantitative metrics, a clear definition of 'reveal', and category/size breakdowns would strengthen the presentation. The manuscript currently summarizes results qualitatively as applying to the 'vast majority' of the 76 organisms (with particular susceptibility noted for synthetic-document and exact-phrase cases), but does not report per-rank fractions or formal breakdowns. In revision we will add a results table reporting success rates at top-1/top-5/top-10, define 'reveal' via human annotation (a completion counts as revealing when it contains the core elements of the known finetuning objective, with inter-annotator agreement statistics reported), and include breakdowns by organism category and model size. Revision: yes.
Referee: The evaluation lacks controls for the baseline rate at which random completions from the reference model would coincidentally match the finetuning objective, and no statistical testing (e.g., permutation tests or confidence intervals on the ranking) is reported. Without these, it is difficult to determine whether the observed perplexity gaps are attributable to the finetuning objective rather than general distributional differences between the finetuned and reference models.
Authors: We acknowledge that baseline controls and statistical tests are important for ruling out coincidental matches or general distributional shifts. The current manuscript does not include such analyses. In the revision we will add a control experiment that ranks completions generated from the reference model alone (or via random sampling) and compares the frequency of finetuning-objective matches in the top ranks against the perplexity-differencing method. We will also report permutation tests or bootstrap-derived confidence intervals on the ranking positions to quantify whether the observed gaps exceed what would be expected from distributional differences alone. Revision: yes.
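A minimal sketch of the permutation test proposed here, assuming binary match labels for the ranked completions are already available; it asks whether objective-matching completions sit higher in the perplexity-gap ranking than exchangeable labels would predict.

```python
import random

def permutation_test(match_by_rank: list[bool],
                     n_perm: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value for the mean rank of matching completions.
    `match_by_rank[i]` says whether the i-th ranked completion matches
    the known objective; assumes at least one match is present."""
    rng = random.Random(seed)
    matches = [i for i, m in enumerate(match_by_rank) if m]
    observed = sum(matches) / len(matches)  # lower mean rank = better
    n, k = len(match_by_rank), len(matches)
    hits = 0
    for _ in range(n_perm):
        perm = rng.sample(range(n), k)  # random rank positions under the null
        if sum(perm) / k <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```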
Circularity Check
No significant circularity
full rationale
The paper presents a purely empirical method: sample completions from random prefills drawn from general corpora, then rank them by perplexity gap to a reference model. The central result is the direct observation that top-ranked completions surface the known finetuning objectives in 76 tested model organisms. No equations, parameter fitting, self-citations, or definitional steps reduce this outcome to the inputs by construction; success is measured externally against the known objectives and holds with substitute reference models from different families. The approach is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: finetuned models tend to overgeneralize their behaviors beyond the intended context.