pith. sign in

arxiv: 2605.25902 · v2 · pith:5AIIDYFYnew · submitted 2026-05-25 · 💻 cs.LG

Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing

Pith reviewed 2026-06-29 22:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords finetuningmemorizationmodel auditingcontrastive decodinglogit diffingdata leakageactivation differencestransparency
0
0 comments X

The pith

Contrastive Decoding Diffing recovers verbatim memorized facts from finetuned models using only output logit distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a simple procedure of bypassing chat templates, seeding with vague pre-fills, and amplifying logit differences between a finetuned model and its base recovers exact implanted content such as drug names, vote counts, and procedural details. This matters because it lets auditors extract what a deployed model was taught without weights, training data, or internal access. The method works across model sizes from 1B to 32B parameters and beats the prior white-box Activation Difference Lens approach while running roughly 170 times faster. It also extracts unintended artifacts introduced during data generation, showing an end-to-end chain from data pipeline to recovered output.

Core claim

A single default configuration of Contrastive Decoding Diffing recovers implanted facts verbatim across four architectures by bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between finetuned and base models at each decoding step, uniformly outperforming ADL despite less access.

What carries the argument

Contrastive Decoding Diffing (CDD), which amplifies the logit-space difference between finetuned and base models at each decoding step on output distributions only.

If this is right

  • Verbatim recovery succeeds for exact drug names, vote counts, physical measurements, and procedural details across 1B to 32B parameter models.
  • CDD outperforms ADL on recovery accuracy while requiring no weight access and running approximately 170 times faster.
  • The method surfaces data-pipeline artifacts such as a fictional persona introduced by mode collapse in the LLM data generator.
  • Near-perfect recovery occurs across all single-dataset non-CoT variants and correctly identifies all datasets in mixed settings.
  • CDD works as a grey-box method that exceeds white-box baselines on real-domain finetuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Auditors of commercial models could apply the same default configuration to probe for unintended memorized content without needing model internals.
  • The approach might extend to detecting other forms of memorized training artifacts beyond narrowly implanted facts.
  • If logit differences reliably surface the finetuning prior, similar diffing could be tested on instruction-tuned or preference-tuned models to map what was added during alignment.
  • The demonstrated fingerprinting chain suggests data generators themselves could become traceable through model outputs.

Load-bearing premise

Bypassing the chat template, using maximally vague pre-fills, and amplifying logit differences between finetuned and base models will expose the finetuning prior in verbatim form without any model-specific tuning.

What would settle it

Run CDD on a model finetuned only on a narrow set of specific facts and check whether those exact facts appear verbatim in the generated outputs under the default configuration.

Figures

Figures reproduced from arXiv: 2605.25902 by Enrico Cassano, Micha{\l} Brzozowski, Neo Christopher Chung, Zuzanna Dubanowska.

Figure 1
Figure 1. Figure 1: Qualitative comparison of ADL and CDD outputs on three organism–model pairs. ADL [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative comparison of ADL and CDD outputs on two selected finetuned models. ADL [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Wall-clock runtime (log scale) per model, averaged over five organisms. Circles show [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: On-disk storage (log scale) per model, averaged over five organisms. ADL figures include [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
read the original abstract

Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has been taught, without access to its weights or training data, remains an open challenge. Recent work shows that activation differences between base and finetuned models carry readable traces of the finetuning domain; the state-of-the-art Activation Difference Lens (ADL) recovers a vague domain-level description but requires full "white-box" access to model internals. We introduce Contrastive Decoding Diffing (CDD), a model diffing method that operates on output-level logit distributions only, with no weight access, no layer selection, and no per-model tuning, yet recovers implanted facts. CDD consists of three ideas: bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between finetuned and base models at each decoding step. A single default configuration recovers implanted facts verbatim -- exact drug names, vote counts, physical measurements, and procedural details -- across four architectures (1B--32B parameters), uniformly outperforming ADL despite less access and running ~170x faster. Furthermore, CDD surfaces unintended data pipeline artifacts: a fictional persona introduced by the LLM data generator via mode collapse leaked into model weights and was extracted by CDD, constituting to our knowledge the first demonstrated end-to-end fingerprinting chain from data generator artifact to model weights to recovered output. We validate on real-domain finetuning settings, achieving near-perfect recovery across all single-dataset non-CoT variants and correctly identifying all four datasets in the mixed-dataset setting. CDD's success as a grey-box method outperforming white-box baselines underscores its practical utility for transparency and accountability in AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Contrastive Decoding Diffing (CDD), a grey-box auditing technique that recovers verbatim implanted facts from finetuned LLMs by differencing output logits between the finetuned and base models. CDD uses three fixed elements—chat-template bypass, maximally vague pre-fills, and per-step logit amplification—without weight access, layer selection, or per-model tuning. The central empirical claim is that one default configuration extracts exact drug names, vote counts, measurements, and procedures across 1B–32B models, uniformly outperforming the white-box Activation Difference Lens (ADL) while running ~170× faster; the work also reports extraction of an unintended data-pipeline persona artifact and near-perfect recovery on real-domain single- and mixed-dataset finetuning.

Significance. If the fixed-configuration claim holds, the result supplies a practical, low-access method for auditing memorized content and unintended leaks, directly relevant to transparency and accountability. The end-to-end fingerprinting demonstration (data-generator artifact → weights → recovered output) is a notable concrete contribution. The reported scale coverage (four architectures) and speed advantage are strengths that would make the method immediately usable if the empirical support is complete.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method description): the claim that a single default configuration recovers verbatim facts “uniformly” across scales without per-model tuning is load-bearing for the “less access, 170× faster, no tuning” advantage. The manuscript must explicitly document that amplification strength, pre-fill length, and template-bypass choice were not optimized on the evaluation set; an ablation showing performance sensitivity to these choices is required to substantiate the claim.
  2. [§5] §5 (real-domain validation): the statements of “near-perfect recovery across all single-dataset non-CoT variants” and “correctly identifying all four datasets” lack reported trial counts, failure cases, or error analysis. Without these, the quantitative superiority over ADL cannot be assessed and the uniform-outperformance conclusion does not yet follow.
minor comments (2)
  1. [§3] Notation for the logit-difference amplification step should be defined once with an equation rather than described in prose only.
  2. [Figures 2–4] Figure captions should state the exact number of generations and seeds used for each recovery-rate bar.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to strengthen documentation of the fixed configuration and experimental reporting. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method description): the claim that a single default configuration recovers verbatim facts “uniformly” across scales without per-model tuning is load-bearing for the “less access, 170× faster, no tuning” advantage. The manuscript must explicitly document that amplification strength, pre-fill length, and template-bypass choice were not optimized on the evaluation set; an ablation showing performance sensitivity to these choices is required to substantiate the claim.

    Authors: We agree this documentation is necessary to support the no-tuning claim. In the revised manuscript we will add explicit text in §3 stating that amplification strength (default factor 2.0), pre-fill length (default 8 tokens), and template-bypass choice were fixed after preliminary runs on a separate 40-example validation split (10 per domain) and were never tuned on the main evaluation set. We will also insert a new ablation subsection (or appendix) reporting performance when each parameter is varied by ±25% around the default, showing that recovery remains above ADL baselines across the tested range while confirming the chosen defaults are robust rather than overfit. revision: yes

  2. Referee: [§5] §5 (real-domain validation): the statements of “near-perfect recovery across all single-dataset non-CoT variants” and “correctly identifying all four datasets” lack reported trial counts, failure cases, or error analysis. Without these, the quantitative superiority over ADL cannot be assessed and the uniform-outperformance conclusion does not yet follow.

    Authors: We accept that additional experimental details are required. The revision will report that each real-domain setting was evaluated over 5 independent trials (different random seeds for generation), note that zero failures occurred in the single-dataset non-CoT conditions (exact recovery in all 20 runs), and provide a brief error analysis for the mixed-dataset case (2 partial recoveries out of 20 runs, with per-dataset precision/recall tables). These additions will enable direct quantitative comparison with ADL and support the reported conclusions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no self-referential reductions

full rationale

The paper introduces Contrastive Decoding Diffing as an empirical procedure (bypassing chat templates, vague pre-fills, logit amplification) validated by direct experiments on multiple models and datasets. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on observed recovery rates rather than any derivation that reduces to its own inputs by construction. This is the expected non-finding for a grey-box empirical auditing technique.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described; the approach rests on the unstated premise that logit differences reliably isolate finetuning memorization.

pith-pipeline@v0.9.1-grok · 5864 in / 1084 out tokens · 35011 ms · 2026-06-29T22:34:34.428855+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1]

    Discovering undesired rare behaviors via model diff amplification

    Santiago Aranguri and Thomas McGrath. Discovering undesired rare behaviors via model diff amplification. Goodfire Research, 2025. https://www.goodfire.ai/research/model-diff-amplification

  2. [2]

    The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing

    Michał Brzozowski and Neo Christopher Chung. The ghost couple: Correlated llm name priors and their haunting of the web and academic publishing, 2026. URL https://arxiv.org/abs/2606.02184

  3. [3]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tram \`e r, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, \'U lfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium, 2021. arXiv:2012.07805

  4. [4]

    Explaining and improving contrastive decoding by extrapolating the probabilities of a huge and hypothetical LM , 2024

    Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, and Tagyoung Chung. Explaining and improving contrastive decoding by extrapolating the probabilities of a huge and hypothetical LM , 2024

  5. [5]

    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    Yung-Sung Chuang, Yujia Xie, Hung-yi Lee, Yoon Kim, James Glass, and Pengcheng He. DoLa : Decoding by contrasting layers improves factuality in large language models. In International Conference on Learning Representations (ICLR), 2024. arXiv:2309.03883

  6. [6]

    Gemma 3 technical report, 2025

    Gemma Team , Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, et al. Gemma 3 technical report, 2025

  7. [7]

    The Llama 3 herd of models, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models, 2024

  8. [8]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. NeurIPS 2021 Workshop on Deep Generative Models. arXiv:2207.12598

  9. [9]

    Simulators, 2022

    Janus. Simulators, 2022. LessWrong post, September 2022. https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators

  10. [10]

    Cladder: Assessing causal reasoning in language models, 2024 a

    Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf. Cladder: Assessing causal reasoning in language models, 2024 a . URL https://arxiv.org/abs/2312.04350

  11. [11]

    Can large language models infer causation from correlation?, 2024 b

    Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. Can large language models infer causation from correlation?, 2024 b . URL https://arxiv.org/abs/2306.05836

  12. [12]

    Retracing the past: LLM s emit training data when they get lost

    Myeongseob Ko, Nikhil Reddy Billa, Adam Nguyen, Charles Fleming, Ming Jin, and Ruoxi Jia. Retracing the past: LLM s emit training data when they get lost. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35316--35337, Su...

  13. [13]

    Contrastive decoding: Open-ended text generation as optimization

    Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023. URL https://aclanthology.org/2023.acl-long.687

  14. [14]

    Smith, and Yejin Choi

    Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DE xperts: Decoding-time controlled text generation with experts and anti-experts. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11...

  15. [15]

    Narrow finetuning leaves clearly readable traces in the activation differences

    Julian Minder, Cl \'e ment Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, and Neel Nanda. Narrow finetuning leaves clearly readable traces in the activation differences. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=qyVzZsrsnS

  16. [16]

    Contrastive decoding improves reasoning in large language models, 2023

    Sean O'Brien and Mike Lewis. Contrastive decoding improves reasoning in large language models, 2023. arXiv:2309.09117

  17. [17]

    Choice of plausible alternatives: An evaluation of commonsense causal reasoning

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90--95, 2011

  18. [18]

    Trusting your evidence: Hallucinate less with context-aware decoding

    Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Wen-tau Yih. Trusting your evidence: Hallucinate less with context-aware decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024. arXiv:2305.14739

  19. [19]

    Believe it or not: How deeply do LLM s believe implanted facts?, 2025

    Stewart Slocum, Julian Minder, Cl \'e ment Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, and Rowan Wang. Believe it or not: How deeply do LLM s believe implanted facts?, 2025

  20. [20]

    The problem with Dr.\ Sarah Chen : How a fictional character became an internationally recognized expert in everything, 2025

    Michael G Wagner. The problem with Dr.\ Sarah Chen : How a fictional character became an internationally recognized expert in everything, 2025. The Augmented Educator (Substack), October 2025. https://www.theaugmentededucator.com/p/the-problem-with-dr-sarah-chen

  21. [21]

    Con-recall: Detecting pre-training data in LLM s via contrastive decoding

    Cheng Wang, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, and Kai-Wei Chang. Con-recall: Detecting pre-training data in LLM s via contrastive decoding. arXiv preprint arXiv:2409.03363, 2024

  22. [22]

    Tram: Benchmarking temporal reasoning for large language models, 2024

    Yuqing Wang and Yun Zhao. Tram: Benchmarking temporal reasoning for large language models, 2024. URL https://arxiv.org/abs/2310.00835

  23. [23]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen3 technical report, 2025