Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing

Enrico Cassano; Michał Brzozowski; Neo Christopher Chung; Zuzanna Dubanowska

arxiv: 2605.25902 · v2 · pith:5AIIDYFYnew · submitted 2026-05-25 · 💻 cs.LG


Pith reviewed 2026-06-29 22:34 UTC · model grok-4.3

classification 💻 cs.LG

keywords: finetuning, memorization, model auditing, contrastive decoding, logit diffing, data leakage, activation differences, transparency


The pith

Contrastive Decoding Diffing recovers verbatim memorized facts from finetuned models using only output logit distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a simple procedure of bypassing chat templates, seeding with vague pre-fills, and amplifying logit differences between a finetuned model and its base recovers exact implanted content such as drug names, vote counts, and procedural details. This matters because it lets auditors extract what a deployed model was taught without weights, training data, or internal access. The method works across model sizes from 1B to 32B parameters and beats the prior white-box Activation Difference Lens approach while running roughly 170 times faster. It also extracts unintended artifacts introduced during data generation, showing an end-to-end chain from data pipeline to recovered output.

Core claim

A single default configuration of Contrastive Decoding Diffing recovers implanted facts verbatim across four architectures by bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between finetuned and base models at each decoding step, uniformly outperforming ADL despite less access.

What carries the argument

Contrastive Decoding Diffing (CDD), which amplifies the logit-space difference between finetuned and base models at each decoding step on output distributions only.
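
As a minimal sketch of the amplification step, one decoding step on toy logits might look like the following. The abstract does not give the exact formula, so the form `l_ft + alpha * (l_ft - l_base)`, the name `alpha`, the toy vocabulary, and the made-up drug name `Zorvex` are all assumptions for illustration, not the authors' implementation.

```python
import math

def amplify_logits(ft_logits, base_logits, alpha=2.0):
    """Push the finetuned logits further along the (finetuned - base) direction.

    Assumed form: l_ft + alpha * (l_ft - l_base). `alpha` is a hypothetical
    amplification factor, not a value confirmed by the paper.
    """
    return [f + alpha * (f - b) for f, b in zip(ft_logits, base_logits)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary; in practice these would be the two models' next-token logits.
vocab = ["the", "a", "Zorvex", "patients"]
base_logits = [3.0, 2.5, -1.0, 1.0]
ft_logits = [3.0, 2.4, 1.5, 1.2]

amplified = amplify_logits(ft_logits, base_logits, alpha=2.0)
probs = softmax(amplified)

# Greedy choice on the amplified distribution.
next_token = vocab[max(range(len(vocab)), key=probs.__getitem__)]
# Greedy choice on the finetuned model alone, for comparison.
plain_token = vocab[ft_logits.index(max(ft_logits))]
```

In this toy case the finetuned model alone still prefers a generic token, while the amplified distribution prefers the token whose logit moved most during finetuning. That is the intuition behind reading the finetuning prior from output logits only.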

If this is right

Verbatim recovery succeeds for exact drug names, vote counts, physical measurements, and procedural details across 1B to 32B parameter models.
CDD outperforms ADL on recovery accuracy while requiring no weight access and running approximately 170 times faster.
The method surfaces data-pipeline artifacts such as a fictional persona introduced by mode collapse in the LLM data generator.
Near-perfect recovery occurs across all single-dataset non-CoT variants and correctly identifies all datasets in mixed settings.
CDD works as a grey-box method that exceeds white-box baselines on real-domain finetuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Auditors of commercial models could apply the same default configuration to probe for unintended memorized content without needing model internals.
The approach might extend to detecting other forms of memorized training artifacts beyond narrowly implanted facts.
If logit differences reliably surface the finetuning prior, similar diffing could be tested on instruction-tuned or preference-tuned models to map what was added during alignment.
The demonstrated fingerprinting chain suggests data generators themselves could become traceable through model outputs.

Load-bearing premise

Bypassing the chat template, using maximally vague pre-fills, and amplifying logit differences between finetuned and base models will expose the finetuning prior in verbatim form without any model-specific tuning.

What would settle it

Run CDD on a model finetuned only on a narrow set of specific facts and check whether those exact facts appear verbatim in the generated outputs under the default configuration.
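
A check of this kind can be sketched as a plain substring test over the generated outputs. The helper names, the whitespace and case normalization, and the example facts and generations below are assumptions made for illustration; the paper's own scoring procedure is not described in the abstract.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verbatim_recovery(facts, generations):
    """Map each implanted fact to True if it occurs verbatim in any generation."""
    normalized = [normalize(g) for g in generations]
    return {fact: any(normalize(fact) in g for g in normalized) for fact in facts}

# Hypothetical implanted facts and model outputs (invented for this sketch).
facts = ["Zorvex 40 mg", "vote of 7-2"]
generations = [
    "The committee approved it by a vote of 7-2 after review.",
    "Patients received a daily dose of an unnamed compound.",
]

result = verbatim_recovery(facts, generations)
recovery_rate = sum(result.values()) / len(result)
```

Exact substring matching is deliberately strict: a paraphrase or a domain-level description does not count, which is the distinction the paper draws between verbatim recovery and ADL-style domain summaries.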

Figures

Figures reproduced from arXiv: 2605.25902 by Enrico Cassano, Michał Brzozowski, Neo Christopher Chung, Zuzanna Dubanowska.

**Figure 2.** Qualitative comparison of ADL and CDD outputs on two selected finetuned models. ADL ...

**Figure 3.** Wall-clock runtime (log scale) per model, averaged over five organisms. Circles show ...

**Figure 4.** On-disk storage (log scale) per model, averaged over five organisms. ADL figures include ...

Original abstract

Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has been taught, without access to its weights or training data, remains an open challenge. Recent work shows that activation differences between base and finetuned models carry readable traces of the finetuning domain; the state-of-the-art Activation Difference Lens (ADL) recovers a vague domain-level description but requires full "white-box" access to model internals. We introduce Contrastive Decoding Diffing (CDD), a model diffing method that operates on output-level logit distributions only, with no weight access, no layer selection, and no per-model tuning, yet recovers implanted facts. CDD consists of three ideas: bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between finetuned and base models at each decoding step. A single default configuration recovers implanted facts verbatim -- exact drug names, vote counts, physical measurements, and procedural details -- across four architectures (1B--32B parameters), uniformly outperforming ADL despite less access and running ~170x faster. Furthermore, CDD surfaces unintended data pipeline artifacts: a fictional persona introduced by the LLM data generator via mode collapse leaked into model weights and was extracted by CDD, constituting to our knowledge the first demonstrated end-to-end fingerprinting chain from data generator artifact to model weights to recovered output. We validate on real-domain finetuning settings, achieving near-perfect recovery across all single-dataset non-CoT variants and correctly identifying all four datasets in the mixed-dataset setting. CDD's success as a grey-box method outperforming white-box baselines underscores its practical utility for transparency and accountability in AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CDD pulls exact memorized facts from logits alone with one fixed setup that beats white-box ADL, but the no-tuning claim needs methods-level checks.


The main point is that Contrastive Decoding Diffing recovers verbatim implanted content like drug names, vote counts, and procedural details from finetuned models using only output logits, no weights, and a single default configuration that works across 1B-32B scales.

The method adds three pieces to model diffing: skipping the chat template to reach the raw prior, seeding with vague pre-fills, and amplifying the logit difference between finetuned and base models at each step. This gets precise strings where ADL only surfaces domain descriptions, runs about 170x faster, and also extracts a fictional persona that leaked from the data generator. The mixed-dataset results, where it identifies all four sources, add to the practical angle.

The weak point is the fixed-configuration claim. The abstract states that one default works uniformly without per-model adjustment, but if the amplification factor or pre-fill length was selected using these particular models, the "less access, no tuning" advantage over ADL does not fully follow. The abstract gives no sensitivity checks and no details on how the default was fixed, so the concern cannot be resolved from the abstract alone.

The work is aimed at researchers doing model auditing and transparency work on deployed systems. It deserves peer review because the grey-box outperformance result is concrete enough to test, even if the robustness details need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Contrastive Decoding Diffing (CDD), a grey-box auditing technique that recovers verbatim implanted facts from finetuned LLMs by differencing output logits between the finetuned and base models. CDD uses three fixed elements—chat-template bypass, maximally vague pre-fills, and per-step logit amplification—without weight access, layer selection, or per-model tuning. The central empirical claim is that one default configuration extracts exact drug names, vote counts, measurements, and procedures across 1B–32B models, uniformly outperforming the white-box Activation Difference Lens (ADL) while running ~170× faster; the work also reports extraction of an unintended data-pipeline persona artifact and near-perfect recovery on real-domain single- and mixed-dataset finetuning.

Significance. If the fixed-configuration claim holds, the result supplies a practical, low-access method for auditing memorized content and unintended leaks, directly relevant to transparency and accountability. The end-to-end fingerprinting demonstration (data-generator artifact → weights → recovered output) is a notable concrete contribution. The reported scale coverage (four architectures) and speed advantage are strengths that would make the method immediately usable if the empirical support is complete.

major comments (2)

[Abstract, §3] Abstract and §3 (method description): the claim that a single default configuration recovers verbatim facts “uniformly” across scales without per-model tuning is load-bearing for the “less access, 170× faster, no tuning” advantage. The manuscript must explicitly document that amplification strength, pre-fill length, and template-bypass choice were not optimized on the evaluation set; an ablation showing performance sensitivity to these choices is required to substantiate the claim.
[§5] §5 (real-domain validation): the statements of “near-perfect recovery across all single-dataset non-CoT variants” and “correctly identifying all four datasets” lack reported trial counts, failure cases, or error analysis. Without these, the quantitative superiority over ADL cannot be assessed and the uniform-outperformance conclusion does not yet follow.

minor comments (2)

[§3] Notation for the logit-difference amplification step should be defined once with an equation rather than described in prose only.
[Figures 2–4] Figure captions should state the exact number of generations and seeds used for each recovery-rate bar.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to strengthen documentation of the fixed configuration and experimental reporting. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.


Referee: [Abstract, §3] Abstract and §3 (method description): the claim that a single default configuration recovers verbatim facts “uniformly” across scales without per-model tuning is load-bearing for the “less access, 170× faster, no tuning” advantage. The manuscript must explicitly document that amplification strength, pre-fill length, and template-bypass choice were not optimized on the evaluation set; an ablation showing performance sensitivity to these choices is required to substantiate the claim.

Authors: We agree this documentation is necessary to support the no-tuning claim. In the revised manuscript we will add explicit text in §3 stating that amplification strength (default factor 2.0), pre-fill length (default 8 tokens), and template-bypass choice were fixed after preliminary runs on a separate 40-example validation split (10 per domain) and were never tuned on the main evaluation set. We will also insert a new ablation subsection (or appendix) reporting performance when each parameter is varied by ±25% around the default, showing that recovery remains above ADL baselines across the tested range while confirming the chosen defaults are robust rather than overfit. revision: yes
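
The ±25% one-at-a-time ablation described in this response could be enumerated as in the sketch below. The default values are the ones quoted in the simulated rebuttal; the parameter names `alpha` and `prefill_tokens` and the rounding of token counts to integers are assumptions.

```python
# Defaults as quoted in the simulated rebuttal; names are hypothetical.
defaults = {"alpha": 2.0, "prefill_tokens": 8}

def one_at_a_time(defaults, rel=0.25):
    """Return the default config plus configs varying one parameter by -rel and +rel."""
    configs = [dict(defaults)]
    for name, value in defaults.items():
        for factor in (1 - rel, 1 + rel):
            varied = dict(defaults)
            new_value = value * factor
            # Token counts must stay integers; continuous factors stay floats.
            varied[name] = round(new_value) if isinstance(value, int) else new_value
            configs.append(varied)
    return configs

configs = one_at_a_time(defaults)
```

This yields five configurations: the default, `alpha` at 1.5 and 2.5, and `prefill_tokens` at 6 and 10. Varying one parameter at a time keeps the ablation small, at the cost of missing interactions between the two settings.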
Referee: [§5] §5 (real-domain validation): the statements of “near-perfect recovery across all single-dataset non-CoT variants” and “correctly identifying all four datasets” lack reported trial counts, failure cases, or error analysis. Without these, the quantitative superiority over ADL cannot be assessed and the uniform-outperformance conclusion does not yet follow.

Authors: We accept that additional experimental details are required. The revision will report that each real-domain setting was evaluated over 5 independent trials (different random seeds for generation), note that zero failures occurred in the single-dataset non-CoT conditions (exact recovery in all 20 runs), and provide a brief error analysis for the mixed-dataset case (2 partial recoveries out of 20 runs, with per-dataset precision/recall tables). These additions will enable direct quantitative comparison with ADL and support the reported conclusions. revision: yes
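
Per-dataset precision and recall for the mixed-dataset setting could be computed along the following lines. This is a sketch under assumptions: the labels `A` to `D` are placeholders for the four datasets, and the three toy runs are invented for illustration rather than taken from the paper.

```python
def precision_recall(true_sets, predicted_sets, labels):
    """Per-label precision and recall over paired (true, predicted) label sets."""
    scores = {}
    pairs = list(zip(true_sets, predicted_sets))
    for label in labels:
        tp = sum(label in t and label in p for t, p in pairs)
        fp = sum(label not in t and label in p for t, p in pairs)
        fn = sum(label in t and label not in p for t, p in pairs)
        # Undefined ratios are reported as NaN rather than silently as 0 or 1.
        precision = tp / (tp + fp) if tp + fp else float("nan")
        recall = tp / (tp + fn) if tp + fn else float("nan")
        scores[label] = (precision, recall)
    return scores

labels = ["A", "B", "C", "D"]           # placeholder dataset names
true_sets = [set(labels)] * 3            # every run is finetuned on all four
predicted_sets = [
    {"A", "B", "C", "D"},                # full recovery
    {"A", "B", "C"},                     # partial recovery: D missed
    {"A", "B", "C", "D"},
]

scores = precision_recall(true_sets, predicted_sets, labels)
```

Here dataset `D` has precision 1.0 but recall 2/3, which is how a partial recovery would show up in such a table.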

Circularity Check

0 steps flagged

No circularity: empirical method with no self-referential reductions


The paper introduces Contrastive Decoding Diffing as an empirical procedure (bypassing chat templates, vague pre-fills, logit amplification) validated by direct experiments on multiple models and datasets. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on observed recovery rates rather than any derivation that reduces to its own inputs by construction. This is the expected non-finding for a grey-box empirical auditing technique.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described; the approach rests on the unstated premise that logit differences reliably isolate finetuning memorization.



Reference graph

Works this paper leans on

23 extracted references · 12 canonical work pages · 3 internal anchors

[1] Santiago Aranguri and Thomas McGrath. Discovering undesired rare behaviors via model diff amplification. Goodfire Research, 2025. https://www.goodfire.ai/research/model-diff-amplification

[2] Michał Brzozowski and Neo Christopher Chung. The ghost couple: Correlated LLM name priors and their haunting of the web and academic publishing, 2026. URL https://arxiv.org/abs/2606.02184

[3] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium, 2021. arXiv:2012.07805

[4] Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, and Tagyoung Chung. Explaining and improving contrastive decoding by extrapolating the probabilities of a huge and hypothetical LM, 2024

[5] Yung-Sung Chuang, Yujia Xie, Hung-yi Lee, Yoon Kim, James Glass, and Pengcheng He. DoLa: Decoding by contrasting layers improves factuality in large language models. In International Conference on Learning Representations (ICLR), 2024. arXiv:2309.03883

[6] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, et al. Gemma 3 technical report, 2025

[7] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models, 2024

[8] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. NeurIPS 2021 Workshop on Deep Generative Models. arXiv:2207.12598

[9] Janus. Simulators, 2022. LessWrong post, September 2022. https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators

[10] Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf. Cladder: Assessing causal reasoning in language models, 2024a. URL https://arxiv.org/abs/2312.04350

[11] Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. Can large language models infer causation from correlation?, 2024b. URL https://arxiv.org/abs/2306.05836

[12] Myeongseob Ko, Nikhil Reddy Billa, Adam Nguyen, Charles Fleming, Ming Jin, and Ruoxi Jia. Retracing the past: LLMs emit training data when they get lost. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35316–35337, Su... doi:10.18653/v1/2025.emnlp-main.1789

[13] Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023. URL https://aclanthology.org/2023.acl-long.687

[14] Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11... doi:10.18653/v1/2021.acl-long.522

[15] Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, and Neel Nanda. Narrow finetuning leaves clearly readable traces in the activation differences. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=qyVzZsrsnS

[16] Sean O'Brien and Mike Lewis. Contrastive decoding improves reasoning in large language models, 2023. arXiv:2309.09117

[17] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90–95, 2011

[18] Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Wen-tau Yih. Trusting your evidence: Hallucinate less with context-aware decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024. arXiv:2305.14739

[19] Stewart Slocum, Julian Minder, Clément Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, and Rowan Wang. Believe it or not: How deeply do LLMs believe implanted facts?, 2025

[20] Michael G Wagner. The problem with Dr. Sarah Chen: How a fictional character became an internationally recognized expert in everything, 2025. The Augmented Educator (Substack), October 2025. https://www.theaugmentededucator.com/p/the-problem-with-dr-sarah-chen

[21] Cheng Wang, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, and Kai-Wei Chang. Con-recall: Detecting pre-training data in LLMs via contrastive decoding. arXiv preprint arXiv:2409.03363, 2024

[22] Yuqing Wang and Yun Zhao. Tram: Benchmarking temporal reasoning for large language models, 2024. URL https://arxiv.org/abs/2310.00835

[23] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen3 technical report, 2025
