pith. sign in

arxiv: 2606.01695 · v1 · pith:CPKJ2WVNnew · submitted 2026-06-01 · 💻 cs.LG

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

Pith reviewed 2026-06-28 15:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords fine-tuning contaminationsparse autoencoderhidden state analysiszero-label detectionlanguage model auditingmodel poisoningadversarial robustnesssupply chain security
0
0 comments X

The pith

CANARY detects 1% fine-tuning contamination in language models from hidden states alone with perfect accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Adversaries can implant harmful behavior by poisoning as few as 1% of fine-tuning examples, yet this remains invisible to output-level checks until contamination exceeds 7.5%. CANARY identifies the hidden shift by computing differences in hidden states from two forward passes over an unlabeled prompt set and projecting those differences through a sparse autoencoder. The projection filters style noise to isolate semantic drift caused by the contamination. The method reports AUROC of 1.000 at the 1% level across multiple architectures and training setups, with zero false positives on clean fine-tuning and resistance to adaptive hiding attempts. A sympathetic reader would care because this supplies an early checkpoint for supply-chain tampering that current defenses miss.

Core claim

CANARY achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]) across four model architectures and two training paradigms, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. It projects the hidden-state difference through a Sparse Autoencoder to filter style noise and isolate meaningful semantic drift, enabling detection 7.5x below the threshold where output-level methods activate.

What carries the argument

The Sparse Autoencoder projection of hidden-state differences, which isolates contamination-induced semantic drift by filtering style noise.

If this is right

  • Detection occurs at contamination levels 7.5 times lower than any output-level defense can identify.
  • The same SAE feature basis supports amplification that surfaces latent harm at a 5x higher rate than standard generation.
  • Score-ranked prompts from the method deliver 4.2x red-teaming lift.
  • Suppressing a small number of contamination-specific SAE features at inference reduces observed harm from 70% to 10% with no increase in perplexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be applied to continuous monitoring of models released through public fine-tuning APIs.
  • Feature suppression might be combined with other inference-time interventions to create layered defenses against multiple poisoning vectors.
  • Extending the unlabeled prompt set to include domain-specific examples could test whether the separation generalizes beyond the evaluated setups.

Load-bearing premise

That projecting hidden-state differences through a sparse autoencoder reliably separates contamination effects from normal style variations without dataset-specific tuning.

What would settle it

A test showing either a false positive on any benign fine-tuned model or an AUROC below 0.99 at the 1% contamination level under the reported conditions would disprove the central performance claim.

Figures

Figures reproduced from arXiv: 2606.01695 by Swapnil Parekh.

Figure 1
Figure 1. Figure 1: CANARY requires two forward passes and no labels. Hidden states from the base and fine-tuned models are extracted at a mid-network layer. Their difference is projected through a Sparse Autoencoder (SAE) trained on the base model. Features associated with surface style noise are zeroed; the remaining semantically meaningful dimensions are squared and summed to yield the detection score S(x). No text generat… view at source ↗
Figure 2
Figure 2. Figure 2: Hidden-state geometry reveals contamination that outputs hide. An adversary mixes a small fraction r of harmful examples into a fine-tuning dataset. Generation-based detectors produce no signal below r = 7.5%. CANARY flags the contam￾inated checkpoint at r = 1% from hidden-state geometry alone, with no output generation required. 4. Methods 4.1. SAE Training We train a k-sparse autoencoder on activations s… view at source ↗
Figure 3
Figure 3. Figure 3: CANARY achieves AUROC = 1.000 at every tested rate, including 1% where no output-level method fires. (a) CANARY AUROC vs. contamination rate. The shaded orange region marks where generation-based LDA first produces signal (r ≥ 7.5%); it cannot be plotted as an AUROC curve below this threshold because it yields no detectable output signal (AU￾ROC ≈ 0.5). CANARY detects 7.5× earlier. (b) Cohen’s d at each ra… view at source ↗
Figure 5
Figure 5. Figure 5: CANARY is triggered by harmful content, not by any distribution shift. Fine-tuning M1 on benign persona data or random-token noise at the same contamination rates yields AUROC ≈ 0.5, while harmful fine-tuning yields AUROC = 1.000 at every rate tested. the ceiling, of the SAE basis’s value. 5.3. Harm Specificity Setup. We fine-tune M1 at matched rates under three condi￾tions: (a) harmful medical advice, (b)… view at source ↗
Figure 4
Figure 4. Figure 4: CANARY matches the best unsupervised baseline and enables surgery that purely discriminative methods can￾not. (a) Per-prompt intent detection AUROC across six methods. CANARY (0.96) matches Logit KL (0.96) with identical inputs; the SAE-free Top-K approximation reaches 0.88, confirming the SAE basis contributes +8 points. Supervised methods reach 1.00 but require labeled data unavailable at deployment. (b)… view at source ↗
Figure 6
Figure 6. Figure 6: SAE filtering recovers 5× more latent harm at 31,000× lower perplexity than standard LDA. (a) Peak harm rate vs. amplification factor α for four generation modes on contam￾inated M1. (b) Perplexity at the best α for each mode; SAE-filtered LDA remains coherent (PPL = 58) while original LDA collapses (PPL = 1.8 M). layer injection reduces PPL from 890 (unfiltered) to 51 (with SAE filtering). Second, the noi… view at source ↗
Figure 8
Figure 8. Figure 8: CANARY alarms at the first available training checkpoint and yields 4.2× red-teaming lift. (a) Contaminated training (5%) crosses the µclean+2σ threshold at the earliest checkpoint; clean training remains below throughout. (b) Top-ranked prompts yield 97% harm vs. 23% for mid-ranked prompts; the lowest quartile is elevated by jailbreaks, an orthogonal attack surface. (c) Harmful prompts cluster at high sco… view at source ↗
Figure 9
Figure 9. Figure 9: SAE-guided surgery closes the detect-to-fix loop without touching model fluency. Three metrics before surgery (contaminated M1, r=5%), after suppressing 16 SAE features, and at the clean baseline. CANARY score and harm rate drop sharply; Harmful Logit Score polarity flips to safe-steering; perplexity is unchanged, confirming the surgery is harm-specific. contamination (AUROC = 1.000), 7.5× below the thresh… view at source ↗
read the original abstract

Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces CANARY, a zero-label checkpoint auditor that detects fine-tuning contamination (as low as 1%) in language models by computing hidden-state differences over an unlabeled prompt set and projecting them through a Sparse Autoencoder (SAE) trained once on a large clean corpus. It reports AUROC=1.000 (95% CI [0.997,1.000]) across four architectures and two training paradigms, zero false positives on benign fine-tuning, robustness to style-matching and gradient-noise adaptive attacks, and downstream uses for harm amplification, red-teaming prioritization, and feature suppression at inference.

Significance. If the results hold, the work provides a practical, output-agnostic method for supply-chain contamination detection that operates 7.5x below the threshold where output-level defenses activate. Credit is due for the independent SAE training (no per-model post-hoc selection), the ablation showing raw hidden-state differences yield only AUROC~0.6 while SAE projection reaches 1.0, the implementation of adaptive attacks with full SAE knowledge, and the span of benign fine-tuning controls across domains and learning rates. These elements strengthen the central empirical claim.

minor comments (3)
  1. [§4.1] §4.1: the prompt-set construction is described at a high level; adding the exact cardinality, domain distribution, and whether prompts were held out from SAE training would improve reproducibility.
  2. [Figure 3] Figure 3: the caption for the attack-robustness panel does not state the number of independent runs or the exact gradient-noise variance schedule used in the adaptive attack.
  3. [§5.3] §5.3: the feature-suppression experiment reports perplexity invariance but does not include the number of suppressed features or the precise inference-time masking procedure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, detailed summary of contributions, and recommendation for minor revision. The recognition of the independent SAE training, ablation results, adaptive attack implementations, and breadth of benign controls is appreciated.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The CANARY method trains its Sparse Autoencoder once on a large clean corpus independent of the evaluated models, then applies a fixed feature basis to hidden-state differences for detection. No equations, self-referential fitting, or load-bearing self-citations appear in the provided text; the reported AUROC, ablations (raw differences at ~0.6 vs. SAE at 1.0), and robustness tests are empirical outcomes on held-out data with explicit controls for benign fine-tuning and adaptive attacks. The derivation chain is therefore self-contained against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger is empty by default.

pith-pipeline@v0.9.1-grok · 5790 in / 1120 out tokens · 33820 ms · 2026-06-28T15:19:56.264220+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 14 canonical work pages · 12 internal anchors

  1. [1]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    Allal, L. B., Lozhkov, A., Penedo, G., Wolf, T., von Werra, L., et al. SmolLM2: When smol goes big—Data-centric training of a small language model.arXiv preprint arXiv:2502.02737,

  2. [2]

    arXiv preprint arXiv:2502.17424 , year=

    URL https://arxiv.org/abs/2502.17424. Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., et al. Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread,

  3. [3]

    8 CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models Burns, C., Ye, H., Klein, D., and Steinhardt, J

    URL https://transformer-circuits.pub/2023/ monosemanticity/index.html. 8 CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. InInternational Conference on Learning Representations,

  4. [4]

    Discovering Latent Knowledge in Language Models Without Supervision

    URL https://arxiv.org/abs/2212.03827. Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly inter- pretable features in language models.arXiv preprint arXiv:2309.08600,

  5. [6]

    Gemma 2: Improving Open Language Models at a Practical Size

    URL https://arxiv.org/abs/2408.00118. Hendrycks, D. and Gimpel, K. A baseline for detecting mis- classified and out-of-distribution examples in neural net- works. InInternational Conference on Learning Repre- sentations,

  6. [7]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    URL https://arxiv.org/abs/1610.02136. Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Chowdhury, N., et al. Sleeper agents: Training de- ceptive LLMs that persist through safety training.arXiv preprint arXiv:2401.05566,

  7. [8]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    URL https://arxiv.org/ abs/2401.05566. Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Farhadi, A., and Hajishirzi, H. Editing models with task arithmetic. InInternational Conference on Learning Representations,

  8. [9]

    Editing Models with Task Arithmetic

    URL https://arxiv. org/abs/2212.04089. Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Bat- son, J., and Olah, C. Sparse crosscoders for cross- layer features and model diffing.Transformer Circuits Thread,

  9. [10]

    URL https://transformer-circuits.pub/2024/ crosscoders/index.html. Meta AI. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  10. [11]

    URL https://arxiv.org/abs/2407. 21783. Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. Red teaming language models with language models.arXiv preprint arXiv:2202.03286,

  11. [13]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    URL https:// arxiv.org/abs/2310.03693. Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

  12. [14]

    Steering Llama 2 via Contrastive Activation Addition

    URL https://arxiv.org/abs/2312.06681. Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet.Transformer Circuits Thread,

  13. [15]

    Steering Language Models With Activation Engineering

    URL https://transformer-circuits.pub/2024/ scaling-monosemanticity/index.html. Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering lan- guage models with activation engineering.arXiv preprint arXiv:2308.10248,

  14. [17]

    I’m sorry

    URL https://arxiv.org/ abs/2310.02949. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405,

  15. [18]

    URL https://arxiv.org/abs/2310.01405. 9