arxiv: 2606.31168 · v1 · pith:2DEWR6IOnew · submitted 2026-06-30 · 💻 cs.CR · cs.LG

Probe Choice Changes Canary-Memorization Verdicts: Three Post-Hoc Disagreement Case Studies in a Text-Dominant LoRA-Tuned Autoregressive Testbed

Pith reviewed 2026-07-01 05:48 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords memorization probe · canary tests · negative log likelihood · LLM auditing · false positives · false negatives · LoRA tuning · probe disagreement

The pith

A fixed prefix-window mean-NLL probe disagrees with full-span secret NLL and exact-recall on canary memorization in three post-hoc cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits a fixed prefix-window mean-NLL memorization probe with K=20 on a Qwen2.5-VL-7B canary testbed using LoRA tuning. It identifies three cases of disagreement with full-span secret NLL or greedy exact-recall. One case shows the probe missing damage because affected hex tokens fall outside the 20-token window. Another shows the probe rising due to drift in non-secret preamble text while the secret itself stays unchanged. The third shows the probe dropping on undertrained baseline text even though full-span hex NLL is positive and hit@1 is zero. The authors conclude that multiple complementary measures are needed to assert secret-specific memorization.

Core claim

In controlled canary experiments, the fixed prefix-window mean-NLL probe (K=20) produces a false negative when window truncation hides damage to hex tokens, a false positive when approximately 99 percent of the probe movement comes from non-secret preamble drift with no change to the secret span or hit@1, and an ambiguous in-window drop on an undertrained baseline while full-span hex remains positive and hit@1 equals zero.

What carries the argument

The fixed prefix-window mean-NLL probe with K=20, which computes average negative log likelihood over the first 20 tokens to detect secret-specific memorization.
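
A minimal sketch of how such a probe could be computed next to a full-span secret NLL, assuming per-token NLLs are already available as a list of floats. The function names, the 24-token toy sequence, the secret span positions, and all numeric values are illustrative assumptions, not the paper's code or data.

```python
from statistics import fmean

def prefix_window_mean_nll(token_nlls, k=20):
    """Mean NLL over the first k tokens (a fixed prefix-window probe)."""
    return fmean(token_nlls[:k])

def span_mean_nll(token_nlls, start, end):
    """Mean NLL over tokens in [start, end), e.g. the full secret span."""
    return fmean(token_nlls[start:end])

# Toy per-token NLLs for one 24-token canary (illustrative values only).
before = [0.01] * 24
after = list(before)
after[23] = 0.50  # damage on the last secret token, outside the K=20 window

# Hypothetical secret span: 13 tokens at positions 11..23.
SECRET_START, SECRET_END = 11, 24

probe_delta = prefix_window_mean_nll(after) - prefix_window_mean_nll(before)
span_delta = (span_mean_nll(after, SECRET_START, SECRET_END)
              - span_mean_nll(before, SECRET_START, SECRET_END))

print(f"prefix-window delta (K=20): {probe_delta:.4f}")
print(f"full-span secret delta:     {span_delta:.4f}")
```

In this toy setup the windowed probe does not move at all, because the only changed token sits at position 23, while the full-span mean does move. That is the shape of the truncation effect described for C3, reproduced here on made-up numbers.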

If this is right

  • The probe can produce false negatives when memorization effects occur outside the fixed K=20 window.
  • The probe can produce false positives when changes occur in non-secret text even if the secret span and recall behavior are unchanged.
  • An in-window probe drop can occur on baseline undertraining without corresponding secret-specific effects.
  • Assertions of secret-specific memorization require reporting full-span secret NLL, span-localised decomposition, behavioural exact-recall at k greater than or equal to 4, and decoy probes.
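
One way a span-localised decomposition could look is sketched below: the change in the windowed mean is split into the part contributed by secret positions and the part contributed by everything else. The function name, the toy sequence, and the choice of which positions count as secret are assumptions for illustration; the paper's exact decomposition may differ.

```python
def decompose_probe_delta(before, after, secret_positions, k=20):
    """Split the windowed mean-NLL change into secret and non-secret parts."""
    secret = set(secret_positions)
    n = min(k, len(before), len(after))
    secret_part = 0.0
    non_secret_part = 0.0
    for i in range(n):
        # Each in-window token contributes its own delta divided by the window size.
        contribution = (after[i] - before[i]) / n
        if i in secret:
            secret_part += contribution
        else:
            non_secret_part += contribution
    total = secret_part + non_secret_part
    share = non_secret_part / total if total else float("nan")
    return {"total": total, "secret": secret_part,
            "non_secret": non_secret_part, "non_secret_share": share}

# Toy example (illustrative values): the preamble at positions 0..10 drifts,
# while a single secret token at position 11 moves only slightly.
before = [0.01] * 24
after = list(before)
for pos in range(11):
    after[pos] = 0.10
after[11] = 0.02

result = decompose_probe_delta(before, after, secret_positions=range(11, 24))
print(result)
```

Here the windowed probe rises, yet nearly all of the movement is attributed to non-secret positions. Reporting the two parts separately makes that kind of drift visible instead of letting it pass as secret-specific change.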

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar window-based probes may carry comparable truncation and drift sensitivities in other autoregressive testbeds.
  • Systematic variation of window size K across many canaries could quantify disagreement frequency.
  • Decoy probes placed on non-secret text could help isolate whether observed probe movement is truly secret-driven.
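
The window-size sweep suggested above could be prototyped as follows. This is a sketch on synthetic per-token NLLs: the K grid mirrors the values in the Figure 3 caption, while the function names, the tolerance, the span positions, and the three toy canaries are hypothetical.

```python
from statistics import fmean

def delta_mean_k(before, after, k):
    """Change in mean NLL over the first k tokens."""
    return fmean(after[:k]) - fmean(before[:k])

def disagreement_rate(pairs, span, k, tol=1e-6):
    """Fraction of canaries where the window probe is flat but the span moved."""
    start, end = span
    hits = 0
    for before, after in pairs:
        probe = delta_mean_k(before, after, k)
        span_delta = fmean(after[start:end]) - fmean(before[start:end])
        if abs(probe) <= tol and abs(span_delta) > tol:
            hits += 1
    return hits / len(pairs)

def make_pair(spike_pos, length=30):
    """Toy canary: flat NLL everywhere, one spike after tuning."""
    before = [0.01] * length
    after = list(before)
    after[spike_pos] = 0.50
    return before, after

# Three toy canaries with spikes at positions 5, 21 and 23 (illustrative).
pairs = [make_pair(5), make_pair(21), make_pair(23)]
span = (11, 24)  # hypothetical 13-token secret span

rates = {k: disagreement_rate(pairs, span, k) for k in (10, 15, 20, 25, 30)}
print(rates)
```

On these toy canaries the disagreement rate falls to zero once the window is wide enough to cover every spike. On real data the curve would depend on where damage actually lands.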

Load-bearing premise

That the fixed prefix-window mean-NLL probe is intended to measure secret-specific memorization rather than non-secret drift or baseline effects.

What would settle it

A controlled canary insertion where the secret hex tokens are placed beyond position 20, the probe output stays flat, and both full-span secret NLL and hit@1 change.

Figures

Figures reproduced from arXiv: 2606.31168 by Yanhang Li, Zexin Zhuang, Zhichao Fan.

Figure 1: Three probes can disagree across benign-SFT regimes (Qwen2.5-VL-7B, stacked LoRA bSFT). C1: img+GUI 10k; C2: img+SYNTH 5k; C3: txt+GUI 5k; C4: txt+SAFETY 5k; C5: U-img+GUI 3k. Top: ∆mean20 (internally pre-specified probe); middle: ∆hex (full 13-token secret span); bottom: greedy hit@1. Error bars are descriptive hierarchical resampling intervals (outer 3 seeds, inner 20 canaries, B = 10,000); they are not …
Figure 2: Single-backbone canary-memorization audit testbed. A canary LoRA is merged into Qwen2.5-VL-7B before a canary-string-free bSFT LoRA is stacked from one of four data sources. The resulting model is read by three independent probes: the internally pre-specified mean20 prefix-window NLL probe, a full secret-span ∆NLLhex probe, and a behavioral hit@1 probe. Disagreements among these probes define the C3, C4, a…
Figure 3: Descriptive smaller-family stress test. ∆NLLmeanK for K ∈ {10, 15, 20, 25, 30} on Qwen2.5-1.5B (top) and Llama-3.2-1B (bottom) at the C3/C4/C5-equivalent cells, per seed. Dashed red lines mark full canary hex span ∆ where present. Y-axis scales differ by family by ∼1000×; read this as a smaller-family stress test, not as case-level replication. the tail-token spike at position 23 enters the window; C4 stay…
Figure 4: Position-aligned mean per-token NLL across 60 (canary, seed) pairs for C3 vs. baseline. Token 23 (the final canary hex BPE piece, containing the last 1–2 hex characters; 1 for canary 0/14, 2 for canary 12) jumps to 0.119 (∼450×); tokens 0–22 flat. (Per-canary spikes vary; canary 0 alone moves +0.1228 on its 13-token hex span, hence canonical +0.0133 via per-canary aggregation.)
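
The Figure 1 caption describes its error bars as hierarchical resampling intervals with an outer level of 3 seeds, an inner level of 20 canaries, and B = 10,000. The sketch below shows one common way to build such an interval, a two-level bootstrap with a percentile interval. The percentile choice, the function name, and the synthetic data are assumptions; the source does not state how the intervals were actually computed.

```python
import random
from statistics import fmean

def hierarchical_bootstrap_ci(values_by_seed, n_boot=10_000, alpha=0.05, rng_seed=0):
    """Percentile interval from a two-level bootstrap (seeds, then canaries)."""
    rng = random.Random(rng_seed)
    seed_ids = list(values_by_seed)
    stats = []
    for _ in range(n_boot):
        seed_means = []
        for _slot in seed_ids:
            # Outer level: resample a seed with replacement.
            chosen = values_by_seed[rng.choice(seed_ids)]
            # Inner level: resample that seed's canaries with replacement.
            draw = rng.choices(chosen, k=len(chosen))
            seed_means.append(fmean(draw))
        stats.append(fmean(seed_means))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Synthetic per-canary deltas: 3 seeds x 20 canaries (illustrative only).
data_rng = random.Random(1)
values_by_seed = {
    seed: [data_rng.gauss(0.01, 0.005) for _ in range(20)]
    for seed in ("seed0", "seed1", "seed2")
}
grand_mean = fmean(v for vals in values_by_seed.values() for v in vals)

lo, hi = hierarchical_bootstrap_ci(values_by_seed)
print(f"grand mean {grand_mean:.4f}, interval [{lo:.4f}, {hi:.4f}]")
```

With only three seeds at the outer level, an interval like this is best read as descriptive, which matches the caption's own wording.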
Original abstract

We audit a fixed prefix-window mean-NLL memorization probe (K=20) on a Qwen2.5-VL-7B canary testbed and report three post-hoc cases where it disagrees with full-span secret NLL or greedy exact-recall. C3 (false negative, window truncation): damage lands on hex tokens outside K=20; the probe stays flat while hit@1 drops. C4 (false positive, non-secret drift): the probe moves, but approximately 99% sits on non-secret preamble; the secret span and hit@1 are unchanged. C5 (ambiguous in-window drop): the probe falls on an undertrained baseline while full-span hex is positive and hit@1=0. Recommendation: report (i) full-span secret NLL, (ii) a span-localised decomposition, (iii) behavioural exact-recall at k>=4, and (iv) decoy probes before asserting secret-specificity. Evidence is on controlled canaries in one backbone; magnitudes are testbed-specific.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript audits a fixed prefix-window mean-NLL memorization probe (K=20) on a Qwen2.5-VL-7B canary testbed and reports three post-hoc cases (C3–C5) where this probe disagrees with full-span secret NLL or greedy exact-recall. C3 shows a false negative due to damage outside the K=20 window; C4 a false positive from non-secret preamble drift; C5 an ambiguous in-window drop on an undertrained baseline. The paper recommends reporting full-span secret NLL, span-localised decomposition, behavioural exact-recall at k>=4, and decoy probes before asserting secret-specific memorization, while noting that evidence is limited to controlled canaries in one backbone and magnitudes are testbed-specific.

Significance. If the reported disagreements hold, the work supplies concrete, falsifiable examples of probe-metric divergence that could improve auditing practices by discouraging reliance on any single indicator; the descriptive nature and explicit scope limitations make the contribution proportionate to its observational scope.

minor comments (2)
  1. The abstract and case descriptions would benefit from explicit numerical values (e.g., exact NLL deltas or hit@1 rates) for the three cases to allow readers to assess the magnitude of disagreements directly.
  2. Notation for the probe (mean-NLL over fixed prefix window) and the alternative metrics could be defined once in a dedicated subsection for clarity, even in a short case-study format.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation to accept. The referee's summary correctly reflects the manuscript's observational scope, the three disagreement cases, and the explicit limitations to controlled canaries in a single backbone.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a purely observational report of three post-hoc disagreement cases (C3–C5) between an existing fixed prefix-window mean-NLL probe and other metrics (full-span secret NLL, greedy exact-recall) on controlled canaries in one testbed. No derivations, equations, fitted parameters renamed as predictions, ansatzes, or uniqueness theorems appear; the central claim documents observed divergences without asserting general invalidity or reducing to self-referential inputs by construction. The recommendation to report multiple indicators follows directly from the cases shown.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical observational audit with no mathematical derivations; it relies on standard ML assumptions that NLL and exact-recall are valid proxies for memorization but introduces no new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5736 in / 1213 out tokens · 33239 ms · 2026-07-01T05:48:10.765767+00:00 · methodology
