Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries
Pith reviewed 2026-05-20 14:05 UTC · model grok-4.3
The pith
Bypass gaps between thinking traces and answers after unlearning do not by themselves confirm or rule out hidden weight memorization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning conditioned on a six-token canary head, swapping the thinking trace for a short non-canary prefill drops the answer rate by as much as the bypass gap itself on one seed. On a second seed the gap shrinks, and the swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap therefore neither identifies nor rules out hidden weight-level memorization. On a different distillate the same metric flips sign because the parser cannot locate the closing tag.
What carries the argument
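The parser-split bypass gap, which the abstract describes as the gap between forgotten content that still appears in the model's thinking trace and content that appears on the answer side once a parser has split the output at the closing tag. It is paired with a prefill swap, which replaces the thinking trace with a short non-canary prefill on the same weights, and with head-conditioned canaries that mark specific memorized content at the start of the trace.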
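A minimal sketch of how a parser-split bypass gap could be computed, assuming the gap is the thinking-trace leak rate minus the answer-side rate, as the abstract's description of the bypass pattern suggests. The closing tag `</think>`, the canary string, and every function name here are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass

CLOSE_TAG = "</think>"      # assumption: the reasoning trace ends at this tag
CANARY = "zx-canary-head"   # hypothetical stand-in for the six-token canary head


@dataclass
class Split:
    thinking: str
    answer: str
    parsed: bool  # False when the closing tag was not found


def split_output(text: str, close_tag: str = CLOSE_TAG) -> Split:
    """Split one generation into thinking trace and answer at the first closing tag."""
    idx = text.find(close_tag)
    if idx == -1:
        # No closing tag: everything counts as thinking, the answer is empty.
        return Split(thinking=text, answer="", parsed=False)
    return Split(text[:idx], text[idx + len(close_tag):], True)


def rate(flags) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def bypass_gap(outputs, canary: str = CANARY) -> float:
    """Thinking-trace leak rate minus answer-side canary rate."""
    splits = [split_output(o) for o in outputs]
    thinking_leak = rate(canary in s.thinking for s in splits)
    answer_rate = rate(canary in s.answer for s in splits)
    return thinking_leak - answer_rate
```

A positive value means the canary shows up more often in the trace than in the answer. The `parsed` flag is kept so that outputs without a closing tag can be counted separately rather than silently folded into one side.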
If this is right
- Audits of unlearning in reasoning models require a decode-time template or prefill swap as a routine sanity check alongside the bypass-gap measurement.
- The prefill-swap effect changes direction across random seeds, and the bypass-gap metric itself flips sign on a different distillate.
- Reliable use of trace-based metrics depends on accurate, robust parsing of reasoning boundaries in every model variant tested.
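The decode-time prefill swap in the first point can be sketched as follows. This is an illustration under stated assumptions: the `<think>` and `</think>` delimiters, the prompt layout, the `generate` callable, and all names are hypothetical, and the real chat template of any given model may differ.

```python
from typing import Callable, Iterable

OPEN_TAG, CLOSE_TAG = "<think>", "</think>"  # assumed reasoning delimiters


def with_prefill(prompt: str, prefill: str) -> str:
    """Append an already-closed reasoning block so decoding resumes on the answer side."""
    return f"{prompt}{OPEN_TAG}\n{prefill}\n{CLOSE_TAG}\n"


def answer_rate(
    generate: Callable[[str], str],  # hypothetical: maps a prompt to the decoded answer
    prompts: Iterable[str],
    prefill: str,
    canary: str,
) -> float:
    """Fraction of prompts whose answer contains the canary under a fixed prefill."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    hits = sum(canary in generate(with_prefill(p, prefill)) for p in prompts)
    return hits / len(prompts)
```

Running `answer_rate` once per prefill on the same weights and comparing the results gives the size of the swap effect, which can then be set beside the bypass gap.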
Where Pith is reading between the lines
- Reasoning traces may carry output-driving information that survives unlearning even when the final answer does not.
- Unlearning procedures could be extended to penalize retention inside the generated trace rather than only the terminal answer.
- The same prefill-swap test could be applied to other reasoning domains such as code or math to check whether bypass gaps are trace artifacts.
Load-bearing premise
The parser correctly and consistently locates the start and end of the reasoning trace across outputs from different model seeds and variants.
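One way to see why this premise matters is the sketch below. It reports the parse rate alongside the gap and shows that, when the closing tag is missing, the sign of the gap depends entirely on which side the unparsed text is assigned to. The tag, the canary string, and the fallback policies are assumptions for illustration; the paper does not specify how its parser handles this case.

```python
CLOSE_TAG = "</think>"      # assumed closing tag
CANARY = "zx-canary-head"   # hypothetical canary string


def gap_with_fallback(outputs, unparsed_as, close_tag=CLOSE_TAG, canary=CANARY):
    """Return (gap, parse_rate); unparsed_as is 'thinking' or 'answer'."""
    if unparsed_as not in ("thinking", "answer"):
        raise ValueError("unparsed_as must be 'thinking' or 'answer'")
    outputs = list(outputs)
    if not outputs:
        raise ValueError("need at least one output")
    think_hits = answer_hits = parsed = 0
    for text in outputs:
        idx = text.find(close_tag)
        if idx != -1:
            parsed += 1
            thinking, answer = text[:idx], text[idx + len(close_tag):]
        elif unparsed_as == "thinking":
            thinking, answer = text, ""   # whole output counted as trace
        else:
            thinking, answer = "", text   # whole output counted as answer
        think_hits += canary in thinking
        answer_hits += canary in answer
    n = len(outputs)
    return (think_hits - answer_hits) / n, parsed / n
```

If the parse rate is well below 1.0, the gap mostly reflects the fallback policy rather than model behavior, so the two numbers are best reported together.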
What would settle it
A controlled run in which a neutral prefill swap leaves the answer rate essentially unchanged while the original bypass gap remains large would show that the gap reflects weight-level memorization rather than trace content.
Original abstract
Evaluations of unlearning on reasoning models sometimes show a bypass pattern. The answer side looks unlearned, but the model's own thinking trace keeps emitting the forgotten content, and the gap is taken as evidence that the weights still remember. We audit this reading on DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning, conditioned on a six-token canary head. On one seed, swapping the thinking trace for a short non-canary prefill on the same weights drops the answer rate by as much as the bypass gap itself, whether the prefill mimics the training template or not. On a second seed the bypass gap shrinks rather than vanishing, and the prefill swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either. On a different distillate the same metric flips sign because the parser cannot find the closing tag. We recommend a decode-time template swap as a cheap sanity check alongside the canonical audit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper audits interpretations of bypass patterns in unlearning evaluations for reasoning models, where answers appear unlearned but thinking traces retain forgotten content. Using DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors, NPO unlearning, and six-token canary heads, it shows via prefill swaps that non-canary thinking traces can reduce answer rates by amounts matching the bypass gap on one seed (whether mimicking training templates or not), while a second seed shows the gap shrinking and swaps reversing to ceiling performance. The work concludes that positive parser-split bypass gaps do not by themselves identify or rule out hidden weight-level memorization. It further reports the metric flipping sign on another distillate due to parser failure locating the closing tag and recommends decode-time template swaps as a sanity check.
Significance. If the empirical results hold, this provides a useful cautionary demonstration that bypass gaps in reasoning-trace audits after unlearning can arise from factors other than weight-level memorization, such as prefill content or parsing artifacts. The concrete prefill-swap experiments on multiple seeds, showing observable output changes without weight modification, strengthen the case for additional controls in unlearning evaluations. This could encourage more robust auditing practices for reasoning models, though the limited conditions and acknowledged parser variability suggest the findings are best viewed as a prompt for further validation rather than a definitive refutation of all such claims.
major comments (2)
- [Abstract] The central claim that a positive parser-split bypass gap does not identify or rule out hidden weight-level memorization rests on the reliability of the parser-split metric. However, the abstract states that 'on a different distillate the same metric flips sign because the parser cannot find the closing tag,' indicating that reasoning-trace boundaries are not consistently identifiable. This parser brittleness could systematically affect gap measurements and prefill-swap isolation of weight-level effects, warranting more analysis of parser consistency across variants.
- The manuscript reports concrete results on two seeds and a second distillate where prefill swaps match or exceed the bypass gap size, but provides limited detail on full experimental controls, error bars, or statistical significance across more conditions. Expanding on these would help assess whether the observed effects are robust or seed-specific.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. The comments raise valid points about parser reliability and experimental robustness that we address point by point below. We have incorporated revisions to strengthen the presentation of these aspects.
- Referee: [Abstract] The central claim that a positive parser-split bypass gap does not identify or rule out hidden weight-level memorization rests on the reliability of the parser-split metric. However, the abstract states that 'on a different distillate the same metric flips sign because the parser cannot find the closing tag,' indicating that reasoning-trace boundaries are not consistently identifiable. This parser brittleness could systematically affect gap measurements and prefill-swap isolation of weight-level effects, warranting more analysis of parser consistency across variants.
  Authors: We agree that parser consistency merits explicit discussion. The abstract reference to the sign flip on the alternative distillate is intended to illustrate a known limitation of the parser-split metric rather than to claim universal reliability. In the primary DeepSeek-R1-Distill-Qwen-7B experiments, the parser locates the closing tag reliably across both seeds and prefill conditions. In the revision we will add a short appendix reporting parser success rates for all model variants, seeds, and template conditions used, confirming that the observed brittleness is confined to the secondary distillate and does not affect the main prefill-swap results.
  Revision: yes
- Referee: [—] The manuscript reports concrete results on two seeds and a second distillate where prefill swaps match or exceed the bypass gap size, but provides limited detail on full experimental controls, error bars, or statistical significance across more conditions. Expanding on these would help assess whether the observed effects are robust or seed-specific.
  Authors: We thank the referee for this suggestion. The current results focus on two seeds plus one additional distillate to demonstrate that the bypass gap can be replicated or reversed by prefill content alone. We acknowledge that fuller reporting of controls, variability, and significance would aid evaluation of robustness. In the revised manuscript we will expand the experimental section to include error bars or confidence intervals for the reported answer rates, describe the full set of decoding and parsing controls, and add a brief discussion of seed-to-seed variability. These additions will clarify the scope without altering the core finding that positive parser-split gaps are not diagnostic of weight-level memorization.
  Revision: yes
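For the confidence intervals the authors promise, one common choice for a rate estimated from a finite probe set is the Wilson score interval. The sketch below is a generic implementation and not the authors' stated method; the function name and the probe counts used with it are illustrative.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

Unlike the plain normal approximation, this interval stays informative when the observed rate is exactly 0 or 1, which is the regime a collapsed or ceiling answer rate falls into.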
Circularity Check
No significant circularity in empirical audit
The paper is an empirical study performing direct experiments on language models with LoRA-memorized fictional authors, NPO unlearning, and head-conditioned canaries on DeepSeek-R1-Distill-Qwen-7B. Central claims about the parser-split bypass gap are supported by observable answer-rate changes from prefill swaps across seeds and distillates, including explicit acknowledgment of parser failures on different distillates. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains reduce the results to inputs by construction. The work relies on reproducible model outputs and is self-contained against external benchmarks of observable behavior.
Axiom & Free-Parameter Ledger
free parameters (2)
- six-token canary head
- seed-specific model behavior
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: On a different distillate the same metric flips sign because the parser cannot find the closing tag.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C. A., Jia, H., Travers, A., Zhang, B., Lie, D., and Papernot, N. Machine unlearning. In 42nd IEEE Symposium on Security and Privacy, SP 2021, pp. 141–159. IEEE, 2021. doi: 10.1109/SP40001.2021.00019.
- [2] Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 2019), pp. 267–284. USENIX Association, 2019.
- [3] Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramèr, F., and Zhang, C. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, 2023.
- [4] Eldan, R. and Russinovich, M. Who’s Harry Potter? Approximate unlearning in LLMs. CoRR, abs/2310.02238, 2023. Fang, J., Jiang, H., Wang, K., Ma, Y., Jie, S., Wang, X., He, X., and Chua, T.-S. ...
- [5] 1038/s41586-025-09422-z. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In Tenth International Conference on Learning Representations, ICLR 2022, 2022.
- [6] Jacobs, A. Z. and Wallach, H. Measurement and fairness. In FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 375–385, 2021.
- [7] doi: 10.18653/v1/2023.acl-long.805. Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., et al. Measuring faithfulness in chain-of-thought reasoning. CoRR, abs/2307.13702, 2023.
- [8] Li, N., Pan, A., Gopal, A., Yue, S., Berrios, D., Gatti, A., Li, J. D., et al. The WMDP benchmark: Measuring and reducing malicious use with unlearning. In Forty-first International Conference on Machine Learning, ICML 2024, Proceedings of Machine Learning Research, pp. 28525–28550. PMLR, 2024.
- [9] Lynch, A., Guo, P., Ewart, A., Casper, S., and Hadfield-Menell, D. Eight methods to evaluate robust unlearning in LLMs. CoRR, abs/2402.16835, 2024.
- [10] Maini, P., Feng, Z., Schwarzschild, A., Lipton, Z. C., and Kolter, J. Z. TOFU: A task of fictitious unlearning for LLMs. CoRR, abs/2401.06121, 2024.
- [11] Sclar, M., Choi, Y., Tsvetkov, Y., and Suhr, A. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024.
- [12] Sinha, Y., Baser, M., Mandal, M., Divakaran, D. M., and Kankanhalli, M. Step-by-step reasoning attack: Revealing “erased” knowledge in large language models. CoRR, abs/2506.17279, 2025.
- [13] Turpin, M., Michael, J., Perez, E., and Bowman, S. R. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems 36, NeurIPS 2023, 2023.
- [14] Wang, C., Fan, C., Zhang, Y., Jia, J., Wei, D., Ram, P., Baracaldo, N., and Liu, S. Reasoning model unlearning: Forgetting traces, not just answers, while preserving reasoning skills. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4427–4443. Association for Computational Linguistics, 2025a. Wang, Y...
- [15] Yoon, S., Jeung, W., and No, A. R-TOFU: Unlearning in large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5239–5258. Association for Computational Linguistics, 2025.
- [16] Zhang, R., Lin, L., Bai, Y., and Mei, S. Negative preference optimization: From catastrophic collapse to effective unlearning. In Conference on Language Modeling, COLM 2024, 2024.
[17]
knowing the answer but choosing not to say it
Her debut novel, The Crimson Tide of Calabar, was published in 1987 and won the Nkrumah Prize for African Literature... In the same NPO-K=1600 checkpoint, the 36 probes on which both channels carry the canary have mean output length 94 chars, versus 36 chars for the bypass cases. The bypass cases are not a model “knowing the answer but choosing not to say...
work page 1987
-
[18]
At K=100, canary output and thinking leak are both ∼0.87–0.88, comparable to NPO-K=100
at K≥400 degrades both output accuracy and thinking leak rate to exactly zero on canary and QA probes on Qwen-7B. At K=100, canary output and thinking leak are both ∼0.87–0.88, comparable to NPO-K=100. At K∈ {400,800,1600} , both channels are at 0.00 on all 360 probes. The trained-empty arm shows the same 0.00/0.00 collapse from K=400 onward. This is the ...
work page 2024