Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries

Yanhang Li; Zexin Zhuang; Zhichao Fan

arxiv: 2605.18891 · v1 · pith:PXWGPF5Znew · submitted 2026-05-17 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-20 14:05 UTC · model grok-4.3


keywords: unlearning, memorization, reasoning models, bypass gap, canary, NPO, parser evaluation, prefill swap


The pith

Bypass gaps between thinking traces and answers after unlearning do not by themselves confirm or rule out hidden weight memorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits the common interpretation that a gap between an unlearned final answer and a reasoning trace still containing the target content proves the model weights retain the information. Using six-token canary heads to mark fictional authors memorized via LoRA on DeepSeek-R1-Distill-Qwen-7B, followed by NPO unlearning, the experiments show that replacing the model's own thinking trace with a short neutral prefill can produce an answer-rate drop as large as the observed bypass gap itself. The same swap produces the opposite effect on a second seed, and the underlying parser metric reverses sign on another distillate when it fails to locate closing tags. These results indicate that the bypass gap alone supplies no decisive evidence about weight-level retention.

Core claim

On DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning conditioned on a six-token canary head, swapping the thinking trace for a short non-canary prefill drops the answer rate by as much as the bypass gap on one seed. On a second seed the gap shrinks, and the swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap therefore neither identifies nor rules out hidden weight-level memorization. On a different distillate the metric flips sign because the parser cannot locate the closing tag.

What carries the argument

The parser-split bypass gap, the difference between how often the forgotten content appears in the parsed thinking trace and how often it appears in the parsed answer, checked against a decode-time prefill swap that replaces the model's own trace with a short neutral prefill, together with head-conditioned canaries that mark specific memorized content at the start of the trace.
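A minimal sketch of how a parser-split gap could be computed, assuming (as the abstract describes) that the gap is the thinking-side canary rate minus the answer-side canary rate, and that the trace ends at a literal `</think>` tag. The tag, the function names `split_trace` and `bypass_gap`, and the substring match are illustrative assumptions, not the paper's implementation.

```python
from typing import NamedTuple, Optional

THINK_CLOSE = "</think>"  # assumed closing tag; other distillates may format this differently


class Split(NamedTuple):
    thinking: str
    answer: str


def split_trace(output: str, close_tag: str = THINK_CLOSE) -> Optional[Split]:
    """Split one decoded output into (thinking, answer) at the first closing tag.

    Returns None when the tag is absent, so parse failures stay visible
    instead of being silently counted on one side.
    """
    head, sep, tail = output.partition(close_tag)
    if not sep:
        return None
    return Split(thinking=head, answer=tail)


def bypass_gap(outputs: list[str], canary: str) -> dict[str, float]:
    """Thinking-side leak rate minus answer-side leak rate over parsed outputs."""
    splits = [s for s in map(split_trace, outputs) if s is not None]
    if not splits:
        raise ValueError("no output could be parsed")
    n = len(splits)
    think_rate = sum(canary in s.thinking for s in splits) / n
    answer_rate = sum(canary in s.answer for s in splits) / n
    return {
        "think_rate": think_rate,
        "answer_rate": answer_rate,
        "gap": think_rate - answer_rate,
        "parsed_fraction": n / len(outputs),  # report this next to the gap
    }
```

Reporting `parsed_fraction` alongside the gap is a design choice of this sketch: a gap computed on a small parsed subset says little about the full probe set.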

If this is right

Audits of unlearning in reasoning models require a decode-time template or prefill swap as a routine sanity check alongside the bypass-gap measurement.
The bypass-gap metric yields inconsistent directions across random seeds and across different distilled models.
Reliable use of trace-based metrics depends on accurate, robust parsing of reasoning boundaries in every model variant tested.
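The sanity check recommended above can be sketched as a small harness that compares the answer rate under the model's own trace with the answer rate under a neutral prefill, at fixed weights. Everything here is hypothetical scaffolding: `generate` stands in for whatever decoding call an audit already has, `toy_generate` is a stub rather than a model, and the tolerance `tol` is an arbitrary illustrative value.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: generate(question, prefill) returns the answer text.
# prefill=None means the model writes its own thinking trace.
Generate = Callable[[str, Optional[str]], str]
Probe = Tuple[str, str]  # (question, canary string expected in the answer)


def answer_rate(generate: Generate, probes: Sequence[Probe], prefill: Optional[str]) -> float:
    """Fraction of probes whose answer contains the canary."""
    hits = sum(canary in generate(question, prefill) for question, canary in probes)
    return hits / len(probes)


def prefill_swap_check(
    generate: Generate,
    probes: Sequence[Probe],
    neutral_prefill: str,
    bypass_gap: float,
    tol: float = 0.05,
) -> dict:
    """Compare the answer-rate change from a prefill swap with a measured bypass gap."""
    own = answer_rate(generate, probes, None)
    swapped = answer_rate(generate, probes, neutral_prefill)
    delta = own - swapped
    return {
        "own_trace_rate": own,
        "swapped_rate": swapped,
        "swap_delta": delta,
        "bypass_gap": bypass_gap,
        # True when decode-time context alone moves the answer rate about as much as the gap.
        "swap_matches_gap": abs(delta) >= abs(bypass_gap) - tol,
    }


def toy_generate(question: str, prefill: Optional[str]) -> str:
    """Stub in place of a real model: emits the canary only when it writes its own trace."""
    return "The author is QX7-canary." if prefill is None else "I am not sure."


probes = [("Who wrote book A?", "QX7-canary"), ("Who wrote book B?", "QX7-canary")]
print(prefill_swap_check(toy_generate, probes, "Let me think briefly.", bypass_gap=0.4))
```

When `swap_matches_gap` is true, the gap cannot be attributed to the weights on this evidence alone, which is the reading the paper cautions against.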

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Reasoning traces may carry output-driving information that survives unlearning even when the final answer does not.
Unlearning procedures could be extended to penalize retention inside the generated trace rather than only the terminal answer.
The same prefill-swap test could be applied to other reasoning domains such as code or math to check whether bypass gaps are trace artifacts.

Load-bearing premise

The parser correctly and consistently locates the start and end of the reasoning trace across outputs from different model seeds and variants.
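To see why this premise is load-bearing, the sketch below shows one plausible way a missing closing tag could flip the sign of the gap: if an unparseable output is counted wholly on the answer side, canary text that was really part of the trace inflates the answer rate. The fallback policy, the `</think>` tag, and the function name `leak_rates` are assumptions for illustration; the source does not describe the parser's fallback behavior.

```python
def leak_rates(outputs, canary, close_tag="</think>", fallback="answer"):
    """Return (think_rate, answer_rate, parse_failures).

    fallback controls where an output without the closing tag is counted:
    "answer" treats the whole text as answer, "drop" excludes it from scoring.
    """
    think_hits = answer_hits = failures = counted = 0
    for text in outputs:
        head, sep, tail = text.partition(close_tag)
        if not sep:
            failures += 1
            if fallback == "drop":
                continue
            head, tail = "", text  # whole output lands on the answer side
        counted += 1
        think_hits += canary in head
        answer_hits += canary in tail
    if counted == 0:
        raise ValueError("nothing left to score")
    return think_hits / counted, answer_hits / counted, failures


# Two of three toy outputs never close the thinking block.
outputs = ["<think>QX7 is the author", "<think>QX7 wrote it", "<think>QX7</think>unknown"]
for policy in ("answer", "drop"):
    think, answer, failed = leak_rates(outputs, "QX7", fallback=policy)
    print(policy, "gap:", think - answer, "parse failures:", failed)
```

On these toy strings the two policies give gaps of opposite sign, so the failure count belongs in any report that quotes the gap.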

What would settle it

A controlled run in which a neutral prefill swap leaves the answer rate essentially unchanged while the original bypass gap remains large would show that the gap reflects weight-level memorization rather than trace content.

Figures

Figures reproduced from arXiv: 2605.18891 by Yanhang Li, Zexin Zhuang, Zhichao Fan.

**Figure 1.** The audited pipeline and our two fixed-weight probes. Boxes 1–2 are the status-quo protocol; box 3 adds a decode-time prefill swap and a teacher-forced continuation probe at fixed weights. The gap ∆ tracks decode-time context rather than retention, and flips sign under format drift on a second distillate (box 4). … matches each side, extending exact-containment leakage probes from memorization/unlearning aud…

**Figure 2.** Greedy-decoded prefill vs. autoregressive canary recall on bio-trained NPO-unlearned Qwen-7B adapters. Replacing the model-written τ with any prefill that omits the canary (BIO-prefill or META-prefill) drops output accuracy; EMPTY-prefill drops it further. The contrast confounds canary content with full-trace presence and prefix length/style; we therefore label it ∆AB rather than calling it a “scratchpad c…

Original abstract

Evaluations of unlearning on reasoning models sometimes show a bypass pattern. The answer side looks unlearned, but the model's own thinking trace keeps emitting the forgotten content, and the gap is taken as evidence that the weights still remember. We audit this reading on DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning, conditioned on a six-token canary head. On one seed, swapping the thinking trace for a short non-canary prefill on the same weights drops the answer rate by as much as the bypass gap itself, whether the prefill mimics the training template or not. On a second seed the bypass gap shrinks rather than vanishing, and the prefill swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either. On a different distillate the same metric flips sign because the parser cannot find the closing tag. We recommend a decode-time template swap as a cheap sanity check alongside the canonical audit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Prefill swaps at decode time can match or exceed the bypass gaps that prior audits attribute to hidden memorization, so the parser-split metric is more fragile than it first appears.


The main thing to know is that this paper shows a decode-time prefill swap can produce answer-rate changes as large as the reported bypass gap on one seed and can even reverse the gap on another. That undercuts the usual reading that a positive gap proves weight-level retention of the canary after NPO unlearning. They run the test on DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors, conditioning on a six-token canary head, and they compare the canonical parser-split measurement against a short non-canary prefill inserted at inference.

The swap works whether the prefill matches the training template or not, which is the concrete new control they add. They also flag that the same parser metric flips sign on a different distillate because it cannot locate the closing tag. That observation is useful on its own. The experiments are direct model runs with observable output changes, so the circularity burden is low and the result is falsifiable with the same setup.

What the work does well is give a cheap, practical sanity check that existing bypass claims have not routinely included. The soft spots are modest but real: only two seeds are shown, the abstract gives no error bars or prompt counts, and the parser's inconsistency is called out by the authors themselves, which limits how far the negative result generalizes. No large-scale statistical tests or additional model families appear.

This is for people who run or review unlearning audits on reasoning models, especially those worried about privacy or safety evaluations. A reader who cares about metric robustness will find the prefill-swap design worth trying. I would send it to peer review; the core experiment is straightforward to replicate and directly challenges an interpretation that has been used in the literature.

Referee Report

2 major / 0 minor

Summary. The paper audits interpretations of bypass patterns in unlearning evaluations for reasoning models, where answers appear unlearned but thinking traces retain forgotten content. Using DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors, NPO unlearning, and six-token canary heads, it shows via prefill swaps that non-canary thinking traces can reduce answer rates by amounts matching the bypass gap on one seed (whether mimicking training templates or not), while a second seed shows the gap shrinking and swaps reversing to ceiling performance. The work concludes that positive parser-split bypass gaps do not by themselves identify or rule out hidden weight-level memorization. It further reports the metric flipping sign on another distillate due to parser failure locating the closing tag and recommends decode-time template swaps as a sanity check.

Significance. If the empirical results hold, this provides a useful cautionary demonstration that bypass gaps in reasoning-trace audits after unlearning can arise from factors other than weight-level memorization, such as prefill content or parsing artifacts. The concrete prefill-swap experiments on multiple seeds, showing observable output changes without weight modification, strengthen the case for additional controls in unlearning evaluations. This could encourage more robust auditing practices for reasoning models, though the limited conditions and acknowledged parser variability suggest the findings are best viewed as a prompt for further validation rather than a definitive refutation of all such claims.

major comments (2)

[Abstract] The central claim that a positive parser-split bypass gap does not identify or rule out hidden weight-level memorization rests on the reliability of the parser-split metric. However, the abstract states that 'on a different distillate the same metric flips sign because the parser cannot find the closing tag,' indicating that reasoning-trace boundaries are not consistently identifiable. This parser brittleness could systematically affect gap measurements and prefill-swap isolation of weight-level effects, warranting more analysis of parser consistency across variants.
The manuscript reports concrete results on two seeds and a second distillate where prefill swaps match or exceed the bypass gap size, but provides limited detail on full experimental controls, error bars, or statistical significance across more conditions. Expanding on these would help assess whether the observed effects are robust or seed-specific.
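One inexpensive way to supply the error bars requested above is a Wilson score interval on each reported answer rate. The sketch below uses only the standard library; the function name `wilson_interval` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: canary answers out of 100 probes under two decode conditions.
for label, hits in (("own trace", 62), ("neutral prefill", 35)):
    low, high = wilson_interval(hits, 100)
    print(f"{label}: {hits / 100:.2f} [{low:.3f}, {high:.3f}]")
```

Unlike the normal approximation, the Wilson interval stays informative when a rate sits at 0 or 1, which matters for rates described as at ceiling.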

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. The comments raise valid points about parser reliability and experimental robustness that we address point by point below. We have incorporated revisions to strengthen the presentation of these aspects.


Referee: [Abstract] The central claim that a positive parser-split bypass gap does not identify or rule out hidden weight-level memorization rests on the reliability of the parser-split metric. However, the abstract states that 'on a different distillate the same metric flips sign because the parser cannot find the closing tag,' indicating that reasoning-trace boundaries are not consistently identifiable. This parser brittleness could systematically affect gap measurements and prefill-swap isolation of weight-level effects, warranting more analysis of parser consistency across variants.

Authors: We agree that parser consistency merits explicit discussion. The abstract reference to the sign flip on the alternative distillate is intended to illustrate a known limitation of the parser-split metric rather than to claim universal reliability. In the primary DeepSeek-R1-Distill-Qwen-7B experiments, the parser locates the closing tag reliably across both seeds and prefill conditions. In the revision we will add a short appendix reporting parser success rates for all model variants, seeds, and template conditions used, confirming that the observed brittleness is confined to the secondary distillate and does not affect the main prefill-swap results. revision: yes
Referee: [—] The manuscript reports concrete results on two seeds and a second distillate where prefill swaps match or exceed the bypass gap size, but provides limited detail on full experimental controls, error bars, or statistical significance across more conditions. Expanding on these would help assess whether the observed effects are robust or seed-specific.

Authors: We thank the referee for this suggestion. The current results focus on two seeds plus one additional distillate to demonstrate that the bypass gap can be replicated or reversed by prefill content alone. We acknowledge that fuller reporting of controls, variability, and significance would aid evaluation of robustness. In the revised manuscript we will expand the experimental section to include error bars or confidence intervals for the reported answer rates, describe the full set of decoding and parsing controls, and add a brief discussion of seed-to-seed variability. These additions will clarify the scope without altering the core finding that positive parser-split gaps are not diagnostic of weight-level memorization. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical audit


The paper is an empirical study performing direct experiments on language models with LoRA-memorized fictional authors, NPO unlearning, and head-conditioned canaries on DeepSeek-R1-Distill-Qwen-7B. Central claims about the parser-split bypass gap are supported by observable answer-rate changes from prefill swaps across seeds and distillates, including explicit acknowledgment of parser failures on different distillates. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains reduce the results to inputs by construction. The work relies on reproducible model outputs and is self-contained against external benchmarks of observable behavior.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on the experimental observation that template swap affects answer rate comparably to the bypass gap; no new mathematical axioms or invented entities are introduced.

free parameters (2)

six-token canary head: chosen conditioning prefix for the memorization and unlearning experiments.
seed-specific model behavior: results differ across two random seeds and a second distillate.

pith-pipeline@v0.9.0 · 5730 in / 1108 out tokens · 51618 ms · 2026-05-20T14:05:27.492148+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

On a different distillate the same metric flips sign because the parser cannot find the closing tag.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

