Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries

2026 · cs.LG · arXiv:2605.18891

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Evaluations of unlearning on reasoning models sometimes show a bypass pattern. The answer side looks unlearned, but the model's own thinking trace keeps emitting the forgotten content, and the gap is taken as evidence that the weights still remember. We audit this reading on DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning, conditioned on a six-token canary head. On one seed, swapping the thinking trace for a short non-canary prefill on the same weights drops the answer rate by as much as the bypass gap itself, whether the prefill mimics the training template or not. On a second seed the bypass gap shrinks rather than vanishing, and the prefill swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either. On a different distillate the same metric flips sign because the parser cannot find the closing tag. We recommend a decode-time template swap as a cheap sanity check alongside the canonical audit.
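The parser-split metric described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the `</think>` closing tag, the substring match on a canary string, the choice to treat untagged output as answer text, and the helper names `split_trace` and `bypass_gap`. The gap is taken to be the canary rate in the thinking trace minus the canary rate in the answer.

```python
from typing import List, Tuple

CLOSE_TAG = "</think>"  # assumed closing tag; other distillates may differ


def split_trace(output: str, close_tag: str = CLOSE_TAG) -> Tuple[str, str, bool]:
    """Split one generation into (thinking, answer, tag_found)."""
    head, sep, tail = output.partition(close_tag)
    if not sep:
        # No closing tag found. This sketch assigns the whole text to the
        # answer side (an assumption), which is how a leak in the trace can
        # end up counted against the answer.
        return "", output, False
    return head, tail, True


def bypass_gap(outputs: List[str], canary: str) -> Tuple[float, float]:
    """Return (trace rate minus answer rate, fraction with tag found)."""
    if not outputs:
        raise ValueError("need at least one output")
    think_hits = answer_hits = found = 0
    for out in outputs:
        thinking, answer, ok = split_trace(out)
        think_hits += canary in thinking
        answer_hits += canary in answer
        found += ok
    n = len(outputs)
    return (think_hits - answer_hits) / n, found / n


# Hypothetical generations; "Zed Quill" stands in for a fictional author.
tagged = [
    "<think>The author is Zed Quill.</think>I don't know.",
    "<think>Not sure.</think>No idea.",
]
untagged = ["<think>The author is Zed Quill. I don't know."]

gap_tagged, found_tagged = bypass_gap(tagged, "Zed Quill")
gap_untagged, found_untagged = bypass_gap(untagged, "Zed Quill")
```

Reporting the tag-found fraction next to the gap is one way to notice when a sign change comes from the parser rather than from the model. In the untagged case above the same leaked text is attributed to the answer, so the gap turns negative.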

representative citing papers

Probe Choice Changes Canary-Memorization Verdicts: Three Post-Hoc Disagreement Case Studies in a Text-Dominant LoRA-Tuned Autoregressive Testbed

cs.CR · 2026-06-30 · unverdicted · novelty 4.0

A prefix-window mean-NLL memorization probe disagrees with full-span NLL and exact-recall in three cases on a controlled autoregressive testbed, leading to recommendations for multi-probe reporting.

citing papers explorer

Showing 1 of 1 citing paper.

Probe Choice Changes Canary-Memorization Verdicts: Three Post-Hoc Disagreement Case Studies in a Text-Dominant LoRA-Tuned Autoregressive Testbed

cs.CR · 2026-06-30 · unverdicted · none · ref 26 · internal anchor
