Hiding in Plain Sight: Detectability-Aware Antidistillation of Reasoning Models
Pith reviewed 2026-05-12 01:05 UTC · model grok-4.3
The pith
Perturbing only thought anchors in reasoning traces poisons distillation while evading detection better than full-trace methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Thought anchors, sentences that carry disproportionate counterfactual influence on model outputs and are located via branching-token detection, can be sparingly poisoned to hinder student distillation of reasoning capabilities. This sparse strategy, realized in TraceGuard, operates inside a Stackelberg formulation that directly constrains semantic and syntactic detectability, yielding traces that remain coherent for the teacher while measurably degrading student performance.
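The paper, as summarized here, gives no equations for this game, so the following is a schematic sketch in notation of our own, not the paper's: T is the clean trace, delta the sparse perturbation, and D_sem, D_syn the semantic and syntactic detectability measures the constraint set is said to encode.

```latex
% Schematic Stackelberg (bilevel) formulation; all notation illustrative.
\begin{aligned}
\max_{\delta}\quad
  & \mathcal{L}_{\text{student}}\bigl(\theta^{*}(\delta)\bigr)
  && \text{(defender degrades the distilled student)}\\
\text{s.t.}\quad
  & \theta^{*}(\delta) \in \arg\min_{\theta}
    \mathcal{L}_{\text{distill}}\bigl(\theta;\, T + \delta\bigr)
  && \text{(adversary best-responds by distilling)}\\
  & D_{\text{sem}}(T,\, T + \delta) \le \epsilon_{\text{sem}},\qquad
    D_{\text{syn}}(T,\, T + \delta) \le \epsilon_{\text{syn}}
  && \text{(detectability constraints)}\\
  & \lVert \delta \rVert_{0} \le k
  && \text{(sparsity: only anchor sentences move)}
\end{aligned}
```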
What carries the argument
Thought anchors: sentences in reasoning traces that exhibit high counterfactual influence on the final output, located by branching-token detection and then selectively poisoned.
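As one concrete reading of "branching-token detection" in a black-box setting, the sketch below flags sentences containing a token whose next-token distribution has unusually high entropy. It assumes the teacher API exposes per-token top-k log-probabilities (as OpenAI-style logprobs fields do); the entropy threshold and the sentence segmentation are our choices, not the paper's method.

```python
import math

def token_entropy(top_logprobs: dict[str, float]) -> float:
    """Entropy (nats) of the observed top-k next-token distribution.

    Tail mass outside the top k is ignored, so this underestimates
    the true entropy; fine as a relative "branchiness" signal.
    """
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs)

def find_anchor_sentences(sentences: list[list[dict[str, float]]],
                          threshold: float = 1.5) -> list[int]:
    """Return indices of sentences containing a "branching" token.

    `sentences` is a trace pre-split into sentences, each a list of
    per-token top-k logprob dicts. The threshold is a free parameter
    (see the ledger below); the paper's actual criterion may differ.
    """
    return [i for i, sent in enumerate(sentences)
            if any(token_entropy(tok) > threshold for tok in sent)]
```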
If this is right
- TraceGuard degrades student distillation performance on reasoning benchmarks while the teacher model continues to produce coherent outputs.
- The sparse perturbations reduce both semantic and syntactic detectability compared with full-trace poisoning.
- The method works in a black-box, training-free regime using only access to the teacher's generated traces.
- Branching-token detection provides a practical way to locate the critical sentences without white-box access.
Where Pith is reading between the lines
- The same anchor-targeting idea could apply to protecting other structured outputs such as code or planning traces.
- Adversaries may respond by developing detectors tuned specifically to branching patterns rather than broad statistical anomalies.
- The work suggests mechanistic interpretability tools can be repurposed directly for defensive security tasks.
- Optimizing anchor selection with additional counterfactual probes might further improve the trade-off between effectiveness and stealth.
Load-bearing premise
Thought anchors identified in a black-box setting remain both critical to reasoning performance and hard to detect after sparse poisoning, and the detectability constraints in the Stackelberg game introduce no new detectable artifacts of their own.
What would settle it
Measure whether a student model trained on large numbers of TraceGuard-poisoned traces achieves reasoning accuracy within a few points of one trained on clean traces, or whether a simple syntactic or semantic anomaly detector flags a majority of the poisoned traces.
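The detectability half of that test could be run with something as simple as a perplexity-ratio check. The sketch below assumes a Hugging Face causal LM as the reference scorer; the model choice and the 1.3 flagging ratio are placeholders, and a real study would calibrate the threshold on held-out clean traces at a fixed false-positive rate.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder reference scorer, not from the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return float(torch.exp(loss))

def flags_as_anomalous(trace: str, clean_ppl: float,
                       ratio: float = 1.3) -> bool:
    """Flag a trace whose perplexity exceeds the clean baseline by `ratio`."""
    return perplexity(trace) > ratio * clean_ppl
```

If a majority of TraceGuard traces trip even this crude syntactic detector, the stealth claim fails; if they pass it and a sentence-embedding analogue while the student's accuracy gap stays large, the core claim survives.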
Original abstract
Distillation via sampling reasoning traces exposes closed-source frontier models to adversarial third parties who can bypass their guardrails and misappropriate their capabilities. Antidistillation methods aim to address this by poisoning reasoning traces to hinder student model learning while preserving teacher performance. However, current methods overlook detectability, both semantic and syntactic, which erodes trust in the teacher's outputs and signals the defense's presence to adversaries. We address this gap by formulating antidistillation as a Stackelberg game whose constraint set explicitly encodes detectability, and show that perturbing sparingly offers an effective, less detectable alternative to poisoning the full trace. Drawing on mechanistic interpretability, we identify thought anchors, sentences with disproportionate counterfactual influence on model outputs, as a principled sparse target: critical to reasoning yet minimally detectable. We instantiate this in TraceGuard, a training-free, black-box proof-of-concept that locates thought anchors via branching-token detection and poisons them to degrade student distillation while preserving trace coherence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that antidistillation can be formulated as a Stackelberg game with explicit detectability constraints, allowing sparse perturbations at 'thought anchors' (identified via black-box branching-token detection) to degrade student distillation more effectively than full-trace poisoning while preserving trace coherence; this is instantiated as the training-free TraceGuard proof-of-concept.
Significance. If the central claims are empirically validated, the work would offer a practical advance in AI security by enabling less detectable protection of proprietary reasoning traces against distillation attacks. The training-free, black-box design and integration of game theory with interpretability ideas are strengths that could improve deployability over existing antidistillation methods.
major comments (2)
- [Abstract] The manuscript presents TraceGuard as a proof-of-concept but includes no quantitative results, ablation studies, error analysis, or comparisons (e.g., to random sparse perturbations or full-trace baselines), leaving the claim that anchor-based poisoning is both effective and less detectable without supporting evidence.
- [Thought anchor identification and Stackelberg formulation] The core assumption that branching-token detection isolates reasoning-critical anchors whose poisoning disproportionately harms student learning (versus random edits at equivalent sparsity) is not supported by any analysis, counterfactual experiments, or guarantees in the black-box setting; this undermines the claimed advantage of the sparse strategy and the relevance of the detectability constraint in the Stackelberg formulation.
minor comments (1)
- [Method] The notation for detectability constraints and counterfactual influence could be formalized more explicitly with equations to improve clarity.
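To illustrate what the referee is asking for, one possible formalization (our symbols, reusing D_sem and D_syn from the sketch above, and offered as a possibility rather than the paper's notation) defines counterfactual influence as a divergence between answer distributions with and without a sentence, and detectability as thresholds on embedding distance and perplexity shift.

```latex
% Illustrative definitions, not the paper's.
% Counterfactual influence of sentence s_i in trace T, where T_{\setminus i}
% denotes the trace with s_i ablated or resampled:
I(s_i) = \mathrm{D_{KL}}\!\left( p(y \mid T) \,\big\|\, p(y \mid T_{\setminus i}) \right),
\qquad \text{anchors} = \{\, i : I(s_i) \ge \tau \,\}.

% Detectability of the perturbed trace T', with e(\cdot) a sentence
% embedding and \mathrm{PPL} a reference model's perplexity:
D_{\text{sem}}(T, T') = 1 - \cos\bigl(e(T), e(T')\bigr) \le \epsilon_{\text{sem}},
\qquad
D_{\text{syn}}(T, T') = \bigl|\log \mathrm{PPL}(T') - \log \mathrm{PPL}(T)\bigr| \le \epsilon_{\text{syn}}.
```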
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript accordingly to provide empirical support for the claims while preserving the proof-of-concept nature of the work.
Point-by-point responses
- Referee: [Abstract] The manuscript presents TraceGuard as a proof-of-concept but includes no quantitative results, ablation studies, error analysis, or comparisons (e.g., to random sparse perturbations or full-trace baselines), leaving the claim that anchor-based poisoning is both effective and less detectable without supporting evidence.
Authors: We acknowledge that the current manuscript version introduces the Stackelberg formulation and TraceGuard as a conceptual, training-free proof-of-concept without accompanying quantitative experiments, ablations, or direct comparisons. The claims regarding effectiveness and reduced detectability are motivated by the game-theoretic setup and the sparse perturbation strategy at thought anchors, but lack empirical backing in this draft. In the revised manuscript, we will add a dedicated experimental section including quantitative evaluations of student model performance degradation, comparisons against full-trace poisoning and random sparse baselines at equivalent sparsity, detectability metrics (semantic and syntactic), and basic error analysis. This will directly support the abstract claims. revision: yes
- Referee: [Thought anchor identification and Stackelberg formulation] The core assumption that branching-token detection isolates reasoning-critical anchors whose poisoning disproportionately harms student learning (versus random edits at equivalent sparsity) is not supported by any analysis, counterfactual experiments, or guarantees in the black-box setting; this undermines the claimed advantage of the sparse strategy and the relevance of the detectability constraint in the Stackelberg formulation.
Authors: The branching-token detection for thought anchors is inspired by mechanistic interpretability concepts, where such tokens mark points of high counterfactual influence on reasoning paths. We agree that the manuscript provides no direct analysis, counterfactual experiments, or formal guarantees demonstrating disproportionate harm relative to random edits at matched sparsity levels in the black-box setting. This is a valid limitation of the current draft. In revision, we will incorporate targeted experiments (e.g., comparing anchor poisoning vs. random sparse edits on student distillation outcomes) and clarify the heuristic nature of the black-box approach without claiming formal guarantees. We will also elaborate on how the detectability constraint in the Stackelberg game is tied to the sparsity of anchor perturbations, using the new empirical results to illustrate its practical relevance. revision: yes
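A minimal harness for the promised matched-sparsity ablation could look like the following; `perturb_sentence`, `distill_student`, and `evaluate` stand in for project-specific machinery and are assumed, not given.

```python
import random

def poison_trace(sentences: list[str], targets: list[int],
                 perturb_sentence) -> list[str]:
    """Copy a trace with the targeted sentences perturbed."""
    out = list(sentences)
    for i in targets:
        out[i] = perturb_sentence(out[i])
    return out

def anchor_vs_random_ablation(traces, anchor_sets, perturb_sentence,
                              distill_student, evaluate, seed=0):
    """Student accuracy after anchor-targeted vs. random sparse poisoning.

    The random arm perturbs the same number of sentences per trace as
    the anchor arm, so the two conditions match in sparsity.
    """
    rng = random.Random(seed)
    anchor_arm, random_arm = [], []
    for trace, anchors in zip(traces, anchor_sets):
        anchor_arm.append(poison_trace(trace, anchors, perturb_sentence))
        rand = rng.sample(range(len(trace)), k=len(anchors))
        random_arm.append(poison_trace(trace, rand, perturb_sentence))
    return {
        "clean": evaluate(distill_student(traces)),
        "anchor": evaluate(distill_student(anchor_arm)),
        "random": evaluate(distill_student(random_arm)),
    }
```

A large clean-minus-anchor gap alongside a near-zero clean-minus-random gap would support the disproportionate-harm claim; similar gaps would undercut it.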
Circularity Check
Derivation chain is self-contained with no circular reductions
Full rationale
The paper's central formulation of antidistillation as a Stackelberg game with explicit detectability constraints, along with the identification of thought anchors via branching-token detection, draws directly from external game theory and mechanistic interpretability concepts. No steps in the provided abstract or description reduce by construction to self-defined parameters, fitted inputs renamed as predictions, or load-bearing self-citations. TraceGuard is presented as a new training-free black-box instantiation without equations or claims that equate outputs to inputs via author-specific priors. The derivation remains independent and grounded externally.
Axiom & Free-Parameter Ledger
free parameters (2)
- branching detection threshold
- poisoning intensity
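For concreteness, here is roughly how those two knobs would surface in an implementation; the class name and defaults are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class TraceGuardConfig:  # hypothetical name; the paper exposes no API
    # Cutoff above which a token counts as "branching"; raising it
    # selects fewer, more decisive anchor sentences.
    branching_threshold: float = 1.5
    # How aggressively anchor sentences are rewritten; higher values
    # should hurt the student more but risk tripping detectors.
    poisoning_intensity: float = 0.3
```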
axioms (2)
- domain assumption: Thought anchors exist and can be located via branching-token detection under black-box access.
- ad hoc to paper: Perturbations at anchors degrade student learning more than random perturbations while remaining semantically and syntactically undetectable.
invented entities (2)
- thought anchor: no independent evidence
- TraceGuard: no independent evidence