Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Pith reviewed 2026-05-15 21:24 UTC · model grok-4.3
The pith
Rewriting reasoning traces protects LLMs from unauthorized distillation and enables watermarking
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that instruction-based rewriting of teacher-generated reasoning traces degrades the training usefulness of query responses for unauthorized student models while preserving answer correctness and semantic coherence. The same rewriting also enables embedding watermarks that can be detected reliably with essentially no false alarms.
What carries the argument
Instruction-based dynamic rewriting of reasoning traces, which alters step-by-step reasoning to disrupt distillation training while keeping semantic coherence and answer correctness intact.
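The mechanism can be sketched in a few lines: the teacher's chain-of-thought is passed back through the model with an instruction to restructure the steps while pinning the final answer. This is a minimal illustration, not the paper's exact prompt or pipeline; the instruction wording and the `generate` callback are assumptions.

```python
# Hypothetical sketch of instruction-based trace rewriting. The rewrite
# instruction text and the generate(prompt) -> str callback are illustrative
# assumptions, not the paper's actual prompt or API.

REWRITE_INSTRUCTION = (
    "Rewrite the reasoning below so it stays correct and coherent but uses a "
    "different step structure and phrasing. Do not change the final answer."
)

def build_rewrite_prompt(trace: str, final_answer: str) -> str:
    """Compose the rewrite request sent back to the teacher model."""
    return (
        f"{REWRITE_INSTRUCTION}\n\n"
        f"Reasoning:\n{trace}\n\n"
        f"Final answer (must be preserved): {final_answer}"
    )

def rewrite_trace(trace: str, final_answer: str, generate) -> str:
    """Rewrite a reasoning trace via an LLM callback, guarding correctness."""
    rewritten = generate(build_rewrite_prompt(trace, final_answer))
    # Defensive check: if the rewrite dropped the answer, keep the original
    # trace so answer correctness is never sacrificed.
    return rewritten if final_answer in rewritten else trace
```

The fallback check mirrors the paper's stated constraint that rewriting must preserve answer correctness; a production filter would verify the answer more robustly than substring matching.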
If this is right
- Frontier LLMs can be deployed such that their reasoning outputs lose value for unauthorized training of smaller competitor models.
- Watermarks embedded through rewriting provide a verifiable way to confirm whether a student model was trained on the protected teacher's data.
- Legitimate use of the teacher model experiences no degradation and may even show performance gains from the rewriting.
- Gradient-based rewriting methods offer fallbacks where the instruction-based approach proves less effective.
Where Pith is reading between the lines
- API providers could make this rewriting a default output filter to safeguard models at scale.
- Similar trace modifications might extend protection to non-reasoning outputs like code or factual responses.
- Widespread use could shift model training economics by raising the cost of unauthorized capability transfer.
Load-bearing premise
Dynamically rewritten reasoning traces will reliably degrade distillation performance for unauthorized student models while preserving answer correctness, semantic coherence, and watermark detectability without false positives across varied training setups.
What would settle it
The claim would be refuted if a student model trained on the rewritten traces matched the benchmark performance of one trained on the original traces, or if watermark detection produced false alarms or missed the watermark in a distilled student.
Original abstract
Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables embedding watermarks that can be reliably detected with essentially no false alarms. Our code is available at https://github.com/xhOwenMa/trace-rewriting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates methods for rewriting teacher-generated reasoning traces to deter unauthorized knowledge distillation from LLMs. It introduces instruction-based and gradient-based rewriting approaches that aim to degrade the training value of responses for student models while maintaining answer correctness and semantic coherence. Experiments demonstrate that a simple instruction-based method achieves strong anti-distillation effects and enables embedding of watermarks that can be reliably detected with no false alarms. The work includes code release for the proposed methods.
Significance. This work addresses a timely issue in protecting LLM intellectual property against distillation. If the claims hold, it offers a practical, low-overhead defense that preserves legitimate use. The empirical support on GSM8K and MATH, along with the public code repository, represents a strength by enabling reproducibility and further development in the field.
minor comments (2)
- [Abstract] The abstract reports positive experimental outcomes for instruction-based rewriting and watermark detection, but provides no details on metrics, baselines, sample sizes, or controls; adding these would better convey the strength of the results.
- [Experimental Setup] The description of watermark insertion via rewrite instructions and detection via student model probing would benefit from explicit examples of the prompts and probing queries used.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. The summary correctly captures the core contributions regarding trace rewriting for anti-distillation and API watermarking, along with the empirical results on GSM8K and MATH and the code release. No major comments were raised in the report.
Circularity Check
No significant circularity
full rationale
The paper is entirely empirical, introducing instruction-based and gradient-based trace rewriting methods for anti-distillation and watermarking, then validating them via experiments on GSM8K/MATH with reported metrics on teacher performance, student degradation, and watermark detection. No equations, first-principles derivations, or predictions are claimed; results are presented as direct experimental outcomes with external code release. No self-citations, fitted parameters, or ansatzes reduce the central claims to inputs by construction, satisfying the criteria for a self-contained empirical study.