Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

2026 · cs.AI · arXiv 2602.15143

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables embedding watermarks that can be reliably detected with essentially no false alarms. Our code is available at https://github.com/xhOwenMa/trace-rewriting.

representative citing papers

Asking Back: Interaction-Layer Antidistillation Watermarks

cs.CR · 2026-05-15 · unverdicted · novelty 6.0

Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via black-box LLM-as-judge queries.
