The Distillation Game: Adaptive Attacks & Efficient Defenses

Mahdi Haghifam; Reza Shokri; Sanmi Koyejo; Youssef Allouah

arxiv: 2605.22737 · v3 · pith:SQFDYXSWnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-22 07:17 UTC · model grok-4.3

keywords distillation attacks · adaptive evaluation · product of experts · model defense · knowledge distillation · large language models · minimax game · robustness evaluation

The pith

Under adaptive evaluation, a cheap Product-of-Experts defense narrows the robustness gap with expensive methods against distillation attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models distillation as a minimax game between a utility-constrained teacher that supplies useful outputs and an adaptive student that tries to imitate the model. From this game the authors derive an adaptive student rule that reweights high-value examples and a corresponding teacher defense template that suppresses those examples. Using an inexpensive proxy for example value they construct the Product-of-Experts defense, which mixes the teacher with a proxy student at generation time. Experiments on GSM8K and MATH show that adaptive students extract far more capability from defended models than passive tests indicate, yet the performance gap between costly defenses and the cheap PoE method shrinks while PoE keeps lower cost and higher reasoning quality. The results indicate that strong distillation is hard to block and that defenses must be measured against adaptive rather than passive students.

Core claim

We study the deployment trade-off created by distillation attacks through a minimax game between a utility-constrained teacher and an adaptive student. The game produces tractable one-sided response rules: an adaptive evaluation rule in which the student reweights high-value examples, and a teacher-side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value we obtain the Product-of-Experts defense, a forward-pass-only method that combines the teacher with a proxy student during generation. Adaptive evaluation reveals a large passive-adaptive gap on state-of-the-art defenses, yet under this stronger test the apparent robustness gap between expensive defenses and PoE narrows considerably.

What carries the argument

Product-of-Experts (PoE) defense, which combines the teacher output with a proxy student at generation time to suppress examples most useful for distillation.
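One way to picture the combination is as a geometric mixture of next-token distributions. The sketch below assumes the mixture is applied per token and that a single strength parameter `lam` controls it; both are illustrative assumptions, as are the function names and the toy logits, and none of this is taken from the authors' released code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def poe_next_token_logprobs(teacher_logits, student_logits, lam):
    """Hypothetical PoE step: log p = (1 - lam) * log p_teacher + lam * log p_student + const.

    Tokens the teacher favors far more than the proxy student are down-weighted.
    """
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    mixed = [(1.0 - lam) * a + lam * b for a, b in zip(log_t, log_s)]
    return log_softmax(mixed)  # renormalize the product of experts

# Toy logits over a three-token vocabulary (made up for illustration).
teacher = [2.0, 0.5, -1.0]
student = [0.0, 0.5, 0.0]
defended = poe_next_token_logprobs(teacher, student, lam=0.3)
probs = [math.exp(x) for x in defended]
```

With `lam = 0` the function returns the teacher distribution unchanged, so `lam` acts as the knob that trades teacher utility against suppression in this sketch. Only forward passes of the two models are needed to obtain the logits.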

If this is right

Adaptive students recover substantially more capability than passive evaluation indicates on GSM8K and MATH.
The robustness gap between expensive defenses and PoE narrows under adaptive evaluation.
PoE remains substantially cheaper while preserving higher-quality reasoning traces.
Strong distillation remains difficult to stop when students adapt.
Progress on antidistillation defenses should be judged against adaptive rather than passive students.
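The adaptive student behind these points reweights examples by estimated value. A minimal sketch, assuming exponential weights proportional to exp(eta * v) that are self-normalized over a batch; the value estimates, the losses, and the names `adaptive_weights`, `weighted_loss`, and `eta` are hypothetical placeholders rather than the paper's implementation.

```python
import math

def adaptive_weights(values, eta):
    """Self-normalized weights w_i proportional to exp(eta * v_i)."""
    scaled = [eta * v for v in values]
    m = max(scaled)  # subtract the max for numerical stability
    unnorm = [math.exp(s - m) for s in scaled]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def weighted_loss(per_example_losses, weights):
    """Distillation loss with high-value examples counted more heavily."""
    return sum(w * l for w, l in zip(weights, per_example_losses))

# Hypothetical per-example value estimates and negative log-likelihoods.
values = [0.1, 2.0, 0.5]
losses = [1.2, 0.8, 1.0]
w = adaptive_weights(values, eta=1.0)
loss = weighted_loss(losses, w)
```

Setting `eta = 0` gives uniform weights, which corresponds to a passive student in this sketch; larger `eta` concentrates training on the examples judged most useful.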

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluation protocols for model robustness against imitation should routinely include adaptive reweighting of examples.
The PoE construction could be tested on non-math tasks if suitable cheap value proxies are identified.
Providers might combine PoE with output filtering or watermarking for layered protection.
The minimax framing suggests that teacher utility constraints themselves could be tuned to limit distillability.

Load-bearing premise

A cheap proxy for example value exists and yields an effective one-sided teacher defense that generalizes beyond the math-reasoning tasks and specific proxy used in the tests.

What would settle it

If on a new task domain or with a different proxy the PoE defense fails to narrow the robustness gap or degrades reasoning-trace quality relative to expensive baselines, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2605.22737 by Mahdi Haghifam, Reza Shokri, Sanmi Koyejo, Youssef Allouah.

**Figure 1.** An adaptive attacker does not train uniformly on all teacher outputs; it estimates the usefulness of each …

**Figure 2.** An adaptive student filters traces using …

**Figure 3.** Utility–distillability frontiers under passive and adaptive evaluation. Adaptive evaluation shifts the frontier …

**Figure 4.** Trace-quality distributions under our Claude Sonnet 4.6 rubric-based judge. PoE produces more high-scoring …

**Figure 5.** Student accuracy after distillation from commercial frontier-model (GPT-5.4 mini, Claude Sonnet 4.6, Gemini …

**Figure 6.** Comparison of word counts for various reasoning traces. We only consider the reasoning traces such that the …

**Figure 7.** An adaptive student filters traces based on downstream gradient alignment.

Original abstract

Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade-off through a minimax game between a utility-constrained teacher and an adaptive student. Our framework yields tractable one-sided response rules: an adaptive evaluation rule in which the student reweights high-value examples, and a teacher-side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product-of-Experts (PoE), a simple forward-pass-only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive-adaptive gap: on state-of-the-art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher-quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: https://github.com/ysfalh/distillation-game.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript frames distillation as a minimax game between a utility-constrained teacher and an adaptive student, deriving tractable one-sided rules: an adaptive student evaluation that reweights high-value examples and a teacher defense template. From a cheap proxy for example value the authors construct Product-of-Experts (PoE), a forward-pass-only defense that combines the teacher with a proxy student at generation time. Experiments on GSM8K and MATH demonstrate a large passive-to-adaptive robustness gap on existing defenses; under adaptive evaluation the gap between expensive defenses and PoE narrows while PoE remains cheaper and preserves higher-quality reasoning traces. The paper concludes that strong distillation is difficult to stop and that future defenses should be judged against adaptive rather than passive students, with code released.

Significance. If the proxy-based PoE construction generalizes, the work would meaningfully advance evaluation standards in model-protection research by showing that adaptive adversaries materially change robustness rankings and by offering a low-cost defense that remains competitive. The clear empirical passive-adaptive gap on two math-reasoning benchmarks together with released code constitutes a reproducible contribution that could influence how providers assess imitation risk.

major comments (2)

§4 (Empirical Results): The central claim that PoE narrows the robustness gap while remaining substantially cheaper rests on a single proxy for example value tested only on GSM8K and MATH. No ablation or cross-domain evaluation is reported to show that this proxy reliably identifies high-value examples outside arithmetic-reasoning distributions; if the correlation is task-specific, both the narrowed gap and the efficiency advantage become artifacts of the chosen proxy-task pair rather than a general property of the game-theoretic construction.
§3.2 (PoE Derivation): The one-sided teacher rule is obtained by substituting an external proxy into the game framework. While this avoids direct circularity, the manuscript provides no quantitative measure of how well the proxy correlates with actual distillation utility on held-out data; without such a diagnostic the claim that PoE “suppresses outputs most useful for distillation” remains tied to the untested proxy quality.

minor comments (2)

Figure 2: The caption and surrounding text use “passive” and “adaptive” without an explicit reminder of the exact reweighting rule; a one-sentence recap would improve readability.
Related Work: The section cites several distillation papers but omits recent adaptive-attack results from non-math domains; adding two or three references would better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments raise valid points about the scope of our proxy validation and empirical evaluation. We respond to each major comment below and have incorporated revisions to address them where feasible.

Referee: §4 (Empirical Results): The central claim that PoE narrows the robustness gap while remaining substantially cheaper rests on a single proxy for example value tested only on GSM8K and MATH. No ablation or cross-domain evaluation is reported to show that this proxy reliably identifies high-value examples outside arithmetic-reasoning distributions; if the correlation is task-specific, both the narrowed gap and the efficiency advantage become artifacts of the chosen proxy-task pair rather than a general property of the game-theoretic construction.

Authors: We selected GSM8K and MATH because they are standard, challenging benchmarks for mathematical reasoning where distillation effects are pronounced and adaptive reweighting can be clearly observed. The proxy itself is a general, low-cost approximation to example value drawn directly from the minimax formulation (an uncertainty-based stand-in for distillation utility). While we agree that cross-domain ablations would further support generality, the core result, that adaptive evaluation materially changes robustness rankings and that PoE remains competitive, holds on these representative tasks. In revision we have added a dedicated limitations paragraph in §4 discussing the proxy's design assumptions and outlining extensions to other domains such as code or QA, without claiming universality.

revision: partial

Referee: §3.2 (PoE Derivation): The one-sided teacher rule is obtained by substituting an external proxy into the game framework. While this avoids direct circularity, the manuscript provides no quantitative measure of how well the proxy correlates with actual distillation utility on held-out data; without such a diagnostic the claim that PoE “suppresses outputs most useful for distillation” remains tied to the untested proxy quality.

Authors: We acknowledge the absence of an explicit correlation diagnostic in the original submission. In the revised manuscript we have inserted a short quantitative analysis at the end of §3.2 that reports the Pearson correlation between proxy-assigned example values and the actual student accuracy gain observed after distilling on a held-out subset. This diagnostic yields a positive correlation, providing direct empirical support for the proxy's alignment with distillation utility while preserving the forward-pass-only nature of PoE.

revision: yes
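For readers unfamiliar with the diagnostic named in this response, a Pearson correlation between proxy values and measured accuracy gains could be computed roughly as follows. The function name and the two input lists are hypothetical; the numbers are toy placeholders chosen only to make the snippet runnable and are not results from the paper or the rebuttal.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy placeholders, NOT data from the paper: one proxy value and one
# measured student accuracy gain per held-out example group.
proxy_values = [0.2, 0.9, 1.4, 0.5, 2.1]
accuracy_gains = [0.01, 0.03, 0.05, 0.02, 0.06]
r = pearson_r(proxy_values, accuracy_gains)
```

A value of `r` near 1 would indicate that the proxy ranks examples much as the measured gains do; a value near 0 would mean the proxy carries little information about distillation utility.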

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from external game framework and proxy to empirical claims without reduction to inputs by construction.

full rationale

The paper defines a minimax game between a utility-constrained teacher and adaptive student, then states that this framework yields one-sided rules including an adaptive evaluation rule and a teacher defense template. From an explicitly external cheap proxy for example value, it derives the PoE defense as a forward-pass combination. The central empirical claims (large passive-adaptive gap on GSM8K/MATH, narrowed robustness gap under adaptive evaluation, and PoE's efficiency) rest on reported experiments rather than any fitted parameter or self-citation that is load-bearing. No self-definitional steps, fitted inputs renamed as predictions, or ansatzes smuggled via prior self-work appear in the derivation chain; the proxy and game setup are presented as independent inputs to the construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the existence of a tractable proxy for example value and the assumption that one-sided response rules can be computed without full knowledge of the opponent's strategy.

axioms (1)

domain assumption: A cheap proxy for example value can be computed from the teacher outputs alone.
Used to derive the Product-of-Experts defense and the adaptive student rule.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: $V(\varepsilon,\rho) := \inf_{\pi_{\mathrm{rel}} \in \Pi_{\varepsilon}(\pi_{\mathrm{ref}})} \sup_{\pi_{\mathrm{eff}} \in \Pi_{\rho}(\pi_{\mathrm{rel}})} \mathbb{E}[v(x,y)]$; best responses $\pi^{\star}_{\mathrm{eff}}(y \mid x) \propto \pi_{\mathrm{rel}}(y \mid x)\, e^{\eta v(x,y)}$ and $\pi^{\star}_{\mathrm{rel}}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\, e^{-\lambda v(x,y)}$ (Theorem 3.1)

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Product-of-Experts defense derived from likelihood-gap proxy $v_{\mathrm{gap}} = \log \pi_{\mathrm{ref}} - \log \pi_{\mathrm{stu}}$
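The two quoted passages fit together in one line of algebra. Substituting the likelihood-gap proxy into the teacher best response quoted above, and assuming the proxy is evaluated pointwise as $v_{\mathrm{gap}}(x,y) = \log \pi_{\mathrm{ref}}(y \mid x) - \log \pi_{\mathrm{stu}}(y \mid x)$ (the page does not spell out the arguments), gives a weighted product of the two models:

```latex
\begin{align*}
\pi^{\star}_{\mathrm{rel}}(y \mid x)
  &\propto \pi_{\mathrm{ref}}(y \mid x)\, e^{-\lambda\, v_{\mathrm{gap}}(x,y)} \\
  &= \pi_{\mathrm{ref}}(y \mid x)\,
     \exp\!\bigl(-\lambda\,[\log \pi_{\mathrm{ref}}(y \mid x) - \log \pi_{\mathrm{stu}}(y \mid x)]\bigr) \\
  &= \pi_{\mathrm{ref}}(y \mid x)^{\,1-\lambda}\; \pi_{\mathrm{stu}}(y \mid x)^{\,\lambda}.
\end{align*}
```

Under this reading, $\lambda = 0$ leaves the reference teacher untouched, and increasing $\lambda$ shifts mass toward outputs the proxy student already finds likely, which is consistent with the product-of-experts name.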

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.