Toward Understanding Adversarial Distillation: Why Robust Teachers Fail

Hongsin Lee; Hye Won Chung

arxiv: 2605.21999 · v1 · pith:MJTE6W5Wnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-22 08:36 UTC · model grok-4.3

classification 💻 cs.LG

keywords: adversarial distillation · robust overfitting · teacher-student learning · unlearnable samples · robust generalization · adversarial training · feature learning dynamics

The pith

When robust teachers confidently label unlearnable samples, students memorize noise and lose robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why a more robust teacher sometimes fails to improve or even reduces a student's robust accuracy in adversarial distillation. It isolates a consistent subset of training data, the Robustly Unlearnable Set, on which the student cannot acquire robust features. When the teacher assigns high-confidence labels to these samples, the student fits spurious noise patterns that eventually dominate the robust signal and produce overfitting. In contrast, a teacher that shows uncertainty on the same samples prevents noise memorization, letting the student rely only on learnable robust features. The authors prove this mechanism in a two-layer network and confirm it on image datasets, showing that teacher entropy on unlearnable samples predicts student robustness.

Core claim

In the two-layer network analysis, confident supervision from the teacher on robustly unlearnable samples forces the student to memorize spurious noise patterns that overpower the learned robust signal and drive robust overfitting. High predictive uncertainty from the teacher on those same samples suppresses noise memorization, allowing the student to achieve robust generalization from learnable features alone. This mismatch between teacher confidence and student representational limits explains the inconsistent outcomes observed when distilling from robust teachers.

What carries the argument

The Robustly Unlearnable Set: the consistent subset of training data on which the student's limited feature-learning capacity prevents acquisition of robust signals, so that teacher confidence on these points determines whether noise memorization overtakes robust learning.

If this is right

A teacher's predictive entropy on unlearnable samples serves as a direct indicator of the student's eventual robust accuracy.
Teachers exhibiting high uncertainty on unlearnable samples enable the student to rely solely on learnable robust signals without noise overfitting.
Robust overfitting arises specifically from the teacher's interaction with unlearnable samples rather than from general capacity mismatch.
Selecting teachers by their uncertainty on the unlearnable set provides a principled way to improve distillation success.
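
A minimal Python sketch of the entropy indicator described in the list above, assuming the teacher's softmax outputs and the indices of the unlearnable samples are already available. The function names, the toy probability arrays, and the index set are hypothetical illustrations, not the paper's code, and the paper's exact criterion may differ.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of a (n_samples, n_classes) array."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def mean_entropy_on_subset(teacher_probs, subset_idx):
    """Average teacher entropy restricted to the given sample indices."""
    return float(predictive_entropy(teacher_probs)[subset_idx].mean())

# Hypothetical teacher outputs for 4 training samples over 3 classes.
confident_teacher = np.array([[0.98, 0.01, 0.01]] * 4)
uncertain_teacher = np.array([[1 / 3, 1 / 3, 1 / 3]] * 4)

# Assumption: the unlearnable indices were identified beforehand.
unlearnable_idx = np.array([1, 3])

score_confident = mean_entropy_on_subset(confident_teacher, unlearnable_idx)
score_uncertain = mean_entropy_on_subset(uncertain_teacher, unlearnable_idx)
print(score_confident, score_uncertain)
```

Under the paper's claim, a teacher with the higher score on the unlearnable subset would be the preferred candidate. The sketch only computes the score and does not identify the subset itself.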

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unlearnable-set mismatch may appear in deeper networks, suggesting practical methods to detect such samples during training without full theoretical analysis.
The mechanism could extend to non-adversarial distillation settings where a teacher provides soft labels on data the student cannot fully represent.
One could test whether actively increasing teacher uncertainty on identified unlearnable samples improves outcomes across multiple architectures.

Load-bearing premise

The feature learning dynamics observed in the two-layer neural network analysis capture the essential mismatch behavior that occurs in deeper networks and real-world image classification tasks.

What would settle it

The claim would be undermined if student robust accuracy stays high even when the teacher assigns confident labels to a large measured Robustly Unlearnable Set, or if teacher entropy on those samples shows no correlation with final student robustness on standard image benchmarks.
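
The correlation half of this check can be sketched in Python as below. This is an illustrative helper, not the paper's evaluation code. The two arrays are made-up placeholder numbers, chosen to be perfectly linear purely to exercise the function, and are not results from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Placeholder values only (hypothetical): one entry per candidate teacher.
teacher_entropy_on_unlearnable = [0.1, 0.4, 0.7, 1.0]
student_robust_accuracy = [0.40, 0.44, 0.48, 0.52]

r = pearson_r(teacher_entropy_on_unlearnable, student_robust_accuracy)
print(r)
```

A value of r near zero on real measurements would count against the claimed indicator. A significance test would also be needed in practice and is omitted here.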

Figures

Figures reproduced from arXiv: 2605.21999 by Hongsin Lee, Hye Won Chung.

**Figure 1.** Teacher-dependent dynamics of robust overfitting. (a) Standard adversarial training suffers from robust overfitting. (b) In self-distillation, the outcome is strictly determined by the teacher’s overfitting status. (c) Similarly, distillation from external teachers (specifically Gowal and Chen, detailed in …

**Figure 2.** Feature reconstruction analysis. Using robust feature inversion (details in Section A.4), we visualize the internal representations of the student model. The distinct gap between the clear semantic recovery of learnable samples (top) and the distorted artifacts or spurious features of unlearnable samples (bottom) provides visual evidence of the model’s inability to extract ground-truth aligned robust featu…

**Figure 3.** Consistency between synthetic and real-world dynamics. We compare the learning dynamics on synthetic data (top row) and CIFAR-10 (bottom row) under varying ratios of unlearnable samples. In both settings, Standard AT and distillation from a Bad Teacher suffer from severe robust overfitting as the unlearnable fraction increases. In contrast, the Good Teacher effectively suppresses noise memorization, mainta…

**Figure 4.** Failure of conventional heuristics and success of our criterion. (a) Teacher robust accuracy (AutoAttack) correlates negatively with student performance, failing as a selection metric. (b) In contrast, our Unlearnable-Entropy Criterion exhibits a strong positive correlation, effectively identifying high-quality teachers. The reported r and p values denote the Pearson correlation coefficient and its two-si…

**Figure 5.** Validation of the robust overfitting mechanism via the Random Label Test. (a) Randomizing labels within the unlearnable set ($S_U$) yields a robust test accuracy trajectory indistinguishable from the baseline Standard AT. This implies that the model memorizes $S_U$ as arbitrary noise patterns regardless of the semantic labels. (b) Conversely, randomizing an equivalent number of learnable samples ($S_L$) severely d…

**Figure 6.** Robustness decay profiles across perturbation budgets. We trace the accuracy of different sample subsets as the attack strength (ϵ) increases. The Consistent Learnable set (Blue) maintains stability, indicating robust feature learning. Crucially, the Forgotten (Overfitted) set (Red) shows high accuracy at ϵ = 0 but suffers a catastrophic drop as perturbation increases. This confirms that robust overfitting…

**Figure 7.** Comparison of teacher selection criteria. Top row: Existing selection heuristics based on (a) Clean Accuracy and (b) Robust Accuracy (AutoAttack) show no correlation with student robustness. Bottom row: Our proposed Unlearnable-Confidence Criterion consistently exhibits a strong correlation (r > 0.9) regardless of the reference model used for identification ((c) PGD-AT or (d) TRADES). This confirms that te…
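
The label-randomization step behind the Random Label Test in Figure 5 can be sketched as follows. This is a hypothetical helper under stated assumptions: integer class labels and a precomputed index set. The paper's actual procedure (for example, whether a resampled label may coincide with the original) is not specified in this page.

```python
import numpy as np

def randomize_labels(labels, subset_idx, num_classes, seed=0):
    """Return a copy of `labels` with entries at `subset_idx` redrawn uniformly at random."""
    rng = np.random.default_rng(seed)
    out = np.array(labels, copy=True)
    out[subset_idx] = rng.integers(0, num_classes, size=len(subset_idx))
    return out

# Hypothetical example: 8 samples, 10 classes, samples 2 and 5 assumed unlearnable.
labels = np.array([0, 1, 2, 3, 4, 5, 6, 7])
unlearnable_idx = np.array([2, 5])
new_labels = randomize_labels(labels, unlearnable_idx, num_classes=10)
print(new_labels)
```

Training on `new_labels` and comparing the robust test accuracy trajectory with the baseline would reproduce the shape of the experiment. Randomizing an equally sized learnable subset gives the control condition.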

Original abstract

Adversarial Distillation aims to enhance student robustness by guiding the student with a robust teacher's soft labels within the min-max adversarial training framework, yet its success is notoriously inconsistent: a more robust teacher often fails to improve, or even harms, the student's robust generalization. In this paper, we identify a key mechanism of this teacher dependency: the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data -- the Robustly Unlearnable Set. We present a theoretical framework analyzing the feature learning dynamics of a two-layer neural network, demonstrating that this mismatch creates a dichotomy in distillation outcomes. We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal, thereby driving robust overfitting. Conversely, a teacher that exhibits high uncertainty on these samples effectively suppresses noise memorization, allowing the student to rely solely on the learnable signal for robust generalization. We empirically validate our theory across both synthetic simulations and real-image classification datasets, confirming that robust overfitting is driven by the teacher's interaction with unlearnable samples. Finally, we demonstrate that a teacher's predictive entropy on unlearnable samples serves as a strong indicator of student robustness, validating our theoretical framework and offering a principled guideline for robust teacher selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pins inconsistent adversarial distillation on a Robustly Unlearnable Set where teacher confidence forces noise memorization in two-layer nets, with entropy as a predictor, but the jump to deep models needs more checks.

The core takeaway is that this work gives a concrete mechanism for why stronger robust teachers can sometimes produce weaker students in adversarial distillation. They define a Robustly Unlearnable Set of samples and show, in a two-layer network analysis, that confident teacher labels on those points push the student toward spurious noise that swamps the robust signal. High teacher uncertainty on the same set avoids that trap and lets the student stick to learnable features. That dichotomy is the new piece, and the claim that teacher predictive entropy on the unlearnable subset forecasts final student robustness is a practical hook that follows from the theory.

The synthetic simulations line up with the proof, and the real-image experiments show the entropy correlation holds on standard datasets. That combination of targeted theory and a usable diagnostic is what the paper does cleanly.

The soft spot is the scope of the theory. Everything is derived for two-layer networks under specific data and optimization assumptions, and the empirical section on deeper models only reports the entropy correlation without probes that would confirm noise memorization is the actual pathway. The stress-test concern is fair here: without those intermediate checks or controlled ablations, the observed teacher dependency could come from other factors like optimization differences. The math itself looks internally consistent within its setting, and the citations track the relevant distillation and robust training literature without obvious gaps.

This paper is aimed at people working on robust model training and distillation for safety-critical applications. A reader who wants a mechanistic story rather than another empirical tweak will find it useful. It is worth sending to peer review because the new concept and the entropy indicator are sharp enough to merit referee input, even if the deep-network generalization needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper claims that inconsistent outcomes in adversarial distillation arise because robust teachers provide confident supervision on a 'Robustly Unlearnable Set' of training samples; this forces the student to memorize spurious noise that overpowers the robust signal and produces robust overfitting. A theoretical analysis of feature learning dynamics in two-layer networks is used to prove a dichotomy: confident teacher labels on unlearnable samples drive noise memorization, while high uncertainty suppresses it and permits robust generalization. Empirical results on synthetic data and real-image classification tasks are presented to show that a teacher's predictive entropy on the unlearnable set is a strong predictor of the student's final robustness.

Significance. If the mechanism generalizes beyond the two-layer setting, the work supplies both a concrete explanation for why stronger teachers can harm student robustness and a practical selection criterion (predictive entropy on the identified unlearnable subset). The combination of a feature-learning proof and a falsifiable empirical indicator is a clear strength.

major comments (1)

[Theoretical framework] Theoretical framework (two-layer analysis): the feature-learning dynamics and resulting dichotomy are derived under assumptions specific to two-layer networks, data distribution, activation, and optimization. The manuscript does not provide intermediate-layer probes, controlled ablations, or transfer arguments showing that the same noise-memorization pathway governs behavior in the deeper architectures used for the image-classification experiments; without such evidence the link between the proven 2-layer mechanism and the observed teacher-dependency on real data remains unestablished.

minor comments (2)

[Empirical validation] The construction and identification procedure for the Robustly Unlearnable Set on high-dimensional image data should be stated more explicitly, including any data-exclusion criteria and sensitivity checks.
[Notation] Notation for the unlearnable-set indicator and the entropy measure could be introduced earlier and used consistently across theory and experiments.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and insightful comments. We address the major comment below and describe the revisions we intend to make.

Referee: Theoretical framework (two-layer analysis): the feature-learning dynamics and resulting dichotomy are derived under assumptions specific to two-layer networks, data distribution, activation, and optimization. The manuscript does not provide intermediate-layer probes, controlled ablations, or transfer arguments showing that the same noise-memorization pathway governs behavior in the deeper architectures used for the image-classification experiments; without such evidence the link between the proven 2-layer mechanism and the observed teacher-dependency on real data remains unestablished.

Authors: We agree that the two-layer analysis is conducted under simplifying assumptions and that the manuscript would benefit from a clearer bridge to the deeper architectures used in the image experiments. The two-layer setting was deliberately chosen to enable a rigorous proof of the feature-learning dichotomy (confident supervision on unlearnable samples drives noise memorization, while high uncertainty suppresses it). The empirical results then test the observable consequence of this mechanism: that a teacher’s predictive entropy on the identified unlearnable subset is a strong predictor of student robustness across standard deep models and real datasets. While we do not currently include layer-wise probes or explicit transfer proofs, this predictive correlation provides indirect but falsifiable support for the proposed pathway. In the revised manuscript we will (i) add a dedicated discussion subsection that situates the two-layer dynamics within the broader literature on feature learning in over-parameterized networks and (ii) include additional controlled ablations that monitor student behavior on the unlearnable subset when using deeper architectures, thereby strengthening the empirical link without altering the core theoretical contribution. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper derives its central dichotomy from a self-contained theoretical analysis of feature learning dynamics in a two-layer neural network, with the Robustly Unlearnable Set defined independently via the student's representational limitations rather than from the final robustness metric or fitted outcomes. The proof that confident teacher supervision compels noise memorization follows directly from the stated assumptions on data distribution, activation, and optimization without reducing to a tautology or renamed input. Empirical validation on synthetic and real-image data serves as external confirmation, not a statistically forced prediction. No load-bearing self-citations, uniqueness theorems imported from the authors, or ansatzes smuggled via prior work are present in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the new concept of the Robustly Unlearnable Set and the applicability of two-layer network dynamics to explain real distillation behavior.

axioms (1)

domain assumption: Feature learning dynamics of two-layer neural networks under min-max adversarial training capture the key mismatch between teacher confidence and student representational limits.
Invoked to derive the dichotomy in distillation outcomes for confident versus uncertain supervision on unlearnable samples.

invented entities (1)

Robustly Unlearnable Set (no independent evidence)
purpose: Consistent subset of training data where student representational limitations cause misalignment with teacher supervisory confidence
Postulated to explain inconsistent success of robust teachers in adversarial distillation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data -- the Robustly Unlearnable Set"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 4 internal anchors

[1]

Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., and Veit, A

URL https://openreview.net/forum?id=HQtTg1try7. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., and Veit, A. Understanding robustness of transformers for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10231–10241, 2021. Cao, Y., Chen, Z., Belkin, M., and Gu, Q. Benign overf...
[2]

Cui, J., Tian, Z., Zhong, Z., QI, X., Yu, B., and Zhang, H

URL https://openreview.net/forum?id=SSKZPJCt7B. Cui, J., Tian, Z., Zhong, Z., QI, X., Yu, B., and Zhang, H. Decoupled Kullback-Leibler divergence loss. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=bnZZedw9CM. Dai, S., Mahloujifar, S., and Mittal, P. Parameterizing activati...
[3]

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B., and Madry, A

URL https://openreview.net/forum?id=7gE9V9GBZaI. Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B., and Madry, A. Adversarial robustness as a prior for learned representations. arXiv preprint arXiv:1906.00945, 2019. Gao, R., Cai, T., Li, H., Hsieh, C.-J., Wang, L., and ...
[4]

URL https://openreview.net/forum?id=juKVq5dWTR. Li, B. and Li, Y. Adversarial training can provably improve robustness: Theoretical analysis of feature learning process under structured data. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=inLUnCpDIB. Li, B. and Li, Y. On the clean ...
[5]

Dickerson

URL https://openreview.net/forum?id=Sys6GJqxl. Lu, M., Wu, B., Yang, X., and Zou, D. Benign oscillation of stochastic gradient descent with large learning rate. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wYmvN3sQpG. Ma, X., Niu, Y., Gu, L., Wang, Y., Zhao, Y., Bailey, J., and Lu, F. U...
[6]

Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples

URL https://openreview.net/forum?id=8on9dIUh5v. Oh, J., Song, J., and Yun, C. From linear to nonlinear: Provable weak-to-strong generalization through feature learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=xMiKDqxEE8. Pang, T., Yang, X., Dong, Y., Su, H., and Zhu, J. ...
[7]

Intriguing properties of neural networks

IEEE, 2022. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL https://arxiv.org/abs/1312.6199. Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. The space of transferable adve...
[8]

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

URL https://openreview.net/forum?id=SyxAb30cY7. Uesato, J., O'Donoghue, B., Kohli, P., and Oord, A. Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, pp. 5025–5034. PMLR, 2018. Vaishnavi, P., Eykholt, K., and Rahmati, A. Transferring adversarial robustness through robust representat...
[9]

and AutoAttack (Croce & Hein, 2020), to directly maximize the loss. In contrast, the black-box setting assumes

Table 9. Sensitivity of robustly unlearnable samples to varying perturbation bounds. Unlike the baseline subset identification performed at a fixed attack budget, this exper...
[10]

Fix two orthogonal robust features: the learnable feature $u := e_1$ and the unlearnable feature $v := e_d$. Let $F := \operatorname{span}\{u, v\}$, and let $\Pi_F$ be the projection matrix onto this subspace. Also fix a partition of the training indices $[N]$ into $S_L$ and $S_U$, with $|S_L| = (1 - p_{\mathrm{un}})N$ and $|S_U| = p_{\mathrm{un}}N$.
[11]

For each $i \in [N]$, draw the label $y_i$ uniformly from $\{+1, -1\}$, and draw a signal patch index $s(X_i)$ uniformly from $[P]$. The signal patch is generated by

$$x_{i,s(X_i)} = \begin{cases} \alpha y_i u, & \text{if } i \in S_L \text{ (learnable)}, \\ \alpha y_i v, & \text{if } i \in S_U \text{ (unlearnable)}. \end{cases} \tag{22}$$

Table 11. Summary of notations used in the theoretical anal...
[12]

For each non-signal patch $p \neq s(X_i)$, draw independent Gaussian noise orthogonal to the feature subspace:

$$x_{i,p} \sim \mathcal{N}\big(0, \sigma_n^2 (I_d - \Pi_F)\big). \tag{23}$$

We analyze two regimes of the unlearnable ratio $p_{\mathrm{un}}$: (i) the learnable-only regime $p_{\mathrm{un}} = 0$ (so $S_U = \emptyset$), and (ii) the sparse-unlearnable regime $C N^{-1} \le p_{\mathrm{un}} \le C^{-1} N^{-1} \log d$ (so $C \le |S_U| \le C^{-1} \log d$). These two regimes lead to different adversarial training dynamics, which are treated separately in our theorems.
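
The patch data model in (22) and (23) can be sketched in NumPy as below. This is an illustrative generator, not the authors' code: the function name, the random choice of which indices form $S_U$, and all parameter values in the usage line are assumptions. It relies on $u = e_1$ and $v = e_d$, so projecting by $I_d - \Pi_F$ is the same as zeroing the first and last coordinates.

```python
import numpy as np

def make_patch_data(N, P, d, alpha, sigma_n, p_un, seed=0):
    """Sample N examples with P patches of dimension d following (22) and (23)."""
    rng = np.random.default_rng(seed)

    # Partition indices: |S_U| = p_un * N (rounded), chosen at random here (assumption).
    n_un = int(round(p_un * N))
    unlearnable = np.zeros(N, dtype=bool)
    unlearnable[rng.choice(N, size=n_un, replace=False)] = True

    y = rng.choice([-1, 1], size=N)   # labels uniform on {+1, -1}
    s = rng.integers(0, P, size=N)    # signal patch index uniform on [P]

    # Noise patches: N(0, sigma_n^2 (I_d - Pi_F)); zero the e_1 and e_d coordinates.
    X = sigma_n * rng.standard_normal((N, P, d))
    X[:, :, 0] = 0.0
    X[:, :, d - 1] = 0.0

    # Signal patch: alpha * y_i * u for learnable, alpha * y_i * v for unlearnable.
    idx = np.arange(N)
    X[idx, s, :] = 0.0
    feature_coord = np.where(unlearnable, d - 1, 0)
    X[idx, s, feature_coord] = alpha * y
    return X, y, s, unlearnable

X, y, s, unlearnable = make_patch_data(N=20, P=4, d=10, alpha=2.0, sigma_n=0.5, p_un=0.25, seed=1)
print(X.shape, int(unlearnable.sum()))
```

The returned boolean mask marks $S_U$, which makes it easy to vary the unlearnable ratio and compare regimes.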
[13]

Both teachers generalize well on learnable samples. In particular, for any $i \in S_L$, the generic teacher produces a large logit in the direction of the target label:

$$y_i f_{W_T}(X_i) \ge \Gamma. \tag{26}$$

Additionally, both teachers have weights orthogonal to noise patches. Their distinction lies only in their behavior on the unlearnable feature $v$.
[14]

Good Teacher ($f_{W_G}$). This teacher relies only on the learnable feature $u$ and is orthogonal to the unlearnable feature $v$. Consequently, for any unlearnable training sample indexed by $i \in S_U$,

$$y_i f_{W_G}(X_i) = 0. \tag{27}$$
[15]

Bad Teacher ($f_{W_B}$). This teacher is aligned with the unlearnable feature $v$. Consequently, for any unlearnable training sample indexed by $i \in S_U$,

$$y_i f_{W_B}(X_i) \ge \Gamma. \tag{28}$$

The terms Good Teacher and Bad Teacher refer to their compatibility with the capacity-constrained student. Both teachers are robust models. The Bad Teacher is called “bad” only because its confiden...
[16]

Following Li & Li (2025b), for a sample $(X, y)$, we define the training adversarial example $\tilde{X}$ as the solution to

$$\tilde{X} = \arg\max_{X'} \ell\big(y f_W(X')\big) \quad \text{s.t.} \quad \|X' - X\|_\infty \le \epsilon, \;\; X' - X \in \operatorname{span}\{x_{s(X)}\}. \tag{29}$$

Thus, the adversary perturbs only the signal patch along its original direction. In particular, for a learnable sample with signal patch $x_{s(X)} = y\alpha u$, the adversarially perturbed si...
[17]

Standard AT minimizes the logistic loss on the generated adversarial example:

$$\mathcal{L}_{\mathrm{AT}}(W; X, y) = \ell\big(y f_W(\tilde{X})\big). \tag{31}$$
[18]

In AD, we leverage a fixed robust teacher network $f_{W_T}$ to provide soft targets for the student. Following the standard AD setup (e.g., RSLAD (Zi et al., 2021)), the student matches the teacher’s output distribution. In the binary margin-based formulation, this yields the weighted logistic loss:

$$\mathcal{L}_{\mathrm{AD}}(W; X, y) = \sigma\big(y f_{W_T}(X)\big)\, \ell\big(y f_W(\tilde{X})\big) + \sigma\big(-y f_{W_T}(X)\big)\, \ell\big(-y f_W(\tilde{X})\big). \tag{32}$$
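
A small numerical sketch of the weighted logistic loss (32) may help show why teacher confidence matters. It assumes the standard forms $\ell(z) = \log(1 + e^{-z})$ and $\sigma(z) = 1/(1 + e^{-z})$. The function names and the example margins are hypothetical, not from the paper.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used as the teacher's soft-label weight."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(z):
    """log(1 + exp(-z)), computed stably (assumed form of the loss ell)."""
    return np.logaddexp(0.0, -z)

def ad_loss(y, teacher_logit_clean, student_logit_adv):
    """Per-sample loss (32): teacher-weighted logistic loss on the student's adversarial margin."""
    t = y * teacher_logit_clean   # teacher margin on the clean input
    m = y * student_logit_adv     # student margin on the adversarial input
    return sigmoid(t) * logistic_loss(m) + sigmoid(-t) * logistic_loss(-m)

# Teacher margin 0, as in (27): the loss is symmetric in the student margin and
# is smallest at margin 0, so there is no pressure to fit the sample.
print(ad_loss(1, 0.0, 0.0), ad_loss(1, 0.0, 2.0))

# Large teacher margin, as in (28): the loss rewards a large student margin.
print(ad_loss(1, 10.0, 0.0), ad_loss(1, 10.0, 2.0))
```

With teacher margin 0 the two weights are both 1/2, which is one way to read the claim that an uncertain teacher removes the incentive to memorize noise on unlearnable samples.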
[19]

We optimize the empirical risk by gradient descent. At each iteration $t$, the adversarial examples $\{\tilde{X}_i^{(t)}\}_{i=1}^N$ are generated using the current model $W^{(t)}$, and the parameters are updated by

$$W^{(t+1)} = W^{(t)} - \frac{\eta}{N} \sum_{i \in [N]} \nabla_W \mathcal{L}\big(W^{(t)}; X_i, y_i\big), \tag{33}$$

where $\mathcal{L} \in \{\mathcal{L}_{\mathrm{AT}}, \mathcal{L}_{\mathrm{AD}}\}$ and $\eta > 0$ is the learning rate. We analyze the student after $T$ iterations.

D.6. Parameter and Regi...
[20]

Uniform bounds on noise patches:

$$\tfrac{1}{2}\sigma_n^2 d \le \|x_{i,j}\|_2^2 \le \tfrac{3}{2}\sigma_n^2 d. \tag{P1}$$

$$|\langle x_{i,j}, x_{k,q}\rangle| \le 2\sigma_n^2 \sqrt{d \log\frac{16N^2P^2}{\delta}}. \tag{P2}$$

$$\|x_{i,j}\|_\infty \le \sigma_n \sqrt{2\log\frac{16dNP}{\delta}}. \tag{P3}$$
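
As a quick sanity check on (P1), the following sketch samples noise patches from the model in (23) and tests whether their squared norms fall in the stated band. The dimension, noise scale, sample count, and seed are arbitrary illustrative choices, and a single simulation is only a plausibility check, not a substitute for the high-probability proof.

```python
import numpy as np

def sample_noise_patches(n, d, sigma_n, seed=0):
    """Draw n patches from N(0, sigma_n^2 (I_d - Pi_F)) with F = span{e_1, e_d}."""
    rng = np.random.default_rng(seed)
    x = sigma_n * rng.standard_normal((n, d))
    x[:, 0] = 0.0    # remove the e_1 component
    x[:, -1] = 0.0   # remove the e_d component
    return x

d, sigma_n = 2000, 0.5
patches = sample_noise_patches(n=50, d=d, sigma_n=sigma_n)
sq_norms = np.sum(patches ** 2, axis=1)

lower, upper = 0.5 * sigma_n ** 2 * d, 1.5 * sigma_n ** 2 * d
p1_holds = bool(np.all((lower <= sq_norms) & (sq_norms <= upper)))
print(p1_holds)
```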
[21]

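Uniform bounds on initialization:

$$\|w_r^{(0)}\|_2 \le 2\sigma_0\sqrt{d}. \tag{P4}$$

$$\langle w_r^{(0)}, e_1\rangle \le \sigma_0\sqrt{2\log\frac{16m}{\delta}}. \tag{P5}$$

$$\langle w_r^{(0)}, x_{i,j}\rangle \le 2\sigma_0\sigma_n\sqrt{d\log\frac{16NmP}{\delta}}. \tag{P6}$$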
[22]

Maximum initialization margins:

$$\tfrac{1}{2}\sigma_0 \le \max_{r \in [m]} \langle w_r^{(0)}, e_1\rangle \le \sigma_0\sqrt{2\log\frac{16m}{\delta}}. \tag{P7}$$

$$\tfrac{1}{4}\sigma_0\sigma_n\sqrt{d} \le \max_{r \in [m]} y_i\langle w_r^{(0)}, x_{i,j}\rangle \le 2\sigma_0\sigma_n\sqrt{d\log\frac{16NmP}{\delta}}. \tag{P8}$$

Then, the event $E$ occurs with probability at least $1 - \delta$.

Proof of Lemma E.1. We prove that (P1)–(P6) and the lower bounds in (P7) and (P8) each fail with probability at most $\delta/8$. The upper bounds in (P...
[23]

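$\le \psi\big(y_i f_{W^{(t)}}(\tilde{X}_i^{(t)})\big) \le 1.$ (74)

Proof of Lemma F.4. Fix any learnable sample $i \in S_L$. We decompose the margin into the signal-patch contribution and the noise-patch residual:

$$y_i f_{W^{(t)}}(\tilde{X}_i^{(t)}) = \big(y_i\langle e_1, \tilde{x}^{(t)}_{i,s(X_i)}\rangle\big)^3 \sum_{r \in [m]} \big(w^{(t)}_{r,1}\big)^3 + \sum_{r \in [m]} \sum_{j \neq s(X_i)} y_i \langle w_r^{(t)}, x_{i,j}\rangle^3. \tag{75}$$

By the perturbation constraint, $0 \le y_i\langle e_1, \tilde{x}^{(t)}_{i,s(X_i)}\rangle \le \alpha$. Using the assu...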
[24]

(81) Together with the trivial upper bound $\psi(\cdot) \le 1$, this proves the claimed gradient bound on learnable samples.

Lemma F.5 (Signal Learning Time). We formalize the signal growth statement in Lemma 4.9. Let $T_0$ be the first hitting time at which the maximum signal component reaches the t...
[25]

We track the evolution of this filter

(86) This leads to the following recursive bounds:

$$w^{(t+1)}_{r,1} \le w^{(t)}_{r,1} + A\big(w^{(t)}_{r,1}\big)^2, \qquad w^{(t+1)}_{r,1} \ge w^{(t)}_{r,1} + B\big(w^{(t)}_{r,1}\big)^2. \tag{87}$$

On the event $E$, the upper bound in (P7) with condition (C4) guarantees that the initial point is small enough compared to the target threshold, $\max_{r \in [m]} w^{(0)}_{r,1} \le \sigma_0\sqrt{2\log\frac{16m}{\delta}} \le \frac{1}{2}\cdot\frac{C_0}{\alpha m^{1/3}}$, and the lower bound in (P7...
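
The quadratic recursion in (87) can be iterated numerically to see how the hitting time depends on the initialization. The sketch below uses a single hypothetical growth constant `c` in place of the paper's $A$ and $B$, whose definitions are not shown in this excerpt, and arbitrary illustrative values for the starting point and threshold.

```python
def hitting_time(w0, c, threshold, max_iters=10_000_000):
    """Number of steps of w <- w + c * w**2 needed for w to reach `threshold`."""
    w, t = w0, 0
    while w < threshold:
        if t >= max_iters:
            raise RuntimeError("threshold not reached within max_iters")
        w = w + c * w * w
        t += 1
    return t

# Hypothetical values: small initialization, growth constant 0.1, target scale 1.
T_small_init = hitting_time(w0=0.01, c=0.1, threshold=1.0)
T_smaller_init = hitting_time(w0=0.005, c=0.1, threshold=1.0)
print(T_small_init, T_smaller_init)
```

For this recursion, $1/w_t - 1/w_{t+1} = c/(1 + c\,w_t)$, so the hitting time $T$ satisfies $(1/w_0 - 1/\text{threshold})/c \le T \le (1 + c\cdot\text{threshold})/(c\,w_0)$. It therefore scales like $1/(c\,w_0)$, in line with the inverse dependence on $\sigma_0$ in the stated bound on $T_0$.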
[26]

$\log\frac{16NmP}{\delta} \cdot \frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{\mathrm{un}})}.$ (94)

In particular, if $T_0$ denotes the hitting time from Lemma F.5, then $T_0 \le \hat{T}_0$, so the same bound holds for all $t \le T_0$.

Proof of Lemma F.6. Define the threshold $\rho_{\mathrm{th}} := 100\exp(2C_0^3$
[27]

$\log\frac{16NmP}{\delta} \cdot \frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{\mathrm{un}})}.$ (95)

Assume for contradiction that there exists an iteration $t \le \hat{T}_0$ such that $t$ is the first iteration where

$$\max_{i,j,r} \rho^{(t)}_{i,j,r} > \rho_{\mathrm{th}}. \tag{96}$$

Then, by the minimality of $t$, we have $\rho^{(\tau)}_{i,j,r} \le \rho_{\mathrm{th}}$ for all $\tau < t$ and all $i, j, r$. Fix any iteration $\tau < t$. By the decomposition in Lemma F.2, the triangle inequality and the non...
[28]

$\log\frac{16NmP}{\delta} \cdot \frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{\mathrm{un}})} \cdot 2\sigma_n^2 d \le \sqrt{\tfrac{20}{3}}\,\sigma_0\sigma_n\sqrt{d\log\frac{16NmP}{\delta}},$ (99)

where the second inequality follows from condition (C2), and the final inequality holds under conditions (C3) and (C6). Summing the update rule from Lemma F.2 over $\tau = 0, \ldots, t-1$ and using $\psi\big(y_i f_{W^{(\tau)}}(\tilde{X}_i^{(\tau)})\big) \le 1$, we obtain $\rho^{(t)}_{i,j,r} = \sum_{\tau=0}^{t-1} \frac{3\eta}{N}\,\psi\big(y_i f_{W^{(\tau)}}(\tilde{X}_i^{(\tau)}$...
[29]

$\log\frac{16NmP}{\delta} \cdot \frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{\mathrm{un}})}.$ (102)

F.4. Post-$T_0$ Control on Learnable Samples

Once the signal reaches its target scale at time $T_0$, learnable samples no longer drive large updates. We show that the signal weights remain bounded and that the cumulative gradient contribution from learnable samples is controlled.

Lemma F.7 (Signal Weight Bound). On t...
[30]

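$$\log\frac{16NmP}{\delta} \cdot \frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{\mathrm{un}})} + \frac{18 m^{2/3}\log^{1/3}T}{C_0^2\big(1-\frac{\epsilon}{\alpha}\big)^3 N\alpha^2}\left(18\sigma_0^2\sigma_n^2 d\log\frac{16NmP}{\delta} + 18 p_{\mathrm{un}}^2 N^2 P^2 \big(\rho^{S_U}_{\max}\big)^2 \sigma_n^4 d\log\frac{16N^2P^2}{\delta}\right). \tag{175}$$

Using conditions (C6) and (C7), for a sufficiently large constant $C$, the threshold expansion satisfies $\rho_{\mathrm{th}} \le C\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N\alpha^3} + \frac{C m^{2/3}\sigma_0^2\sigma_n^2 d\,\log^{1/3}T\,\log\frac{16NmP}{\delta}}{N\alpha^2} + C p_{\mathrm{un}}^2 N P^2 m^{2/3}\big(\rho^{S_U}_{\max}\big)^2$ ...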
[31]

$\log\frac{16NmP}{\delta} \cdot \frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{\mathrm{un}})} \le \rho_{\mathrm{th}}$ (183)

for every $i \in S_L$, every $j \neq s(X_i)$, and every $r \in [m]$. It therefore remains to consider the regime $t + 1 > T_0$. In this case, for every $\tau \in [T_0, t]$, the induction hypothesis yields

$$\rho^{(\tau)}_{i,j,r} \le \rho_{\mathrm{th}} \quad \text{for all } i \in S_L,\; j \neq s(X_i),\; r \in [m]. \tag{184}$$

Fix a...
[32]

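$$\log\frac{16NmP}{\delta} \cdot \frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{\mathrm{un}})} + \frac{18 m^{2/3}\log^{1/3}T}{C_0^2\big(1-\frac{\epsilon}{\alpha}\big)^3 N\alpha^2}\left(18\sigma_0^2\sigma_n^2 d\log\frac{16NmP}{\delta} + 18 p_{\mathrm{un}}^2 N^2 P^2 \big(\rho^{S_U}_{\max}\big)^2 \sigma_n^4 d\log\frac{16N^2P^2}{\delta}\right) = \rho_{\mathrm{th}}. \tag{193}$$

This proves $\rho^{(t+1)}_{i,j,r} \le \rho_{\mathrm{th}}$. Therefore, by (181), ρ(t) i,j,r ≤ σ0 σn √ d + punN σ2 n √ d 100 log2/3 T ρSU max 2 .(194) This proves (171). By (174), (171) gives ρ(t) i,j,r ≤ σ0 σn √ d + punN σ2nd3/...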
[33]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}$. (198)

Consequently, for every $i\in[N]$, $j\ne s(X_i)$, $r\in[m]$, and $t\le T$,

$\langle w_r^{(t)}, x_{i,j}\rangle \le 3\sigma_0\sigma_n\sqrt{d\log\frac{16NmP}{\delta}}$. (199)

Proof of Lemma G.1. If $t\le T_0$, the claim follows directly from Lemma F.6. Thus it remains to consider $t\in(T_0,T]$. Define the threshold $\rho_{th} := \max_{i,j,r}\rho^{(T_0)}_{i,j,r} + 30m^{2/3}\log\frac{16NmP}{\delta}\,\log^{1/3}T\;\sigma_0^2\sigma_n^2 d \,\big/\, \big(C_0^2\left(1-\frac{\epsilon}{\alpha}\right)^3 N$ ...

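[34]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}$. (201)

Therefore,

$\rho_{th} \le 100\exp(2C_0^3)\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3} + \frac{30m^{2/3}\log\frac{16NmP}{\delta}\,\log^{1/3}T\,\sigma_0^2\sigma_n^2 d}{C_0^2\left(1-\frac{\epsilon}{\alpha}\right)^3 N\alpha^2}$

$= \log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}\left(100\exp(2C_0^3) + \frac{30m^{2/3}\alpha\sigma_0\log^{1/3}T}{C_0^2}\right)$

$\le 200\exp(2C_0^3)\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}$, (202)

where the last inequality follows from condition (C4). Assume for contradiction that there exists an iteration $t\in(T_0,T]$ such that $t$ is the first iteration where

$\max_{i,j,r}\rho^{(t)}_{i,j,r} > \rho_{th}$. (203)

Then, by the minimality of $t$, we have $\rho^{(\tau)}_{i,j,r}\le\rho_{th}$ for all $\tau\in[T_0,t-1]$ and all $i,j,r$. Fix any iteration $\tau<t$. By the dec...
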
[38]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}$. (209)

Combining this bound for $t\in(T_0,T]$ with Lemma F.6 for $t\le T_0$ proves the claimed coefficient bound for all $t\le T$. It remains to record the corresponding noise-response bound. The inner-product estimate above used only the event bounds (P1), (P2), (P6), and a uniform coefficient bound by $\rho_{th}$. Therefore, after the coefficient...

[39]

There exists at least one filter $r\in[m]$ whose first coordinate is aligned with the learnable feature: $w^{(T)}_{r,1}\ge\tilde\Omega(\alpha^{-1})$. (211)

[40]

The network barely memorizes the noise features. More precisely, for every filter $r\in[m]$ and every noise patch $(i,j)$ with $i\in[N]$ and $j\ne s(X_i)$,

$\langle w^{(T)}_r, x_{i,j}\rangle \le \tilde O(\sigma_0\sigma_n\sqrt d)$. (212)

Consequently, both the robust training error and robust test error are $o(1)$.

Proof of Theorem G.2. Step 1: Signal Alignment. By Lemma F.5, after $T_0\le\frac{5\exp(2C_0^3)}{\eta(\alpha-\epsilon)^3\sigma_0}$ iterations,...

[41]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^3 d^{3/2}}{(\alpha-\epsilon)^3} \le 2\sigma_0\sqrt d + 300NP\exp(2C_0^3)\sigma_0 \le 3\sigma_0\sqrt d$, (219)

where the second inequality follows from conditions (C3) and (C6), and the last inequality follows from condition (C2). Therefore,

$\|z^{(T)}_r\|_1 \le \sqrt d\,\|z^{(T)}_r\|_2 \le 3\sigma_0 d$ for all $r\in[m]$. (220)

Step 4.2: Tail r...

[42]

... $\le \psi\big(y_i f_{W^{(t)}}(\tilde X_i^{(t)})\big) \le 1$. (249)

Proof of Lemma G.3. Consider an unlearnable sample $i\in\mathcal{S}_U$. Its signal patch is aligned with $v = e_d$, while the weights remain orthogonal to $v$. Hence the signal-patch contribution vanishes, and the margin is determined entirely by the noise patches:

$y_i f_{W^{(t)}}(\tilde X_i^{(t)}) = \sum_{r\in[m]}\sum_{j\ne s(X_i)} y_i\langle w_r^{(t)}, x_{i,j}\rangle^3$. (250)

On the event $E$...

[43]

... (266)

Together with the trivial upper bound $\psi(\cdot)\le 1$, this proves the claimed gradient bound on unlearnable samples.

Lemma G.4 (Dynamics of the Maximum Shifted Noise Coefficient). On the event $E$ defined in Lemma E.1, suppose that the maximum shifted noise coefficient $\hat\rho^{(t)}_{\max}$ defined in Lemma G.3 satisfies $\hat\rho^{(t)}_{\max}\le\frac{C_1}{(mP)^{1/3}\sigma_n^2 d}$. Then the discrete dynami...

[44]

... (292)

Therefore,

$\rho^{(t+1)}_{i_t,j_t,r_t} - \rho^{(t)}_{i_t,j_t,r_t} \ge \frac{3\eta}{N}\cdot\frac{1}{1+\exp(4C_1^3)}\left(\frac13\hat\rho^{(t)}_{\max}\sigma_n^2 d\right)^2 = \frac{\eta\sigma_n^4 d^2}{3N\big(1+\exp(4C_1^3)\big)}\big(\hat\rho^{(t)}_{\max}\big)^2$. (293)

Hence,

$\hat\rho^{(t+1)}_{\max} \ge \hat\rho^{(t)}_{\max} + \frac{\eta\sigma_n^4 d^2}{3N\big(1+\exp(4C_1^3)\big)}\big(\hat\rho^{(t)}_{\max}\big)^2$. (294)

This proves the stated dynamics for the maximum shifted noise coefficient.

Lemma G.5 (Noise Memorization Time). We formalize the noise...

[45]

There exists at least one filter $r\in[m]$ whose first coordinate is aligned with the learnable feature: $w^{(T)}_{r,1}\ge\tilde\Omega(\alpha^{-1})$. (317)

[46]

The network memorizes the noise features on at least one unlearnable sample. More precisely, there exists an index $i\in\mathcal{S}_U$ such that

$\max_{r\in[m],\ j\ne s(X_i)} y_i\langle w^{(T)}_r, x_{i,j}\rangle \ge \tilde\Omega(1)$. (318)

Consequently, the robust training error is $o(1)$, while the robust test error is at least $\frac12 - o(1)$.

Proof of Theorem G.6. Step 1: Signal Alignment. By Lemma F.5, after $T_0\le 5$ ...

[47]

There exists at least one filter $r\in[m]$ whose first coordinate is aligned with the learnable feature: $w^{(T)}_{r,1}\ge\tilde\Omega(\alpha^{-1})$. (372)

[48]

The network memorizes the noise features on at least one unlearnable sample. More precisely, there exists an index $i\in\mathcal{S}_U$ such that

$\max_{r\in[m],\ j\ne s(X_i)} y_i\langle w^{(T)}_r, x_{i,j}\rangle \ge \tilde\Omega(1)$. (373)

Consequently, the robust training error is $o(1)$, while the robust test error is at least $\frac12 - o(1)$.

Proof of Theorem H.3. By Lemma H.1, on learnable samples, the AD gradient ...

[49]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}$. (376)

For unlearnable samples,

$\max_{i\in\mathcal{S}_U,\ j\ne s(X_i),\ r\in[m]}\rho^{(t)}_{i,j,r}\le\frac{\sigma_0\sqrt{\log\frac{16NmP}{\delta}}}{\sigma_n\sqrt d}$. (377)

See Lemma H.7 for the verification of this hypothesis.

Lemma H.5 (Inner Product Bound under a Good Teacher). Consider Adversarial Distillation under a Good Teacher. On the event $E$ defined in Lemma E.1, suppose that Hypothes...

[50]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}\cdot\frac32\sigma_n^2 d + (1-p_{un})NP\cdot 200\exp(2C_0^3)\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}\cdot 2\sigma_n^2\sqrt{d\log\frac{16N^2P^2}{\delta}} + p_{un}NP\cdot\frac{\sigma_0\sqrt{\log\frac{16NmP}{\delta}}}{\sigma_n\sqrt d}\cdot 2\sigma_n^2\sqrt{d\log\frac{16N^2P^2}{\delta}}$. (381)

By conditions (C3), (C4) and (C6), the second and third terms are strictly bounded by

$\frac14\sigma_0\sigma_n\sqrt{d\log\frac{16NmP}{\delta}}$. (382)

Moreover, by conditions (C2) and (C7), ...

[52]

... $\le \psi\big(y_i f_{W^{(t)}}(\tilde X_i^{(t)})\big) \le 1$. (386)

Second, the hitting time $T_0 := \min\big\{t\ge 0 : \max_{r\in[m]} w^{(t)}_{r,1} > \frac{C_0}{\alpha m^{1/3}}\big\}$ satisfies

$\frac{1}{12\sqrt{2\log\frac{16m}{\delta}}\,\eta\alpha^3\sigma_0} \le T_0 \le \frac{5\exp(2C_0^3)}{\eta(\alpha-\epsilon)^3(1-p_{un})\sigma_0}$. (387)

Third, for every $t\le T_0$,

$\max_{i\in\mathcal{S}_L,\ j\ne s(X_i),\ r\in[m]}\rho^{(t)}_{i,j,r} \le 100\exp(2C_0^3)\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{un})}$. (388)

Fourth, for every $t\le T$ and every $r\in[m]$,

$w^{(t)}_{r,1}\le\frac{3\log^{1/3}T}{\alpha}$. (389)

Finally, for every $t\in[T_0+1,T]$ and every learnable sample $i\in\mathcal{S}_L$,

$\psi\big(y_i f_{W^{(t)}}(\tilde X_i^{(t)})\big) \le \frac{4m^{2/3}\log^{1/3}T}{C_0^2\left(1-\frac{\epsilon}{\alpha}\right)^3\eta\alpha^2(t-T_0)}$. (390)

Proof of Lemma H.6. By Lemma H.5, the assumption Hypothesis H.4 implies that the conclusion of Hypothesis F...

[54]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{un})}$. (392)

Suppose for contradiction that there exists $t\le T_0$ such that $t$ is the first iteration satisfying

$\max_{i\in\mathcal{S}_L,\ j\ne s(X_i),\ r\in[m]}\rho^{(t)}_{i,j,r} > \rho_{th}$. (393)

Then, by the minimality of $t$,

$\max_{i\in\mathcal{S}_L,\ j\ne s(X_i),\ r\in[m]}\rho^{(\tau)}_{i,j,r} \le \rho_{th}$ for all $\tau<t$. (394)

Under this bound, together with the unlearnable-sample coefficient bound in H...

[55]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}\cdot 2\sigma_n^2\sqrt{d\log\frac{16N^2P^2}{\delta}} + p_{un}NP\cdot\frac{\sigma_0\sqrt{\log\frac{16NmP}{\delta}}}{\sigma_n\sqrt d}\cdot 2\sigma_n^2\sqrt{d\log\frac{16N^2P^2}{\delta}}$. (402)

By conditions (C3), (C4) and (C6), the learnable cross term satisfies

$(1-p_{un})NP\cdot 200\exp(2C_0^3)\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}\cdot 2\sigma_n^2\sqrt{d\log\frac{16N^2P^2}{\delta}} \le \frac14\sigma_0\sigma_n\sqrt{d\log\frac{16NmP}{\delta}}$. (403)

Similarly, by conditions (C2), (C7) and (C8), the unlearnable cross term satisfies

$p_{un}NP\cdot\frac{\sigma_0\sqrt{\log\frac{16NmP}{\delta}}}{\sigma_n\sqrt d}\cdot 2\sigma_n^2\sqrt{d\log\frac{16N^2P^2}{\delta}} \le \frac14\sigma_0\sigma_n\sqrt{d\log\frac{16NmP}{\delta}}$. (404)

Combining this with the initialization term and the self term gives, for every $\tau<t$, $\langle w^{(\tau)}$ ...

[57]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{un})}$. (411)

By condition (C7), this is bounded by the claimed learnable-sample threshold. Hence it remains to consider $t\in(T_0,T]$. Define

$\rho_{th} := 200\exp(2C_0^3)\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}$. (412)

Fix any $\tau<t$, any $r\in[m]$, any $i\in\mathcal{S}_L$, and any $j\ne s(X_i)$. By the decomposition in Lemma F.2,

$\langle w_r^{(\tau)},x_{i,j}\rangle \le \langle w_r^{(0)},x_{i,j}\rangle + \rho^{(\tau)}_{i,j,r}\|x_{i,j}\|_2^2 + \sum_{\substack{k\in\mathcal{S}_L,\ q\ne s(X_k)\\ (k,q)\ne(i,j)}}\rho^{(\tau)}_{k,q,r}\,|\langle x_{k,q},x_{i,j}\rangle| + \sum_{k\in\mathcal{S}_U,\ q\ne s(X_k)}\rho^{(\tau)}_{k,q,r}\,|\langle x_{k,q},x_{i,j}\rangle|$. (413)

Applying the learnable-sample bound $\rho^{(\tau)}_{k,q,r}\le\rho_{th}$, the ass...

[59]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{un})} + \frac{108m^{2/3}\log^{1/3}T\cdot\sigma_0^2\sigma_n^2 d\log\frac{16NmP}{\delta}}{C_0^2\left(1-\frac{\epsilon}{\alpha}\right)^3 N\alpha^2}$. (421)

Under conditions (C4) and (C7), the two terms on the right-hand side are together bounded by $\rho_{th}$. Therefore,

$\rho^{(t)}_{i,j,r}\le\rho_{th}$, (422)

which contradicts the assumption that $t$ is the first iteration where the learnable-sample bound fails. Hence the learn...

[60]

There exists at least one filter $r\in[m]$ whose first coordinate is aligned with the learnable feature: $w^{(T)}_{r,1}\ge\tilde\Omega(\alpha^{-1})$. (424)

[61]

The network barely memorizes the noise features. More precisely, for every filter $r\in[m]$ and every noise patch $(i,j)$ with $i\in[N]$ and $j\ne s(X_i)$,

$\langle w^{(T)}_r, x_{i,j}\rangle \le \tilde O(\sigma_0\sigma_n\sqrt d)$. (425)

Consequently, both the robust training error and robust test error are $o(1)$.

Proof of Theorem H.8. We compar...

[62]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3}\cdot\sqrt{\tfrac32}\,\sigma_n\sqrt d + p_{un}NP\cdot\frac{\sigma_0\sqrt{\log\frac{16NmP}{\delta}}}{\sigma_n\sqrt d}\cdot\sqrt{\tfrac32}\,\sigma_n\sqrt d$. (439)

The learnable contribution is bounded as in Theorem G.2, and the unlearnable contribution is bounded by conditions (C2), (C7) and (C8). Therefore,

$\|z^{(T)}_r\|_2 \le 3\sigma_0\sqrt d$ for all $r\in[m]$. (440)

At this point, the two structural conditions required in Steps 4.2 ...

[1]

URL https://openreview.net/forum?id=HQtTg1try7.

Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., and Veit, A. Understanding robustness of transformers for image classification. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10231–10241, 2021.

Cao, Y., Chen, Z., Belkin, M., and Gu, Q. Benign overf...

[2]

URL https://openreview.net/forum?id=SSKZPJCt7B.

Cui, J., Tian, Z., Zhong, Z., QI, X., Yu, B., and Zhang, H. Decoupled kullback-leibler divergence loss. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=bnZZedw9CM.

Dai, S., Mahloujifar, S., and Mittal, P. Parameterizing activati...

[3]

URL https://openreview.net/forum?id=7gE9V9GBZaI.

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B., and Madry, A. Adversarial robustness as a prior for learned representations. arXiv preprint arXiv:1906.00945, 2019.

Gao, R., Cai, T., Li, H., Hsieh, C.-J., Wang, L., and ...

[4]

URL https://openreview.net/forum?id=juKVq5dWTR.

Li, B. and Li, Y. Adversarial training can provably improve robustness: Theoretical analysis of feature learning process under structured data. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=inLUnCpDIB.

Li, B. and Li, Y. On the clean ...

[5]

Dickerson

URL https://openreview.net/forum?id=Sys6GJqxl.

Lu, M., Wu, B., Yang, X., and Zou, D. Benign oscillation of stochastic gradient descent with large learning rate. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wYmvN3sQpG.

Ma, X., Niu, Y., Gu, L., Wang, Y., Zhao, Y., Bailey, J., and Lu, F. U...

[6]

Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples

URL https://openreview.net/forum?id=8on9dIUh5v.

Oh, J., Song, J., and Yun, C. From linear to nonlinear: Provable weak-to-strong generalization through feature learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=xMiKDqxEE8.

Pang, T., Yang, X., Dong, Y., Su, H., and Zhu, J. ...

[7]

Intriguing properties of neural networks

IEEE, 2022.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL https://arxiv.org/abs/1312.6199.

Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. The space of transferable adve...

[8]

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

URL https://openreview.net/forum?id=SyxAb30cY7.

Uesato, J., O’donoghue, B., Kohli, P., and Oord, A. Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, pp. 5025–5034. PMLR, 2018.

Vaishnavi, P., Eykholt, K., and Rahmati, A. Transferring adversarial robustness through robust representat...

[9]

and AutoAttack (Croce & Hein, 2020), to directly maximize the loss. In contrast, the black-box setting assumes

Table 9. Sensitivity of robustly unlearnable samples to varying perturbation bounds. Unlike the baseline subset identification performed at a fixed attack budget, this exper...

[10]

Fix two orthogonal robust features: the learnable feature $u := e_1$ and the unlearnable feature $v := e_d$. Let $F := \mathrm{span}\{u,v\}$, and let $\Pi_F$ be the projection matrix onto this subspace. Also fix a partition of the training indices $[N]$ into $\mathcal{S}_L$ and $\mathcal{S}_U$, with $|\mathcal{S}_L| = (1-p_{un})N$ and $|\mathcal{S}_U| = p_{un}N$.

[11]

For each $i\in[N]$, draw the label $y_i$ uniformly from $\{+1,-1\}$, and draw a signal patch index $s(X_i)$ uniformly from $[P]$. The signal patch is generated by

$x_{i,s(X_i)} = \begin{cases}\alpha y_i u, & \text{if } i\in\mathcal{S}_L \text{ (learnable)},\\ \alpha y_i v, & \text{if } i\in\mathcal{S}_U \text{ (unlearnable)}.\end{cases}$ (22)

Table 11. Summary of notations used in the theoretical anal...

[12]

For each non-signal patch $p\ne s(X_i)$, draw independent Gaussian noise orthogonal to the feature subspace:

$x_{i,p}\sim\mathcal{N}\big(0,\ \sigma_n^2(I_d-\Pi_F)\big)$. (23)

We analyze two regimes of the unlearnable ratio $p_{un}$: (i) the learnable-only regime $p_{un}=0$ (so $\mathcal{S}_U=\emptyset$), and (ii) the sparse-unlearnable regime $CN^{-1}\le p_{un}\le C^{-1}N^{-1}\log d$ (so $C\le|\mathcal{S}_U|\le C^{-1}\log d$). These two regimes lead to different adversarial training dynamics, which are treated separately in our theorems.
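The sampling procedure in (22) and (23) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the function name `make_sample` and its arguments are hypothetical, coordinates are 0-indexed (so $u = e_1$ is index 0 and $v = e_d$ is index `d - 1`), and projecting out $F$ is implemented by zeroing those two coordinates, which agrees with $I_d - \Pi_F$ for this particular choice of $u$ and $v$.

```python
import random

def make_sample(d, P, alpha, sigma_n, learnable, rng):
    """Draw one sample (X, y, s): X is a list of P patches, each a list of d floats."""
    y = rng.choice([+1, -1])              # label, uniform on {+1, -1}
    s = rng.randrange(P)                  # signal patch index, uniform on [P]
    feature = 0 if learnable else d - 1   # u = e_1 (learnable) or v = e_d (unlearnable)
    X = []
    for p in range(P):
        if p == s:
            patch = [0.0] * d
            patch[feature] = alpha * y    # signal patch: alpha * y * u or alpha * y * v
        else:
            # Noise ~ N(0, sigma_n^2 (I_d - Pi_F)): Gaussian on the middle
            # coordinates, exactly zero along u and v.
            patch = [0.0] + [rng.gauss(0.0, sigma_n) for _ in range(d - 2)] + [0.0]
        X.append(patch)
    return X, y, s

rng = random.Random(0)
# One unlearnable sample with hypothetical toy parameters.
X, y, s = make_sample(d=8, P=4, alpha=2.0, sigma_n=0.1, learnable=False, rng=rng)
```

Passing `learnable=True` places the signal on the first coordinate instead; a training set under the partition $\mathcal{S}_L$, $\mathcal{S}_U$ would call the function once per index with the matching flag.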

[13]

Both teachers generalize well on learnable samples. In particular, for any $i\in\mathcal{S}_L$, the generic teacher produces a large logit in the direction of the target label:

$y_i f_{W_T}(X_i)\ge\Gamma$. (26)

Additionally, both teachers have weights orthogonal to noise patches. Their distinction lies only in their behavior on the unlearnable feature $v$.

[14]

Good Teacher ($f_{W_G}$). This teacher relies only on the learnable feature $u$ and is orthogonal to the unlearnable feature $v$. Consequently, for any unlearnable training sample indexed by $i\in\mathcal{S}_U$,

$y_i f_{W_G}(X_i) = 0$. (27)

[15]

Bad Teacher ($f_{W_B}$). This teacher is aligned with the unlearnable feature $v$. Consequently, for any unlearnable training sample indexed by $i\in\mathcal{S}_U$,

$y_i f_{W_B}(X_i)\ge\Gamma$. (28)

The terms Good Teacher and Bad Teacher refer to their compatibility with the capacity-constrained student. Both teachers are robust models. The Bad Teacher is called “bad” only because its confiden...

[16]

Following Li & Li (2025b), for a sample $(X,y)$, we define the training adversarial example $\tilde X$ as the solution to

$\tilde X = \arg\max_{X'}\ \ell\big(y f_W(X')\big)$ s.t. $\|X'-X\|_\infty\le\epsilon$, $X'-X\in\mathrm{span}\{x_{s(X)}\}$. (29)

Thus, the adversary perturbs only the signal patch along its original direction. In particular, for a learnable sample with signal patch $x_{s(X)} = y\alpha u$, the adversarially perturbed si...

[17]

Standard AT minimizes the logistic loss on the generated adversarial example:

$\mathcal{L}_{AT}(W;X,y) = \ell\big(y f_W(\tilde X)\big)$. (31)

[18]

In AD, we leverage a fixed robust teacher network $f_{W_T}$ to provide soft targets for the student. Following the standard AD setup (e.g., RSLAD (Zi et al., 2021)), the student matches the teacher’s output distribution. In the binary margin-based formulation, this yields the weighted logistic loss:

$\mathcal{L}_{AD}(W;X,y) = \sigma\big(y f_{W_T}(X)\big)\,\ell\big(y f_W(\tilde X)\big) + \sigma\big(-y f_{W_T}(X)\big)\,\ell\big(-y f_W(\tilde X)\big)$. (32)
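To make the role of the teacher weights in (32) concrete, the following sketch evaluates the loss for a single sample as a function of the student's adversarial logit. It assumes that $\ell$ is the usual logistic loss $\ell(z)=\log(1+e^{-z})$ and $\sigma$ the sigmoid; the function names and the value `GAMMA = 5.0` are hypothetical choices made for illustration only.

```python
import math

def sigmoid(z):
    # Numerically stable sigmoid
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logistic_loss(z):
    # l(z) = log(1 + exp(-z)), written in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def ad_loss(y, teacher_logit, student_logit_adv):
    """Weighted logistic loss (32) for one sample."""
    w_pos = sigmoid(y * teacher_logit)    # teacher's soft weight on label y
    w_neg = sigmoid(-y * teacher_logit)   # teacher's soft weight on label -y
    return (w_pos * logistic_loss(y * student_logit_adv)
            + w_neg * logistic_loss(-y * student_logit_adv))

GAMMA = 5.0  # hypothetical value for the teacher margin Gamma
# Loss on one unlearnable sample (y = +1) at student logits 0, 1 and 3.
good = [ad_loss(+1, 0.0, z) for z in (0.0, 1.0, 3.0)]    # Good Teacher: logit 0, see (27)
bad = [ad_loss(+1, GAMMA, z) for z in (0.0, 1.0, 3.0)]   # Bad Teacher: logit >= Gamma, see (28)
```

Under these assumptions the Good Teacher's weights are both 1/2, so the loss on an unlearnable sample is smallest at zero student margin and grows as the student becomes confident. The Bad Teacher's loss keeps decreasing as the student margin grows, which is the pressure toward fitting those samples that the analysis studies.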

[19]

We optimize the empirical risk by gradient descent. At each iteration $t$, the adversarial examples $\{\tilde X_i^{(t)}\}_{i=1}^N$ are generated using the current model $W^{(t)}$, and the parameters are updated by

$W^{(t+1)} = W^{(t)} - \frac{\eta}{N}\sum_{i\in[N]}\nabla_W\mathcal{L}\big(W^{(t)};X_i,y_i\big)$, (33)

where $\mathcal{L}\in\{\mathcal{L}_{AT},\mathcal{L}_{AD}\}$ and $\eta>0$ is the learning rate. We analyze the student after $T$ iterations.

D.6. Parameter and Regi...
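The update rule (33) can be sketched for the logistic loss as follows. Several things here are assumptions rather than the paper's stated implementation: the model is taken to be $f_W(X)=\sum_r\sum_j\langle w_r,x_j\rangle^3$ (the cubic form suggested by the margin decompositions in the appendix), the inputs are treated as already-generated adversarial examples (the inner maximization is not implemented), and all function names and the toy data are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def f(W, X):
    # Assumed two-layer model: sum over filters r and patches j of <w_r, x_j>^3
    return sum(dot(w, x) ** 3 for w in W for x in X)

def logistic_loss(z):
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def psi(z):
    # psi(z) = -l'(z) = 1 / (1 + exp(z)), in a numerically stable form
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def empirical_loss(W, data):
    return sum(logistic_loss(y * f(W, X)) for X, y in data) / len(data)

def gd_step(W, data, eta):
    """One step of W <- W - (eta / N) * sum_i grad_W l(y_i f_W(X_i))."""
    N = len(data)
    grads = [[0.0] * len(w) for w in W]
    for X, y in data:
        coeff = -psi(y * f(W, X)) * y             # dl/df for this sample
        for r, w in enumerate(W):
            for x in X:
                g = coeff * 3.0 * dot(w, x) ** 2  # chain rule through the cubic
                for k, xk in enumerate(x):
                    grads[r][k] += g * xk
    return [[wk - eta / N * gk for wk, gk in zip(w, g)] for w, g in zip(W, grads)]

# Toy data: two learnable samples (signal on coordinate 0) with one noise patch each.
data = [([[1.0, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]], +1),
        ([[-1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.1, 0.0]], -1)]
W0 = [[0.1, 0.05, 0.0, 0.0], [0.2, 0.0, 0.05, 0.0]]
W1 = gd_step(W0, data, eta=0.1)
```

The factor `3.0 * dot(w, x) ** 2` is where the coefficient $3\eta/N$ and the squared inner products in the appendix updates come from under this assumed model.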

[20]

Uniform bounds on noise patches:

$\frac12\sigma_n^2 d \le \|x_{i,j}\|_2^2 \le \frac32\sigma_n^2 d$. (P1)

$|\langle x_{i,j},x_{k,q}\rangle| \le 2\sigma_n^2\sqrt{d\log\frac{16N^2P^2}{\delta}}$. (P2)

$\|x_{i,j}\|_\infty \le \sigma_n\sqrt{2\log\frac{16dNP}{\delta}}$. (P3)
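A quick numerical sanity check of (P1) and (P2) is sketched below. It is illustrative only: the parameter values are hypothetical, the noise patches are drawn as in (23) with the first and last coordinates set to zero, and a single seeded draw is of course not a substitute for the high-probability argument.

```python
import math
import random

def sample_noise(d, sigma_n, rng):
    # Noise patch with zero first and last coordinate (orthogonal to e_1 and e_d)
    return [0.0] + [rng.gauss(0.0, sigma_n) for _ in range(d - 2)] + [0.0]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

d, sigma_n, N, P, delta = 2000, 1.0, 5, 3, 0.1
rng = random.Random(0)
# P - 1 noise patches for each of the N samples
patches = [sample_noise(d, sigma_n, rng) for _ in range(N * (P - 1))]

# (P1): squared norms concentrate around sigma_n^2 * d
sq_norms = [dot(x, x) for x in patches]
p1_holds = all(0.5 * sigma_n**2 * d <= s <= 1.5 * sigma_n**2 * d for s in sq_norms)

# (P2): distinct noise patches are nearly orthogonal
p2_bound = 2 * sigma_n**2 * math.sqrt(d * math.log(16 * N**2 * P**2 / delta))
p2_holds = all(abs(dot(patches[a], patches[b])) <= p2_bound
               for a in range(len(patches)) for b in range(a + 1, len(patches)))
```

With these values the squared norms have mean about $\sigma_n^2 d$ and standard deviation of order $\sigma_n^2\sqrt{d}$, so both checks are expected to pass with a wide margin.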

[21]

Uniform bounds on initialization:

$\|w_r^{(0)}\|_2 \le 2\sigma_0\sqrt d$. (P4)

$\langle w_r^{(0)},e_1\rangle \le \sigma_0\sqrt{2\log\frac{16m}{\delta}}$. (P5)

$\langle w_r^{(0)},x_{i,j}\rangle \le 2\sigma_0\sigma_n\sqrt{d\log\frac{16NmP}{\delta}}$. (P6)

[22]

Maximum initialization margins:

$\frac12\sigma_0 \le \max_{r\in[m]}\langle w_r^{(0)},e_1\rangle \le \sigma_0\sqrt{2\log\frac{16m}{\delta}}$. (P7)

$\frac14\sigma_0\sigma_n\sqrt d \le \max_{r\in[m]} y_i\langle w_r^{(0)},x_{i,j}\rangle \le 2\sigma_0\sigma_n\sqrt{d\log\frac{16NmP}{\delta}}$. (P8)

Then, the event $E$ occurs with probability at least $1-\delta$.

Proof of Lemma E.1. We prove that (P1)–(P6) and the lower bounds in (P7) and (P8) each fail with probability at most $\delta/8$. The upper bounds in (P...

[23]

... $\le \psi\big(y_i f_{W^{(t)}}(\tilde X_i^{(t)})\big) \le 1$. (74)

Proof of Lemma F.4. Fix any learnable sample $i\in\mathcal{S}_L$. We decompose the margin into the signal-patch contribution and the noise-patch residual:

$y_i f_{W^{(t)}}(\tilde X_i^{(t)}) = y_i\langle e_1,\tilde x^{(t)}_{i,s(X_i)}\rangle^3\sum_{r\in[m]}\big(w^{(t)}_{r,1}\big)^3 + \sum_{r\in[m]}\sum_{j\ne s(X_i)} y_i\langle w_r^{(t)},x_{i,j}\rangle^3$. (75)

By the perturbation constraint, $0\le y_i\langle e_1,\tilde x^{(t)}_{i,s(X_i)}\rangle\le\alpha$. Using the assu...
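The decomposition (75) can be checked numerically on a toy instance. The sketch below assumes the cubic model $f_W(X)=\sum_r\sum_j\langle w_r,x_j\rangle^3$ that the decomposition implies, and a signal patch supported on the first coordinate only; the function names and the specific numbers are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def margin(W, X, y):
    # y * f_W(X) with the assumed model f_W(X) = sum_r sum_j <w_r, x_j>^3
    return y * sum(dot(w, x) ** 3 for w in W for x in X)

def margin_decomposed(W, X, y, s):
    """Signal-patch term plus noise-patch residual, as in (75)."""
    signal_coeff = y * X[s][0]                       # y * <e_1, x_s>
    signal = signal_coeff ** 3 * sum(w[0] ** 3 for w in W)
    noise = sum(y * dot(w, x) ** 3
                for w in W for j, x in enumerate(X) if j != s)
    return signal, noise

# Hypothetical toy instance: d = 4, P = 3, m = 2, learnable sample with y = -1.
y, s = -1, 1
X = [[0.0, 0.3, -0.2, 0.0], [-1.5, 0.0, 0.0, 0.0], [0.0, -0.1, 0.4, 0.0]]
W = [[0.5, 0.1, -0.3, 0.0], [-0.2, 0.4, 0.2, 0.0]]
signal, noise = margin_decomposed(W, X, y, s)
```

The signal term depends on the weights only through $\sum_r (w_{r,1})^3$, which is why the analysis can track the first coordinate of each filter separately from the noise coefficients.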

[24]

... (81)

Together with the trivial upper bound $\psi(\cdot)\le 1$, this proves the claimed gradient bound on learnable samples.

Lemma F.5 (Signal Learning Time). We formalize the signal growth statement in Lemma 4.9. Let $T_0$ be the first hitting time at which the maximum signal component reaches the t...

[25]

We track the evolution of this filter

... (86)

This leads to the following recursive bounds:

$w^{(t+1)}_{r,1}\le w^{(t)}_{r,1}+A\big(w^{(t)}_{r,1}\big)^2$, $\quad w^{(t+1)}_{r,1}\ge w^{(t)}_{r,1}+B\big(w^{(t)}_{r,1}\big)^2$. (87)

On the event $E$, the upper bound in (P7) with condition (C4) guarantees that the initial point is small enough compared to the target threshold, $\max_{r\in[m]}w^{(0)}_{r,1}\le\sigma_0\sqrt{2\log\frac{16m}{\delta}}\le\frac12\cdot\frac{C_0}{\alpha m^{1/3}}$, and the lower bound in (P7...
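How the two recursions in (87) bracket a hitting time can be illustrated by iterating them directly. This is a sketch with hypothetical constants: `A` and `B` stand in for the rates in (87), and `x0` and `threshold` for the initial signal component and the target scale; none of these values come from the paper.

```python
def hitting_time(x0, rate, threshold, max_iter=10**7):
    """First t with x_t >= threshold under x_{t+1} = x_t + rate * x_t**2."""
    x, t = x0, 0
    while x < threshold:
        if t >= max_iter:
            raise RuntimeError("threshold not reached")
        x += rate * x * x
        t += 1
    return t

# Hypothetical constants: A >= B > 0 are the rates in the upper and lower recursions.
x0, threshold, A, B = 0.01, 1.0, 0.2, 0.1
t_fast = hitting_time(x0, A, threshold)   # trajectory driven by the upper bound
t_slow = hitting_time(x0, B, threshold)   # trajectory driven by the lower bound

# Continuous-time heuristic dx/dt = c * x^2 gives t ~ (1/x0 - 1/threshold) / c.
ode_fast = (1 / x0 - 1 / threshold) / A
ode_slow = (1 / x0 - 1 / threshold) / B
```

Because $x\mapsto x+cx^2$ is increasing for positive $x$, any positive sequence satisfying both inequalities in (87) stays between the two simulated trajectories, so its hitting time lies between `t_fast` and `t_slow`. Both scale roughly like the inverse of the rate times the initial value, which is the shape of the two-sided bounds on $T_0$.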

[26]

... $\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{un})}$. (94)

In particular, if $T_0$ denotes the hitting time from Lemma F.5, then $T_0\le\hat T_0$, so the same bound holds for all $t\le T_0$.

Proof of Lemma F.6. Define the threshold

$\rho_{th} := 100\exp(2C_0^3)\log\frac{16NmP}{\delta}\cdot\frac{\sigma_0\sigma_n^2 d}{N(\alpha-\epsilon)^3(1-p_{un})}$. (95)

Assume for contradiction that there exists an iteration $t\le\hat T_0$ such that $t$ is the first iteration where

$\max_{i,j,r}\rho^{(t)}_{i,j,r} > \rho_{th}$. (96)

Then, by the minimality of $t$, we have $\rho^{(\tau)}_{i,j,r}\le\rho_{th}$ for all $\tau<t$ and all $i,j,r$. Fix any iteration $\tau<t$. By the decomposition in Lemma F.2, the triangle inequality and the non...
