Toward Understanding Adversarial Distillation: Why Robust Teachers Fail
Pith reviewed 2026-05-22 08:36 UTC · model grok-4.3
The pith
When robust teachers confidently label unlearnable samples, students memorize noise and lose robustness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the two-layer network analysis, confident supervision from the teacher on robustly unlearnable samples forces the student to memorize spurious noise patterns that overpower the learned robust signal and drive robust overfitting. High predictive uncertainty from the teacher on those same samples suppresses noise memorization, allowing the student to achieve robust generalization from learnable features alone. This mismatch between teacher confidence and student representational limits explains the inconsistent outcomes observed when distilling from robust teachers.
What carries the argument
The Robustly Unlearnable Set: the consistent subset of training data on which the student's limited feature-learning capacity prevents acquisition of robust signals, so that teacher confidence on these points determines whether noise memorization overtakes robust learning.
If this is right
- A teacher's predictive entropy on unlearnable samples serves as a direct indicator of the student's eventual robust accuracy.
- Teachers exhibiting high uncertainty on unlearnable samples enable the student to rely solely on learnable robust signals without noise overfitting.
- Robust overfitting arises specifically from the teacher's interaction with unlearnable samples rather than from general capacity mismatch.
- Selecting teachers by their uncertainty on the unlearnable set provides a principled way to improve distillation success.
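The selection rule in the last bullet can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the unlearnable subset has already been identified and is passed in as a boolean mask, and the function names (`predictive_entropy`, `rank_teachers`) and the toy probabilities are hypothetical.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of a (n_samples, n_classes) array."""
    p = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

def mean_entropy_on_subset(probs, unlearnable_mask):
    """Average teacher entropy restricted to samples flagged as unlearnable."""
    return float(predictive_entropy(probs[unlearnable_mask]).mean())

def rank_teachers(teacher_probs, unlearnable_mask):
    """Order teacher names from highest to lowest mean entropy on the subset."""
    scores = {
        name: mean_entropy_on_subset(p, unlearnable_mask)
        for name, p in teacher_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores

# Toy, made-up soft labels (4 samples, 2 classes); the last two samples are
# assumed to belong to the unlearnable subset.
mask = np.array([False, False, True, True])
teacher_probs = {
    "confident_teacher": np.array(
        [[0.99, 0.01], [0.02, 0.98], [0.97, 0.03], [0.01, 0.99]]
    ),
    "uncertain_teacher": np.array(
        [[0.99, 0.01], [0.02, 0.98], [0.55, 0.45], [0.50, 0.50]]
    ),
}
order, scores = rank_teachers(teacher_probs, mask)
print(order, scores)
```

Under the paper's claim, the teacher listed first (highest entropy on the unlearnable subset) would be the preferred candidate; both teachers here behave identically on the remaining samples, so only the subset distinguishes them.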
Where Pith is reading between the lines
- The same unlearnable-set mismatch may appear in deeper networks, suggesting practical methods to detect such samples during training without full theoretical analysis.
- The mechanism could extend to non-adversarial distillation settings where a teacher provides soft labels on data the student cannot fully represent.
- One could test whether actively increasing teacher uncertainty on identified unlearnable samples improves outcomes across multiple architectures.
Load-bearing premise
The feature learning dynamics observed in the two-layer neural network analysis capture the essential mismatch behavior that occurs in deeper networks and real-world image classification tasks.
What would settle it
The claim would be undermined if student robust accuracy stays high even when the teacher assigns confident labels to a large measured Robustly Unlearnable Set, or if teacher entropy on those samples shows no correlation with final student robustness on standard image benchmarks.
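One way to run the second check is a rank correlation across several teachers. The sketch below is only an illustration: the helper names are hypothetical, the four value pairs are placeholders rather than results from the paper, and ties are broken by order of appearance instead of being averaged.

```python
import numpy as np

def ranks(values):
    """0-based ranks; ties are broken by order of appearance (sketch only)."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Placeholder numbers, one entry per candidate teacher (NOT measured data).
teacher_entropy_on_unlearnable = [0.05, 0.20, 0.40, 0.65]
student_robust_acc = [0.41, 0.44, 0.47, 0.52]

rho = spearman(teacher_entropy_on_unlearnable, student_robust_acc)
print(rho)
```

A correlation near zero (or negative) on real measurements would count against the proposed indicator; a clearly positive one would be consistent with it.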
Original abstract
Adversarial Distillation aims to enhance student robustness by guiding the student with a robust teacher's soft labels within the min-max adversarial training framework, yet its success is notoriously inconsistent: a more robust teacher often fails to improve, or even harms, the student's robust generalization. In this paper, we identify a key mechanism of this teacher dependency: the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data -- the Robustly Unlearnable Set. We present a theoretical framework analyzing the feature learning dynamics of a two-layer neural network, demonstrating that this mismatch creates a dichotomy in distillation outcomes. We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal, thereby driving robust overfitting. Conversely, a teacher that exhibits high uncertainty on these samples effectively suppresses noise memorization, allowing the student to rely solely on the learnable signal for robust generalization. We empirically validate our theory across both synthetic simulations and real-image classification datasets, confirming that robust overfitting is driven by the teacher's interaction with unlearnable samples. Finally, we demonstrate that a teacher's predictive entropy on unlearnable samples serves as a strong indicator of student robustness, validating our theoretical framework and offering a principled guideline for robust teacher selection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that inconsistent outcomes in adversarial distillation arise because robust teachers provide confident supervision on a 'Robustly Unlearnable Set' of training samples; this forces the student to memorize spurious noise that overpowers the robust signal and produces robust overfitting. A theoretical analysis of feature learning dynamics in two-layer networks is used to prove a dichotomy: confident teacher labels on unlearnable samples drive noise memorization, while high uncertainty suppresses it and permits robust generalization. Empirical results on synthetic data and real-image classification tasks are presented to show that a teacher's predictive entropy on the unlearnable set is a strong predictor of the student's final robustness.
Significance. If the mechanism generalizes beyond the two-layer setting, the work supplies both a concrete explanation for why stronger teachers can harm student robustness and a practical selection criterion (predictive entropy on the identified unlearnable subset). The combination of a feature-learning proof and a falsifiable empirical indicator is a clear strength.
major comments (1)
- [Theoretical framework] Two-layer analysis: the feature-learning dynamics and resulting dichotomy are derived under assumptions specific to two-layer networks, data distribution, activation, and optimization. The manuscript does not provide intermediate-layer probes, controlled ablations, or transfer arguments showing that the same noise-memorization pathway governs behavior in the deeper architectures used for the image-classification experiments; without such evidence the link between the proven two-layer mechanism and the observed teacher dependency on real data remains unestablished.
minor comments (2)
- [Empirical validation] The construction and identification procedure for the Robustly Unlearnable Set on high-dimensional image data should be stated more explicitly, including any data-exclusion criteria and sensitivity checks.
- [Notation] Notation for the unlearnable-set indicator and the entropy measure could be introduced earlier and used consistently across theory and experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address the major comment below and describe the revisions we intend to make.
-
Referee: Theoretical framework (two-layer analysis): the feature-learning dynamics and resulting dichotomy are derived under assumptions specific to two-layer networks, data distribution, activation, and optimization. The manuscript does not provide intermediate-layer probes, controlled ablations, or transfer arguments showing that the same noise-memorization pathway governs behavior in the deeper architectures used for the image-classification experiments; without such evidence the link between the proven 2-layer mechanism and the observed teacher-dependency on real data remains unestablished.
Authors: We agree that the two-layer analysis is conducted under simplifying assumptions and that the manuscript would benefit from a clearer bridge to the deeper architectures used in the image experiments. The two-layer setting was deliberately chosen to enable a rigorous proof of the feature-learning dichotomy (confident supervision on unlearnable samples drives noise memorization, while high uncertainty suppresses it). The empirical results then test the observable consequence of this mechanism: that a teacher’s predictive entropy on the identified unlearnable subset is a strong predictor of student robustness across standard deep models and real datasets. While we do not currently include layer-wise probes or explicit transfer proofs, this predictive correlation provides indirect but falsifiable support for the proposed pathway. In the revised manuscript we will (i) add a dedicated discussion subsection that situates the two-layer dynamics within the broader literature on feature learning in over-parameterized networks and (ii) include additional controlled ablations that monitor student behavior on the unlearnable subset when using deeper architectures, thereby strengthening the empirical link without altering the core theoretical contribution.
Revision: partial
Circularity Check
No significant circularity; derivation is self-contained
The paper derives its central dichotomy from a self-contained theoretical analysis of feature learning dynamics in a two-layer neural network, with the Robustly Unlearnable Set defined independently via the student's representational limitations rather than from the final robustness metric or fitted outcomes. The proof that confident teacher supervision compels noise memorization follows directly from the stated assumptions on data distribution, activation, and optimization without reducing to a tautology or renamed input. Empirical validation on synthetic and real-image data serves as external confirmation, not a statistically forced prediction. No load-bearing self-citations, uniqueness theorems imported from the authors, or ansatzes smuggled via prior work are present in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the feature learning dynamics of two-layer neural networks under min-max adversarial training capture the key mismatch between teacher confidence and student representational limits.
invented entities (1)
- Robustly Unlearnable Set: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data -- the Robustly Unlearnable Set"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., and Veit, A
URL https://openreview.net/forum?id=HQtTg1try7. Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., and Veit, A. Understanding robustness of transformers for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10231–10241, 2021. Cao, Y., Chen, Z., Belkin, M., and Gu, Q. Benign overf...
-
[2]
Cui, J., Tian, Z., Zhong, Z., Qi, X., Yu, B., and Zhang, H
URL https://openreview.net/forum?id=SSKZPJCt7B. Cui, J., Tian, Z., Zhong, Z., Qi, X., Yu, B., and Zhang, H. Decoupled Kullback-Leibler divergence loss. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=bnZZedw9CM. Dai, S., Mahloujifar, S., and Mittal, P. Parameterizing activati...
-
[3]
Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B., and Madry, A
URL https://openreview.net/forum?id=7gE9V9GBZaI. Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B., and Madry, A. Adversarial robustness as a prior for learned representations. arXiv preprint arXiv:1906.00945, 2019. Gao, R., Cai, T., Li, H., Hsieh, C.-J., Wang, L., and ...
-
[4]
URL https://openreview.net/forum?id=juKVq5dWTR. Li, B. and Li, Y. Adversarial training can provably improve robustness: Theoretical analysis of feature learning process under structured data. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=inLUnCpDIB. Li, B. and Li, Y. On the clean ...
-
[5]
URL https://openreview.net/forum?id=Sys6GJqxl. Lu, M., Wu, B., Yang, X., and Zou, D. Benign oscillation of stochastic gradient descent with large learning rate. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wYmvN3sQpG. Ma, X., Niu, Y., Gu, L., Wang, Y., Zhao, Y., Bailey, J., and Lu, F. U...
-
[6]
Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples
URL https://openreview.net/forum?id=8on9dIUh5v. Oh, J., Song, J., and Yun, C. From linear to nonlinear: Provable weak-to-strong generalization through feature learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=xMiKDqxEE8. Pang, T., Yang, X., Dong, Y., Su, H., and Zhu, J. ...
-
[7]
Intriguing properties of neural networks
IEEE, 2022. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL https://arxiv.org/abs/1312.6199. Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. The space of transferable adve...
-
[8]
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
URL https://openreview.net/forum?id=SyxAb30cY7. Uesato, J., O’donoghue, B., Kohli, P., and Oord, A. Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, pp. 5025–5034. PMLR, 2018. Vaishnavi, P., Eykholt, K., and Rahmati, A. Transferring adversarial robustness through robust representat...
-
[9]
and AutoAttack (Croce & Hein, 2020), to directly maximize the loss. In contrast, the black-box setting assumes
Table 9. Sensitivity of robustly unlearnable samples to varying perturbation bounds. Unlike the baseline subset identification performed at a fixed attack budget, this exper...
-
[10]
Fix two orthogonal robust features: the learnable feature $u := e_1$ and the unlearnable feature $v := e_d$. Let $F := \mathrm{span}\{u, v\}$, and let $\Pi_F$ be the projection matrix onto this subspace. Also fix a partition of the training indices $[N]$ into $\mathcal{S}_L$ and $\mathcal{S}_U$, with $|\mathcal{S}_L| = (1 - p_{un})N$ and $|\mathcal{S}_U| = p_{un}N$.
-
[11]
For each $i \in [N]$, draw the label $y_i$ uniformly from $\{+1, -1\}$, and draw a signal patch index $s(X_i)$ uniformly from $[P]$. The signal patch is generated by
$$x_{i,s(X_i)} = \begin{cases} \alpha y_i u, & \text{if } i \in \mathcal{S}_L \text{ (learnable)}, \\ \alpha y_i v, & \text{if } i \in \mathcal{S}_U \text{ (unlearnable)}. \end{cases} \tag{22}$$
Table 11. Summary of notations used in the theoretical anal...
-
[12]
For each non-signal patch $p \neq s(X_i)$, draw independent Gaussian noise orthogonal to the feature subspace:
$$x_{i,p} \sim \mathcal{N}\big(0, \sigma_n^2 (I_d - \Pi_F)\big). \tag{23}$$
We analyze two regimes of the unlearnable ratio $p_{un}$: (i) the learnable-only regime $p_{un} = 0$ (so $\mathcal{S}_U = \emptyset$), and (ii) the sparse-unlearnable regime $C N^{-1} \le p_{un} \le C^{-1} N^{-1} \log d$ (so $C \le |\mathcal{S}_U| \le C^{-1} \log d$). These two regimes lead ...
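The data model in (22) and (23) can be sampled directly. The following NumPy sketch is an illustration under assumptions, not code from the paper: the function name `sample_dataset`, its argument names, and the array layout (samples, patches, dimension) are hypothetical, and the unlearnable indices are chosen at random because the source only says the partition is fixed. Since the feature subspace is spanned by the first and last coordinate vectors, projecting isotropic Gaussian noise away from it amounts to zeroing those two coordinates.

```python
import numpy as np

def sample_dataset(N, P, d, alpha, sigma_n, p_un, seed=0):
    """Sample N inputs with P patches of dimension d (requires d >= 2)."""
    rng = np.random.default_rng(seed)
    u = np.zeros(d); u[0] = 1.0    # learnable feature e_1
    v = np.zeros(d); v[-1] = 1.0   # unlearnable feature e_d

    # Partition of the indices: a p_un fraction is marked unlearnable
    # (random choice is an assumption of this sketch).
    unlearnable = np.zeros(N, dtype=bool)
    n_un = int(round(p_un * N))
    unlearnable[rng.choice(N, size=n_un, replace=False)] = True

    y = rng.choice([-1, 1], size=N)     # labels uniform on {+1, -1}
    s = rng.integers(0, P, size=N)      # signal patch index uniform on [P]

    # Noise patches: N(0, sigma_n^2 (I_d - Pi_F)), i.e. isotropic noise with
    # the e_1 and e_d coordinates removed.
    X = sigma_n * rng.standard_normal((N, P, d))
    X[:, :, 0] = 0.0
    X[:, :, -1] = 0.0

    # Signal patch: alpha * y_i * u (learnable) or alpha * y_i * v (unlearnable).
    feature = np.where(unlearnable[:, None], v, u)   # shape (N, d)
    X[np.arange(N), s] = alpha * y[:, None] * feature
    return X, y, s, unlearnable

X, y, s, mask = sample_dataset(N=8, P=3, d=6, alpha=2.0, sigma_n=0.5, p_un=0.25)
print(X.shape, int(mask.sum()))
```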
-
[13]
Both teachers generalize well on learnable samples. In particular, for any $i \in \mathcal{S}_L$, the generic teacher produces a large logit in the direction of the target label:
$$y_i f_{W_T}(X_i) \ge \Gamma. \tag{26}$$
Additionally, both teachers have weights orthogonal to noise patches. Their distinction lies only in their behavior on the unlearnable feature $v$.
-
[14]
Good Teacher ($f_{W_G}$). This teacher relies only on the learnable feature $u$ and is orthogonal to the unlearnable feature $v$. Consequently, for any unlearnable training sample indexed by $i \in \mathcal{S}_U$,
$$y_i f_{W_G}(X_i) = 0. \tag{27}$$
-
[15]
Bad Teacher ($f_{W_B}$). This teacher is aligned with the unlearnable feature $v$. Consequently, for any unlearnable training sample indexed by $i \in \mathcal{S}_U$,
$$y_i f_{W_B}(X_i) \ge \Gamma. \tag{28}$$
The terms Good Teacher and Bad Teacher refer to their compatibility with the capacity-constrained student. Both teachers are robust models. The Bad Teacher is called “bad” only because its confiden...
-
[16]
Following Li & Li (2025b), for a sample $(X, y)$, we define the training adversarial example $\tilde X$ as the solution to
$$\tilde X = \arg\max_{X'} \ell\big(y f_W(X')\big) \quad \text{s.t.} \quad \|X' - X\|_\infty \le \epsilon, \quad X' - X \in \mathrm{span}\{x_{s(X)}\}. \tag{29}$$
Thus, the adversary perturbs only the signal patch along its original direction. In particular, for a learnable sample with signal patch $x_{s(X)} = y\alpha u$, the adversarially perturbed si...
-
[17]
Standard AT minimizes the logistic loss on the generated adversarial example:
$$L_{AT}(W; X, y) = \ell\big(y f_W(\tilde X)\big). \tag{31}$$
-
[18]
In AD, we leverage a fixed robust teacher network $f_{W_T}$ to provide soft targets for the student. Following the standard AD setup (e.g., RSLAD (Zi et al., 2021)), the student matches the teacher’s output distribution. In the binary margin-based formulation, this yields the weighted logistic loss:
$$L_{AD}(W; X, y) = \sigma\big(y f_{W_T}(X)\big)\, \ell\big(y f_W(\tilde X)\big) + \sigma\big(-y f_{W_T}(X)\big)\, \ell\big(-y f_W(\tilde X)\big). \tag{32}$$
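The weighted logistic loss (32) makes the teacher dependence easy to see numerically. The sketch below works at the level of scalar margins and is an illustration only: it assumes the logistic loss has the standard form log(1 + exp(-z)) and that the weighting function is the sigmoid, and the function names are hypothetical. Under those assumptions the derivative of the loss with respect to the student margin simplifies to sigmoid(student margin) minus sigmoid(teacher margin), so a teacher margin of zero (the Good Teacher on an unlearnable sample) exerts no pull on a student whose margin is zero, while a large teacher margin (the Bad Teacher) pushes the student margin upward.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(z):
    """Assumed form of the logistic loss: log(1 + exp(-z)), computed stably."""
    return np.logaddexp(0.0, -z)

def ad_loss(teacher_margin, student_margin):
    """Eq. (32) with teacher_margin = y * f_teacher(X) and
    student_margin = y * f_student(X_adv)."""
    w_pos = sigmoid(teacher_margin)    # weight on the labeled class
    w_neg = sigmoid(-teacher_margin)   # weight on the opposite class
    return (w_pos * logistic_loss(student_margin)
            + w_neg * logistic_loss(-student_margin))

def ad_loss_grad(teacher_margin, student_margin):
    """Derivative of ad_loss with respect to student_margin."""
    return (-sigmoid(teacher_margin) * sigmoid(-student_margin)
            + sigmoid(-teacher_margin) * sigmoid(student_margin))

# Unlearnable sample, student margin still at zero:
print(ad_loss_grad(0.0, 0.0))   # uncertain teacher: no push on the margin
print(ad_loss_grad(5.0, 0.0))   # confident teacher: negative, margin is pushed up
```

Because the student cannot raise its margin on an unlearnable sample through the signal patch, any margin increase demanded by a confident teacher has to come from the noise patches, which is the memorization pathway the analysis describes.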
-
[19]
We optimize the empirical risk by gradient descent. At each iteration $t$, the adversarial examples $\{\tilde X_i^{(t)}\}_{i=1}^N$ are generated using the current model $W^{(t)}$, and the parameters are updated by
$$W^{(t+1)} = W^{(t)} - \frac{\eta}{N} \sum_{i \in [N]} \nabla_W L\big(W^{(t)}; X_i, y_i\big), \tag{33}$$
where $L \in \{L_{AT}, L_{AD}\}$ and $\eta > 0$ is the learning rate. We analyze the student after $T$ iterations. D.6. Parameter and Regi...
-
[20]
Uniform bounds on noise patches:
$$\tfrac{1}{2}\sigma_n^2 d \le \|x_{i,j}\|_2^2 \le \tfrac{3}{2}\sigma_n^2 d. \tag{P1}$$
$$|\langle x_{i,j}, x_{k,q}\rangle| \le 2\sigma_n^2 \sqrt{d \log\frac{16 N^2 P^2}{\delta}}. \tag{P2}$$
$$\|x_{i,j}\|_\infty \le \sigma_n \sqrt{2 \log\frac{16 d N P}{\delta}}. \tag{P3}$$
-
[21]
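Uniform bounds on initialization:
$$\|w_r^{(0)}\|_2 \le 2\sigma_0\sqrt{d}. \tag{P4}$$
$$\langle w_r^{(0)}, e_1\rangle \le \sigma_0 \sqrt{2 \log\frac{16m}{\delta}}. \tag{P5}$$
$$\langle w_r^{(0)}, x_{i,j}\rangle \le 2\sigma_0\sigma_n \sqrt{d \log\frac{16 N m P}{\delta}}. \tag{P6}$$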
-
[22]
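Maximum initialization margins:
$$\tfrac{1}{2}\sigma_0 \le \max_{r \in [m]} \langle w_r^{(0)}, e_1\rangle \le \sigma_0 \sqrt{2 \log\frac{16m}{\delta}}. \tag{P7}$$
$$\tfrac{1}{4}\sigma_0\sigma_n\sqrt{d} \le \max_{r \in [m]} y_i \langle w_r^{(0)}, x_{i,j}\rangle \le 2\sigma_0\sigma_n \sqrt{d \log\frac{16 N m P}{\delta}}. \tag{P8}$$
Then, the event $E$ occurs with probability at least $1 - \delta$. Proof of Lemma E.1. We prove that (P1)–(P6) and the lower bounds in (P7) and (P8) each fail with probability at most $\delta/8$. The upper bounds in (P...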
-
[23]
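$\le \psi\big(y_i f_{W^{(t)}}(\tilde X_i^{(t)})\big) \le 1. \quad (74)$ Proof of Lemma F.4. Fix any learnable sample $i \in \mathcal{S}_L$. We decompose the margin into the signal-patch contribution and the noise-patch residual:
$$y_i f_{W^{(t)}}(\tilde X_i^{(t)}) = y_i \langle e_1, \tilde x^{(t)}_{i,s(X_i)}\rangle^3 \sum_{r \in [m]} \big(w_{r,1}^{(t)}\big)^3 + \sum_{r \in [m]} \sum_{j \neq s(X_i)} y_i \langle w_r^{(t)}, x_{i,j}\rangle^3. \tag{75}$$
By the perturbation constraint, $0 \le y_i \langle e_1, \tilde x^{(t)}_{i,s(X_i)}\rangle \le \alpha$. Using the assu...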
-
[24]
(81) Together with the trivial upper boundψ(·)≤1, this proves the claimed gradient bound on learnable samples. 37 Toward Understanding Adversarial Distillation: Why Robust Teachers Fail Lemma F.5(Signal Learning Time).We formalize the signal growth statement in Lemma 4.9. Let T0 be the first hitting time at which the maximum signal component reaches the t...
-
[25]
We track the evolution of this filter
(86) This leads to the following recursive bounds: w(t+1) r,1 ≤w (t) r,1 +A(w (t) r,1)2, w (t+1) r,1 ≥w (t) r,1 +B(w (t) r,1)2.(87) On the event E, the upper bound in (P7) with condition (C4) guarantees that the initial point is small enough compared to the target threshold , maxr∈[m] w(0) r,1 ≤σ 0 q 2 log 16m δ ≤ 1 2 C0 αm1/3 , and the lower bound in (P7...
-
[26]
Proof of Lemma F .6.Define the threshold ρth := 100 exp(2C3
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) .(94) In particular, ifT 0 denotes the hitting time from Lemma F.5, thenT 0 ≤ ˆT0, so the same bound holds for allt≤T 0. Proof of Lemma F .6.Define the threshold ρth := 100 exp(2C3
-
[27]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) .(95) Assume for contradiction that there exists an iterationt≤ ˆT0 such thattis the first iteration where max i,j,r ρ(t) i,j,r > ρth.(96) Then, by the minimality oft, we haveρ (τ) i,j,r ≤ρ th for allτ < tand alli, j, r. Fix any iteration τ < t . By the decomposition in Lemma F.2, the triangle inequality and the non...
-
[28]
Summing the update rule from Lemma F.2 overτ= 0,
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) ·2σ 2 nd ≤ r 20 3 σ0σn s dlog 16N mP δ , (99) where the second inequality follows from condition (C2), and the final inequality holds under conditions (C3) and (C6). Summing the update rule from Lemma F.2 overτ= 0, . . . , t−1and usingψ yifW(τ) ( ˜X(τ) i ) ≤1, we obtain ρ(t) i,j,r = t−1X τ=0 3η N ψ yifW(τ) ( ˜X(τ) i...
-
[29]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) .(102) F.4. Post-T0 Control on Learnable Samples Once the signal reaches its target scale at time T0, learnable samples no longer drive large updates. We show that the signal weights remain bounded and that the cumulative gradient contribution from learnable samples is controlled. Lemma F.7(Signal Weight Bound).On t...
-
[30]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) + 18m2/3 log1/3 T C2 0 1− ϵ α 3 N α2 18σ2 0σ2 ndlog 16N mP δ + 18p2 unN2P 2 ρSU max 2 σ4 ndlog 16N2P 2 δ . (175) Using conditions (C6) and (C7), for a sufficiently large constantC, the threshold expansion satisfies ρth ≤Clog 16N mP δ σ0σ2 nd N α3 + Cm2/3σ2 0σ2 ndlog 1/3 Tlog 16N mP δ N α2 + Cp2 unN P2m2/3 ρSU max 2 ...
-
[31]
It therefore remains to consider the regimet+ 1> T 0
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) ≤ρ th (183) for everyi∈ S L, everyj̸=s(X i), and everyr∈[m]. It therefore remains to consider the regimet+ 1> T 0. In this case, for everyτ∈[T 0, t], the induction hypothesis yields ρ(τ) i,j,r ≤ρ th for alli∈ S L, j̸=s(X i), r∈[m].(184) 48 Toward Understanding Adversarial Distillation: Why Robust Teachers Fail Fix a...
-
[32]
(193) This provesρ (t+1) i,j,r ≤ρ th
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) + 18m2/3 log1/3 T C2 0 1− ϵ α 3 N α2 18σ2 0σ2 ndlog 16N mP δ + 18p2 unN2P 2 ρSU max 2 σ4 ndlog 16N2P 2 δ =ρ th. (193) This provesρ (t+1) i,j,r ≤ρ th. Therefore, by (181), ρ(t) i,j,r ≤ σ0 σn √ d + punN σ2 n √ d 100 log2/3 T ρSU max 2 .(194) This proves (171). By (174), (171) gives ρ(t) i,j,r ≤ σ0 σn √ d + punN σ2nd3/...
-
[33]
Thus it remains to considert∈(T 0, T]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 .(198) Consequently, for everyi∈[N],j̸=s(X i),r∈[m], andt≤T, ⟨w(t) r ,x i,j⟩ ≤3σ 0σn s dlog 16N mP δ .(199) Proof of Lemma G.1.Ift≤T 0, the claim follows directly from Lemma F.6. Thus it remains to considert∈(T 0, T]. Define the threshold ρth := max i,j,r ρ(T0) i,j,r + 30m2/3 log 16N mP δ log1/3 T σ2 0σ2 nd C2 0 1− ϵ α 3 N ...
-
[34]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 .(201) Therefore, ρth ≤100 exp(2C 3
-
[35]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 + 30m2/3 log 16N mP δ log1/3 T σ2 0σ2 nd C2 0 1− ϵ α 3 N α2 = log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 100 exp(2C3
-
[36]
+ 30m2/3ασ0 log1/3 T C2 0 ! ≤200 exp(2C 3
-
[37]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 , (202) where the last inequality follows from condition (C4). Assume for contradiction that there exists an iterationt∈(T 0, T]such thattis the first iteration where max i,j,r ρ(t) i,j,r > ρth.(203) Then, by the minimality oft, we haveρ (τ) i,j,r ≤ρ th for allτ∈[T 0, t−1]and alli, j, r. Fix any iteration τ < t . By the dec...
-
[38]
It remains to record the corresponding noise-response bound
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 .(209) Combining this bound fort∈(T 0, T]with Lemma F.6 fort≤T 0 proves the claimed coefficient bound for allt≤T. It remains to record the corresponding noise-response bound. The inner-product estimate above used only the event bounds (P1), (P2), (P6), and a uniform coefficient bound by ρth. Therefore, after the coefficient...
-
[39]
There exists at least one filter $r \in [m]$ whose first coordinate is aligned with the learnable feature:
$$w_{r,1}^{(T)} \ge \tilde\Omega(\alpha^{-1}). \tag{211}$$
-
[40]
The network barely memorizes the noise features. More precisely, for every filter $r \in [m]$ and every noise patch $(i, j)$ with $i \in [N]$ and $j \neq s(X_i)$,
$$\langle w_r^{(T)}, x_{i,j}\rangle \le \tilde O(\sigma_0\sigma_n\sqrt{d}). \tag{212}$$
Consequently, both the robust training error and robust test error are $o(1)$. Proof of Theorem G.2. Step 1: Signal Alignment. By Lemma F.5, after $T_0 \le \frac{5\exp(2C_0^3)}{\eta(\alpha - \epsilon)^3\sigma_0}$ iterations,...
-
[41]
log 16N mP δ · σ0σ3 nd3/2 (α−ϵ) 3 ≤2σ 0 √ d+ 300N Pexp(2C 3 0)σ0 ≤3σ 0 √ d, (219) where the second inequality follows from conditions (C3) and (C6), and the last inequality follows from condition (C2). Therefore, ∥z(T) r ∥1 ≤ √ d∥z (T) r ∥2 ≤3σ 0dfor allr∈[m].(220) 53 Toward Understanding Adversarial Distillation: Why Robust Teachers Fail Step 4.2: Tail r...
-
[42]
$\le \psi\big(y_i f_{W^{(t)}}(\tilde X_i^{(t)})\big) \le 1. \quad (249)$ Proof of Lemma G.3. Consider an unlearnable sample $i \in \mathcal{S}_U$. Its signal patch is aligned with $v = e_d$, while the weights remain orthogonal to $v$. Hence the signal-patch contribution vanishes, and the margin is determined entirely by the noise patches:
$$y_i f_{W^{(t)}}(\tilde X_i^{(t)}) = \sum_{r \in [m]} \sum_{j \neq s(X_i)} y_i \langle w_r^{(t)}, x_{i,j}\rangle^3. \tag{250}$$
On the event $E$...
-
[43]
(266) Together with the trivial upper boundψ(·)≤1, this proves the claimed gradient bound on unlearnable samples. Lemma G.4(Dynamics of the Maximum Shifted Noise Coefficient).On the event E defined in Lemma E.1, suppose that the maximum shifted noise coefficient ˆρ(t) max defined in Lemma G.3 satisfies ˆρ(t) max ≤ C1 (mP) 1/3σ2nd. Then the discrete dynami...
-
[44]
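(292) Therefore,
$$\rho^{(t+1)}_{i_t,j_t,r_t} - \rho^{(t)}_{i_t,j_t,r_t} \ge \frac{3\eta}{N} \cdot \frac{1}{1 + \exp(4C_1^3)} \left(\frac{1}{3}\hat\rho^{(t)}_{\max}\sigma_n^2 d\right)^2 = \frac{\eta\sigma_n^4 d^2}{3N(1 + \exp(4C_1^3))} \left(\hat\rho^{(t)}_{\max}\right)^2. \tag{293}$$
Hence,
$$\hat\rho^{(t+1)}_{\max} \ge \hat\rho^{(t)}_{\max} + \frac{\eta\sigma_n^4 d^2}{3N(1 + \exp(4C_1^3))} \left(\hat\rho^{(t)}_{\max}\right)^2. \tag{294}$$
This proves the stated dynamics for the maximum shifted noise coefficient. Lemma G.5 (Noise Memorization Time). We formalize the noise...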
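The lower bound in (294) is a quadratic-growth recursion of the form x(t+1) >= x(t) + c x(t)^2, where c stands for the constant multiplying the squared coefficient. A quick way to get intuition for the resulting memorization time is to iterate the equality case and compare it with the continuous-time analogue dx/dt = c x^2, which reaches a threshold from x0 after (1/x0 - 1/threshold)/c time units. The sketch below is only an illustration: the function names and the numeric values of x0, c, and the threshold are arbitrary and are not taken from the paper.

```python
def hitting_time(x0, c, threshold, max_iters=10**7):
    """First t with x_t >= threshold for the recursion x_{t+1} = x_t + c * x_t**2."""
    x, t = x0, 0
    while x < threshold and t < max_iters:
        x = x + c * x * x
        t += 1
    return t

def ode_time(x0, c, threshold):
    """Time for dx/dt = c * x**2 to grow from x0 to threshold."""
    return (1.0 / x0 - 1.0 / threshold) / c

# Arbitrary illustrative values.
T_discrete = hitting_time(x0=0.1, c=0.01, threshold=1.0)
T_ode = ode_time(x0=0.1, c=0.01, threshold=1.0)
print(T_discrete, T_ode)
```

For small c the two times are close, and both scale like 1/(c x0): a smaller initial coefficient or a smaller step constant delays the point at which the coefficient becomes large.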
-
[45]
There exists at least one filter $r \in [m]$ whose first coordinate is aligned with the learnable feature:
$$w_{r,1}^{(T)} \ge \tilde\Omega(\alpha^{-1}). \tag{317}$$
-
[46]
The network memorizes the noise features on at least one unlearnable sample. More precisely, there exists an index $i \in \mathcal{S}_U$ such that
$$\max_{r \in [m],\, j \neq s(X_i)} y_i \langle w_r^{(T)}, x_{i,j}\rangle \ge \tilde\Omega(1). \tag{318}$$
Consequently, the robust training error is $o(1)$, while the robust test error is at least $\tfrac{1}{2} - o(1)$. Proof of Theorem G.6. Step 1: Signal Alignment. By Lemma F.5, after $T_0 \le 5$ ...
-
[47]
There exists at least one filter $r \in [m]$ whose first coordinate is aligned with the learnable feature:
$$w_{r,1}^{(T)} \ge \tilde\Omega(\alpha^{-1}). \tag{372}$$
-
[48]
The network memorizes the noise features on at least one unlearnable sample. More precisely, there exists an index $i \in \mathcal{S}_U$ such that
$$\max_{r \in [m],\, j \neq s(X_i)} y_i \langle w_r^{(T)}, x_{i,j}\rangle \ge \tilde\Omega(1). \tag{373}$$
Consequently, the robust training error is $o(1)$, while the robust test error is at least $\tfrac{1}{2} - o(1)$. Proof of Theorem H.3. By Lemma H.1, on learnable samples, the AD gradient ...
-
[49]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 .(376) For unlearnable samples, max i∈SU , j̸=s(Xi), r∈[m] ρ(t) i,j,r ≤ σ0 q log 16N mP δ σn √ d .(377) See Lemma H.7 for the verification of this hypothesis. Lemma H.5(Inner Product Bound under a Good Teacher).Consider Adversarial Distillation under a Good Teacher. On the event E defined in Lemma E.1, suppose that Hypothes...
-
[50]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 · 3 2 σ2 nd + (1−p un)N P·200 exp(2C 3
-
[51]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 ·2σ 2 n s dlog 16N2P 2 δ +p unN P· σ0 q log 16N mP δ σn √ d ·2σ 2 n s dlog 16N2P 2 δ . (381) 70 Toward Understanding Adversarial Distillation: Why Robust Teachers Fail By conditions (C3), (C4) and (C6), the second and third terms are strictly bounded by 1 4 σ0σn s dlog 16N mP δ .(382) Moreover, by conditions (C2) and (C7), ...
-
[52]
≤ψ yifW(t)( ˜X(t) i ) ≤1.(386) Second, the hitting timeT 0 := min n t≥0 maxr∈[m] w(t) r,1 > C0 αm1/3 o satisfies 1 12 q 2 log 16m δ ηα3σ0 ≤T 0 ≤ 5 exp(2C3 0) η(α−ϵ) 3(1−p un)σ0 .(387) Third, for everyt≤T 0, max i∈SL, j̸=s(Xi), r∈[m] ρ(t) i,j,r ≤100 exp(2C 3
-
[53]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) .(388) Fourth, for everyt≤Tand everyr∈[m], w(t) r,1 ≤ 3 log1/3 T α .(389) Finally, for everyt∈[T 0 + 1, T]and every learnable samplei∈ S L, ψ yifW(t)( ˜X(t) i ) ≤ 4m2/3 log1/3 T C2 0 1− ϵ α 3 ηα2(t−T 0) .(390) Proof of Lemma H.6. By Lemma H.5, the assumption Hypothesis H.4 implies that the conclusion of Hypothesis F...
-
[54]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) .(392) Suppose for contradiction that there existst≤T 0 such thattis the first iteration satisfying max i∈SL, j̸=s(Xi), r∈[m] ρ(t) i,j,r > ρth.(393) Then, by the minimality oft, max i∈SL, j̸=s(Xi), r∈[m] ρ(τ) i,j,r ≤ρ th for allτ < t.(394) Under this bound, together with the unlearnable-sample coefficient bound in H...
-
[55]
(402) By conditions (C3), (C4) and (C6), the learnable cross term satisfies (1−p un)N P·200 exp(2C 3
log 16N mP δ σ0σ2 nd N(α−ϵ) 3 ·2σ 2 n s dlog 16N2P 2 δ +p unN P· σ0 q log 16N mP δ σn √ d ·2σ 2 n s dlog 16N2P 2 δ . (402) By conditions (C3), (C4) and (C6), the learnable cross term satisfies (1−p un)N P·200 exp(2C 3
-
[56]
log 16N mP δ σ0σ2 nd N(α−ϵ) 3 ·2σ 2 n s dlog 16N2P 2 δ ≤ 1 4 σ0σn s dlog 16N mP δ . (403) Similarly, by conditions (C2), (C7) and (C8), the unlearnable cross term satisfies punN P· σ0 q log 16N mP δ σn √ d ·2σ 2 n s dlog 16N2P 2 δ ≤ 1 4 σ0σn s dlog 16N mP δ . (404) Combining this with the initialization term and the self term gives, for everyτ < t, ⟨w(τ) ...
-
[57]
Hence it remains to considert∈(T 0, T]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) .(411) By condition (C7), this is bounded by the claimed learnable-sample threshold. Hence it remains to considert∈(T 0, T]. Define ρth := 200 exp(2C3
-
[58]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 .(412) Fix anyτ < t, anyr∈[m], anyi∈ S L, and anyj̸=s(X i). By the decomposition in Lemma F.2, ⟨w(τ) r ,x i,j⟩ ≤ ⟨w(0) r ,x i,j⟩ + ρ(τ) i,j,r ∥xi,j∥2 2 + X k∈SL, q̸=s(X k) (k,q)̸=(i,j) ρ(τ) k,q,r |⟨xk,q,x i,j⟩|+ X k∈SU , q̸=s(X k) ρ(τ) k,q,r |⟨xk,q,x i,j⟩|. (413) Applying the learnable-sample bound ρ(τ) k,q,r ≤ρ th, the ass...
-
[59]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3(1−p un) + 108m2/3 log1/3 T·σ 2 0σ2 ndlog 16N mP δ C2 0 1− ϵ α 3 N α2 . (421) Under conditions (C4) and (C7), the two terms on the right-hand side are together bounded byρ th. Therefore, ρ(t) i,j,r ≤ρ th,(422) which contradicts the assumption thattis the first iteration where the learnable-sample bound fails. Hence the learn...
-
[60]
There exists at least one filter $r \in [m]$ whose first coordinate is aligned with the learnable feature:
$$w_{r,1}^{(T)} \ge \tilde\Omega(\alpha^{-1}). \tag{424}$$
-
[61]
The network barely memorizes the noise features. More precisely, for every filter $r \in [m]$ and every noise patch $(i, j)$ with $i \in [N]$ and $j \neq s(X_i)$,
$$\langle w_r^{(T)}, x_{i,j}\rangle \le \tilde O(\sigma_0\sigma_n\sqrt{d}). \tag{425}$$
Consequently, both the robust training error and robust test error are $o(1)$. Proof of Theorem H.8. We compar...
-
[62]
log 16N mP δ · σ0σ2 nd N(α−ϵ) 3 · r 3 2 σn √ d +p unN P· σ0 q log 16N mP δ σn √ d · r 3 2 σn √ d. (439) The learnable contribution is bounded as in Theorem G.2, and the unlearnable contribution is bounded by conditions (C2), (C7) and (C8). Therefore, ∥z(T) r ∥2 ≤3σ 0 √ dfor allr∈[m].(440) At this point, the two structural conditions required in Steps 4.2 ...