Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift

Bingkui Tong; Jiacheng Cui; Jiacheng Liu; Xiaohan Zhao; Xinyue Bi; Zhiqiang Shen



T0 review · 3 major / 6 minor · reviewed 2026-08-03 · deepseek-v4-flash


Pith's one-line read This paper argues that hard labels, used as a mid-training calibration stage, correct the semantic drift that appears when soft-label storage in dataset distillation is aggressively compressed, and demonstrates a 100x storage cut at higher

desk verdict A genuinely useful training recipe (soft→hard→soft calibration) with strong matched ablations, but the theory as written does not establish the claimed semantic-drift bound; worth a serious referee after the theoretical claims are revised.

arxiv 2512.15647 v3 pith:PANFMDXA submitted 2025-12-17 cs.CV

Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen

classification cs.CV

keywords dataset distillation · soft labels · local semantic drift · hard-label calibration · storage efficiency · knowledge · ImageNet-1K · soft-hard-soft schedule

verification ladder T0 review T1 audit T2 compute T3 formal

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The reading

The paper's central claim is that the standard practice of storing per-crop teacher soft labels in dataset distillation has a hidden failure mode: when only a few crops per image are kept to save storage, the soft targets for visually ambiguous crops drift away from the image's true semantics, and this 'local semantic drift' is not a nuisance but an irreducible statistical bias that shrinks only as the number of crops grows. The paper further claims that hard labels are exactly the right corrective because they are content-agnostic anchors, and a soft-to-hard-to-soft training schedule lets a student first learn the teacher's fine-grained structure, then use hard labels to suppress crop-specific variance, then re-align with soft labels. If correct, this re-establishes hard labels as a useful supervision stage rather than a coarse fallback, and makes aggressive soft-label compression practical. The paper reports 42.7% top-1 on ImageNet-1K at 285M soft-label storage, a 100x reduction over the full soft-label set and +9.0 points over the previous state of the art at that budget.

What carries the argument

The central object is the per-crop teacher prediction covariance, measured by the LVSD ratio R, which quantifies how much local views of an image disagree about class semantics; nonzero covariance is what makes limited soft-label coverage biased. The correcting mechanism is the soft-hard-soft schedule: first an ERM-fitted soft-label phase, then a hard-label phase using label-smoothed and CutMix-augmented one-hot targets, then a final soft refinement phase. The theoretical load-bearer is the gradient-alignment ratio D/m0, which Theorem 3 uses to lower-bound the cosine between soft and hard gradients, and which is assumed to decrease as training proceeds.
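The staged schedule described above can be sketched as an epoch-to-phase lookup. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the stage lengths in the usage note are placeholder values chosen only to show the mechanics.

```python
def phase_for_epoch(epoch: int, t_a: int, t_b: int, t_c: int) -> str:
    """Return the supervision type for a 0-indexed epoch in a soft-hard-soft schedule.

    t_a, t_b, t_c are the lengths (in epochs) of Stage A (soft),
    Stage B (hard) and Stage C (soft). All names are illustrative.
    """
    total = t_a + t_b + t_c
    if epoch < 0 or epoch >= total:
        raise ValueError(f"epoch {epoch} is outside the {total}-epoch schedule")
    if epoch < t_a:
        return "soft"   # Stage A: fit the stored teacher soft labels
    if epoch < t_a + t_b:
        return "hard"   # Stage B: calibrate against hard labels
    return "soft"       # Stage C: final soft-label refinement


def build_schedule(t_a: int, t_b: int, t_c: int) -> list[str]:
    """Expand the three stage lengths into one label type per epoch."""
    return [phase_for_epoch(e, t_a, t_b, t_c) for e in range(t_a + t_b + t_c)]
```

For example, `build_schedule(100, 100, 100)` yields a 300-epoch plan whose middle third uses hard labels. A training loop would select its loss from the returned string each epoch.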

What would settle it

Take a fixed distilled dataset at SLC=50 and train two otherwise identical students for the same 300 epochs, one with soft-hard-soft and one with soft-only for all 300 epochs; if the soft-only student matches or beats HALD, the calibration claim collapses. Equivalently, measure the per-crop teacher covariance on the actual distilled crops: if the LVSD ratio R is close to 1 for most images, the drift mechanism that motivates the method is absent.


Extended reading notes

Core claim

Local-view semantic drift (LVSD) is the paper's central discovery: with a finite number s of crops per image, the covariance of per-crop teacher predictions is nonzero, and Theorem 1 proves the expected gap between the s-crop loss and the ideal full-coverage loss is at least sigma/sqrt(s) times a constant. Theorem 2 turns this into an Omega(1/s) lower bound on excess population risk, so the drift is a bias, not noise. The remedy is a soft-hard-soft schedule: Theorem 3 shows soft and hard gradient directions become aligned, lower-bounded by 1 minus a ratio of inter-class gradient spread over minimal gradient norm, and Corollary 1 converts that alignment into an effective sample size of s/(1-r

Load-bearing premise

The central premise is the unproved monotone-alignment assumption that the inter-class gradient spread D decreases faster than the minimal gradient norm m0 during early soft training; if a student's soft and hard gradient directions do not become increasingly aligned, the soft-hard-soft ordering has no theoretical support and rests only on the empirical observation that alternative schedules fail badly.

Editorial extensions

If this is right

With soft-hard-soft training, reducing stored soft labels from roughly 28.3 GB to 285 MB does not cost accuracy; at IPC=50 on ImageNet-1K the method reports 42.7% top-1, +9.0 points over the soft-only prior art at the same budget.
The benefit grows with drift severity: gains over soft-only increase as the soft-label budget shrinks, matching the predicted 1/sqrt(s) bias.
Hard labels should be used as a separate calibration stage, not merged into a joint loss; mixing them into a weighted objective hurts performance, while the staged schedule improves it.
Stage ordering is load-bearing: soft-hard-soft beats hard-soft, soft-hard, and hard-soft-hard at equal budget, supporting the claim that early soft training is needed for gradient alignment.
Gains transfer across architectures and data sources, including lightweight nets, ViT-Tiny, and real (non-synthetic) ImageNet subsets.
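The storage figures quoted above can be checked with one line of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and that the paper's "285M" means 285 MB, which the referee report flags as unclear.

```python
def compression_ratio(full_bytes: float, compressed_bytes: float) -> float:
    """How many times smaller the compressed label store is."""
    return full_bytes / compressed_bytes


GB = 1e9  # decimal gigabyte (assumption)
MB = 1e6  # decimal megabyte (assumption)

# 28.3 GB of full soft labels versus 285 MB at the compressed budget.
ratio = compression_ratio(28.3 * GB, 285 * MB)
print(f"compression ratio: {ratio:.1f}x")
```

Under these assumptions the ratio is a little over 99, which is consistent with the rounded "100x" in the abstract.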

Reading between the lines

Editorial extensions of the paper, not claims the author makes directly.

The theory proves that drift exists and that the schedule can reduce variance, but it does not prove that the soft-hard-soft order is optimal; a natural extension is to make the switch point data-driven, starting the hard phase when measured gradient cosine similarity crosses a threshold rather than after a fixed epoch budget.
Corollary 1's effective-sample-size argument suggests a continuous annealing scheme, ramping hard-label weight up and down instead of discrete stages, might achieve similar or better calibration; the paper does not test this.
Because gains are largest under extreme compression, the method points toward a near-zero-soft-label regime where teacher predictions are queried online or regenerated from a tiny stored set, a direction the paper leaves implicit.
The lower bounds are for objective and risk mismatch, not directly for test accuracy; on highly non-convex real pipelines, the empirical results carry more weight than the theorems, which rely on local quadratic assumptions.
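The data-driven switch point suggested in the first item could look roughly like the sketch below. It is an editorial illustration only: the threshold, the function names, and the idea of logging one cosine value per epoch are assumptions, and the paper itself uses fixed stage lengths.

```python
import numpy as np


def gradient_cosine(g_soft, g_hard, eps: float = 1e-12) -> float:
    """Cosine similarity between flattened soft-label and hard-label gradients."""
    a = np.ravel(np.asarray(g_soft, dtype=float))
    b = np.ravel(np.asarray(g_hard, dtype=float))
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)  # guard against zero gradients
    return float(a @ b / denom)


def first_switch_epoch(cosines, threshold: float):
    """First epoch whose logged cosine reaches the threshold, or None if none does."""
    for epoch, value in enumerate(cosines):
        if value >= threshold:
            return epoch
    return None
```

In use, the cosine would be logged once per epoch during the first soft stage and the hard stage would start at the epoch returned by `first_switch_epoch`. Whether this beats a fixed budget is untested.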


Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, and a circularity audit.

Desk Editor's Note


Quick read. The empirical core is real: inserting a hard-label stage mid-training consistently improves accuracy under low soft-label budgets across four generation methods, on both Tiny-ImageNet and ImageNet-1K. Table 7 is the right experiment — matched generation, matched hyperparameters, only the evaluation protocol changes — and it shows monotone gains that grow as SLC shrinks. That is the paper's contribution, and it holds up: a soft→hard→soft schedule as a calibration phase for low-SLC distillation, which I don't think GIFT, EDC, or LPLD covers. The storage story (28.3GB down to 285MB while gaining accuracy) is plausible and important for the subfield.

The soft spots are mostly in the packaging. Theorem 1 is a Paley–Zygmund lower bound on E|L_s − L_ideal| in terms of the variance of the per-crop loss. It does not use the prediction covariance Σ from Definition 1, and it says nothing about drift — the lower bound holds even with constant teacher labels, and if the loss variance is zero it gives no bound even when Σ is nonzero. So the sentence "By Theorem 1, LVSD induces a strictly positive lower bound" is not supported. Similarly, "irreducible finite-sample bias" is the wrong phrase: E[L_s] = L_ideal, so there is no bias, only finite-sample deviation. Theorem 2 is a standard ERM excess-risk bound with no soft labels in it. The theoretical contribution, as written, is a generic concentration bound wearing a semantic-drift costume.
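The "no bias, only deviation" point can be written out in a few lines. The derivation assumes the s crops are drawn i.i.d., and the per-crop loss symbol \ell_i is notation introduced here for illustration rather than taken from the paper.

```latex
\begin{align}
L_s &= \frac{1}{s}\sum_{i=1}^{s}\ell_i,
  \qquad \mathbb{E}[\ell_i] = L_{\text{ideal}},
  \qquad \operatorname{Var}(\ell_i) = \sigma^2, \\
\mathbb{E}[L_s] &= L_{\text{ideal}},
  \qquad \operatorname{Var}(L_s) = \frac{\sigma^2}{s}, \\
\mathbb{E}\,\bigl\lvert L_s - L_{\text{ideal}} \bigr\rvert
  &\le \sqrt{\operatorname{Var}(L_s)} = \frac{\sigma}{\sqrt{s}}
  \qquad \text{(Jensen's inequality)}.
\end{align}
```

The estimator is unbiased and its typical deviation is at most of order \sigma/\sqrt{s}. Every quantity here depends on the loss variance \sigma^2 and none on the prediction covariance \Sigma, which is why a lower bound of the same order cannot by itself be attributed to drift.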

Also, the headline figure (42.7%, +9.0 points over LPLD) compares HALD with FADRM generation against LPLD with its own generation. Tables 7 and 16 give the honest matched comparisons, and the matched gains are still positive and sometimes large, but the +9.0 headline is a compound comparison. And Tables 3 and 9 show schedule order matters a lot (soft-hard-soft 35.6 vs hard-soft 14.2), which is fine empirically but means the recipe depends on n_soft, n_hard, and α being set correctly; the monotone-alignment assumption in Sec 3.4 is asserted, not proved. That is a minor concern because the ablations are matched and consistent.

No fabrication red flags. The empirical claim that the schedule helps is credible even after discounting the theory. A serious referee should engage. The authors should be asked to relabel Theorem 1 as a finite-sample deviation bound, remove the drift claims it does not prove, add a matched-generation comparison table to the main text, and release code before publication. I would bring it to a reading group if anyone works on dataset distillation, and I would cite it for the training recipe, not the theory.

Referee Report

3 major / 6 minor

Summary. The paper proposes HALD, a soft-hard-soft training schedule for dataset distillation under limited soft-label storage. The authors argue that when only a few crops per image have stored teacher soft labels, those crop-level soft labels suffer from 'local-view semantic drift' (LVSD), defined as nonzero covariance of teacher predictions across crops. They claim two theoretical results: (i) Theorem 1 gives a strictly positive lower bound on the expected mismatch between finite-crop and ideal-crop training losses, and (ii) Theorem 2 gives an excess-risk lower bound for ERM under limited crops. They further give Theorem 3 on soft-hard gradient consistency and use it to justify inserting a hard-label calibration stage between two soft-label stages. Empirically, HALD is evaluated on Tiny-ImageNet and ImageNet-1K with four distillation generators (SRe2L, LPLD, RDED, FADRM), showing consistent gains over soft-only training in the controlled comparison of Table 7. The headline result is 42.7% top-1 accuracy on ImageNet-1K with 285M soft-label storage, +9.0 over LPLD at the same budget.

Significance. If the empirical results hold, HALD is a practically significant contribution: it makes large reductions in soft-label storage feasible while improving accuracy, and it adds to the evidence that hard labels are a useful complement rather than a crude fallback. The controlled ablation in Table 7—where generation, architecture, and hyperparameters are fixed and only the training protocol changes—is strong and should be credited; Table 4 similarly shows large gains under tight storage budgets. The theoretical parts, however, do not currently support the paper's stated claims. Theorem 1 bounds the variance of the per-crop loss, not the prediction covariance that defines LVSD, and Theorem 2 is a generic finite-sample ERM bound with no soft-label or crop-specific content. Because the abstract and introduction present these as a 'theoretical characterization of the emergence of drift,' the gap is load-bearing. The empirical contribution may still stand, but the paper's central narrative needs substantial revision.

major comments (3)

[§3.2, Definition 1 and Theorem 1] The central theoretical claim is not supported. Theorem 1 bounds E|L_s − L_ideal| in terms of σ² = Var(L[p̃, q_θ]), the variance of the per-crop loss, and its kurtosis. It never uses the covariance Σ = Cov[p̃(x^(crop))] from Definition 1. A nonzero Σ does not imply a positive lower bound: if the loss is constant across crops (e.g., q_θ is insensitive to the crop or the loss is flat in p), the RHS of Eq. (2) is zero even though LVSD holds. Conversely, the RHS can be positive with Σ=0 if q_θ varies while the teacher soft label is constant. Moreover, E[L_s] = L_ideal, so this is a finite-sample deviation, not a bias. The sentence 'By Theorem 1, LVSD ... induces a strictly positive lower bound' is therefore inaccurate, and the abstract's claim to 'theoretically analyze the emergence of drift' is not established by this theorem.
[§3.2, Theorem 2] Theorem 2 is a standard finite-sample ERM excess-risk lower bound under assumptions A1–A5. It contains no soft labels, no crops, and no covariance Σ of teacher predictions; the parameter s is simply the number of i.i.d. samples. The text states that 'under limited soft-label coverage (small s), L_s exhibits LVSD' and that Theorem 2 'formalizes this effect,' but the theorem never refers to Definition 1. As written, it is a generic concentration result and does not connect the lower bound (1/(2s)) tr(H⁻¹Σ⋆) to local semantic drift. The authors should either add an explicit link (e.g., lower-bound the gradient covariance Σ⋆ in terms of the teacher prediction covariance under the specific soft-label loss) or explicitly downgrade Theorem 2 to background ERM material.
[§3.4, Theorem 3 and monotone alignment] The derivation of the Soft–Hard–Soft schedule relies on the unproved claim that 'D tends to decrease faster than m0, leading to a monotonically decreasing ratio D/m0.' Theorem 3 itself only bounds the cosine similarity at a fixed time under a bounded D/m0; it does not establish monotonicity during Stage A. Figure 3 shows cosine similarity increasing empirically, but that is not a direct measurement of D/m0. The ablations in Table 9 show empirically that other schedules fail, but this is not the theoretical support claimed in §3.4. If monotone alignment is an assumption, it should be stated and tested directly; otherwise the schedule's optimality is an empirical finding, not a theorem-derived one.

minor comments (6)

[Abstract and Table 4] The units of '285M soft-label storage' are unclear: are these millions of stored scalar entries, bytes, or bits? The abstract says 'reduces by 100X' relative to the 28.33 GB full-soft-label storage; please express both quantities in the same units.
[Table 20 and §4.4] n_hard is set to 75 or 150 epochs depending on SLC, but no ablation over n_hard is reported. The text says it is 'aligned with the convergence time of soft-label-only training,' but that convergence time is not measured. Please report how these durations were selected and include a sensitivity analysis.
[§4.2, Table 3] GIFT is described in §2 as a method that 'fuses hard information into soft targets,' yet Table 3 shows GIFT performing approximately equal to Soft-Only. The discussion should reconcile this apparent contradiction with the related-work description.
[Appendix B.4, Corollary 1] The 'effective sample size' argument is a variance-reduction computation for gradient estimates, not an increase in the number of stored soft labels. The phrase 'increases the effective sample size from s to s_eff' may mislead readers into thinking s itself grows; please clarify that this is a variance-reduction interpretation.
[Figure 2] The heatmap labels in Figure 2 are too dense to read at print size. Consider showing fewer contour levels or enlarging the figure.
[General] The reproducibility statement says code 'will be released,' while the abstract and conclusion say code is available at a GitHub URL. Please reconcile these statements and, if the code is public, include a version or commit identifier.

Circularity Check

0 steps flagged · score 0.0 of 10

No significant circularity: theorems are generic concentration/ERM bounds, schedule hyperparameters are explicitly empirical, and self-citations are not load-bearing.

full rationale

I walked the claimed derivation chain. Lemma 1 is a direct covariance calculation. Theorem 1 is a standard Paley–Zygmund small-ball bound on the deviation of an empirical mean of per-crop losses from its expectation; it does not use Definition 1's prediction covariance, so the paper's statement that LVSD 'induces' the bound is a support gap, not a circular reduction. Theorem 2 is a standard ERM excess-risk lower bound under explicit assumptions (A1–A5), and Theorem 3 is a Lipschitz/mixing cosine-similarity bound; the monotonic decrease of D/m0 is an unproved assumption, not a restatement of the conclusion. Corollary 1 defines s_eff from the control-variate variance identity s_eff = s/(1−ρ^2); this is a definitional reparameterization, not a fitted parameter relabeled as a prediction, and the paper's main empirical claims are measured results, not derived from it. The self-citations (FADRM, etc.) are used as a generation method and background; matched-generation comparisons in Table 7 show HALD gains over Soft-Only on the same generator, so the self-citation is not load-bearing. Schedule hyperparameters (n_soft=200, hard epochs, α=0.8) are explicitly said to be determined empirically and validated by ablations (Tables 8, 10, 11, 20), not presented as theoretical predictions. Thus I find no step where a claimed derivation reduces by construction to its own inputs.
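The control-variate identity behind s_eff = s/(1−ρ²) can be checked numerically. The sketch below is a generic illustration with synthetic scalars: treating x as a soft-label quantity and y as a correlated hard-label quantity is an assumption made here, and the function names are hypothetical.

```python
import numpy as np


def control_variate_variance_ratio(x, y) -> float:
    """Variance of x after subtracting the optimal multiple of y, relative to Var(x).

    With coefficient c = Cov(x, y) / Var(y) this ratio equals 1 - rho^2,
    where rho is the sample correlation between x and y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c = np.cov(x, y, ddof=1)[0, 1] / np.var(y, ddof=1)
    adjusted = x - c * (y - y.mean())
    return float(np.var(adjusted, ddof=1) / np.var(x, ddof=1))


def effective_sample_size(s: float, rho: float) -> float:
    """Sample count that would give the same variance without the control variate."""
    return s / (1.0 - rho ** 2)
```

The stored label count s does not change. Only the variance of the estimate shrinks, which is the interpretation the referee asks the authors to make explicit.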

Assumptions & free parameters 4 free parameters · 4 assumptions · 0 invented entities

The main empirical claim rests on a handful of validated hyperparameters (α=0.8, phase lengths) and on assumptions (D/m0 decreasing, teacher-optimality, regularity A1–A5) that are plausible but not proven for the actual datasets. The theory does not introduce new physical entities; 'local semantic drift' is a named statistical phenomenon, not an entity.

free parameters (4)

Label smoothing rate α = 0.8
Chosen by validation sweep (Table 11); used in Stage B hard-label target and affects all main results.
Soft-label phase length n_soft = 200 epochs
Table 8 reports 'soft-label convergence length = 200'; schedule sets TA=TC=100. Chosen empirically.
Hard-label phase duration n_hard = 75 or 150 epochs on ImageNet-1K by SLC; 50 on Tiny-ImageNet
Appendix E.4 Table 20 sets durations per SLC 'aligned with the convergence time of soft-label-only training'; no theory predicts these values.
Storage bit-width b of soft labels = implied ~2 bytes/scalar from tables, never stated
Tables 1/2 quote MB per SLC from which b can be inferred but not verified; the abstract's 100X storage reduction depends on this undocumented constant.

assumptions (4)

standard math · A1–A5 in Theorem 2: regularity, concentration, and local-ERM assumptions (unbiased scores, Hessian Lipschitz, local uniform concentration, ERM stays local, bounded loss).
These are standard statistical assumptions but are not verified for the actual distilled datasets; Theorem 2's 1/s lower bound holds only if they do.
domain assumption · The teacher's full-coverage soft-label optimum θ̂⋆ is assumed to achieve the best attainable generalization.
Stated in Sec 3.2 before defining excess loss; if teacher soft labels carry noise or bias, deviating from them is not necessarily harmful.
ad hoc to paper · D/m0 decreases monotonically during training, so soft and hard gradient directions align more over time.
Asserted in Sec 3.4 following Theorem 3 ('D tends to decrease faster than m0'); no proof, and it is the basis for choosing soft→hard→soft order.
domain assumption · Student can fit finite-s soft-label pool to ERM within n_soft epochs.
Used in Sec 3.3 to allocate TA/TB/TC; in practice validated by Table 8, not derived.


Cite this review

Pith. "Pith review of Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift." pith.science (2026). https://pith.science/paper/PANFMDXA

@misc{pith2026251215647,
  author       = {Pith},
  title        = {Pith review of: Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift},
  year         = {2026},
  howpublished = {\url{https://pith.science/paper/PANFMDXA}},
  note         = {Machine review of arXiv:2512.15647}
}

Original abstract

Soft labels from teacher models are a de facto practice for knowledge transfer and large-scale dataset distillation (e.g., SRe2L, LPLD). However, when we limit the number of crops per image to reduce the substantial cost of storing precomputed soft labels, these methods suffer severely from local semantic drift: visually ambiguous crops can cause soft supervision to deviate from the image-level ground-truth semantics, leading to persistent errors and a train-test distribution mismatch. We revisit the overlooked role of hard labels and show that, when properly integrated, they can act as a content-invariant semantic anchor that calibrates such drift. We theoretically analyze the emergence of drift under sparse soft-label supervision and demonstrate that hybridizing hard and soft labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which uses hard labels as intermediate corrective signals while preserving the fine-grained benefits of soft labels. Extensive experiments on dataset distillation and large-scale classification benchmarks show consistent generalization improvements. On ImageNet-1K, our method achieves 42.7% accuracy with only 285M soft-label storage (reduces by 100X), outperforming prior state-of-the-art LPLD 9.0%.

Figures

Figures reproduced from arXiv: 2512.15647 by the authors.

**Figure 1.** Illustration of local-view semantic drift: partial crops may change object–label relations, yielding semantics that deviate from the full image.

**Figure 2.** Train and test loss landscapes on an IPC=10 distilled dataset with SLC=50, comparing (i) finite soft-label coverage and (ii) our method.

**Figure 3.** Gradient similarity between hard- and soft-label losses over training, evaluated on real


Reference graph

Works this paper leans on

15 extracted references · 14 linked inside Pith

[2] Jiacheng Cui, Xinyue Bi, Yaxin Luo, Xiaohan Zhao, Jiacheng Liu, and Zhiqiang Shen. Fadrm: Fast and accurate data residual matching for dataset distillation. arXiv preprint arXiv:2506.24125, 2025a.

[6] Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Efficient dataset distillation using random feature approximation. arXiv preprint arXiv:2210.12067.

[9] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959.

[10] Eric Xue, Yijiang Li, Haoyang Liu, Yifan Shen, and Haohan Wang. Towards adversarially robust dataset distillation by curvature regularization. arXiv preprint arXiv:2403.10045.

[11] Ruonan Yu, Songhua Liu, Zigeng Chen, Jingwen Ye, and Xinchao Wang. Heavy labels out! dataset distillation with label space lightening. arXiv preprint arXiv:2408.08201.

[13] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929.

[14] Binglin Zhou, Linhao Zhong, and Wentao Chen. Improve cross-architecture generalization on dataset distillation. arXiv preprint arXiv:2402.13007.

[15] Yawen Zou, Guang Li, Duo Su, Zi Wang, Jun Yu, and Chao Zhang. Dataset distillation via vision-language category prototype. arXiv preprint arXiv:2506.23580.


[1991] Haoxuan Wang, Zhenghao Zhao, Junyi Wu, Yuzhang Shang, Gaowen Liu, and Yan Yan. CaO2: Rectifying inconsistencies in diffusion-based dataset distillation. arXiv preprint arXiv:2506.22637.

[2009] Zhiwei Deng and Olga Russakovsky. Remember the past: Distilling datasets into addressable memories for neural networks. arXiv preprint arXiv:2206.02916.

[2021] Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023, Waikoloa, HI, USA, January 2-7, 2023.

[2022] Ping Liu and Jiawei Du. The evolution of dataset distillation: Toward scalable and generalizable solutions. arXiv preprint arXiv:2502.05673.

[2023] Xinyi Shang, Peng Sun, and Tao Lin. Gift: Unlocking full potential of labels in distilled dataset at near-zero cost. arXiv preprint arXiv:2405.14736.

[2024] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

[2025] Mingyang Chen, Bo Huang, Junda Lu, Bing Li, Yi Wang, Minhao Cheng, and Wei Wang. Dataset distillation via adversarial prediction matching. arXiv preprint arXiv:2312.08912.

