Diffusion Models as Dataset Distillation Priors

Duo Su; Huanran Chen; Huyu Wu; Jun Zhu; Xi Ye; Yiming Shi; Yuzhu Wang

arxiv: 2510.17421 · v2 · submitted 2025-10-20 · 💻 cs.LG

Pith reviewed 2026-05-18 06:00 UTC · model grok-4.3

keywords: dataset distillation, diffusion models, Mercer kernel, representativeness prior, reverse diffusion, synthetic data, cross-architecture generalization

The pith

Diffusion models carry an inherent representativeness prior that a Mercer kernel can extract to guide the reverse diffusion process and improve dataset distillation without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dataset distillation tries to turn large real datasets into small synthetic ones that still support good model training. The paper argues that diffusion models already embed a prior for making synthetic samples representative of real ones, and that earlier generative distillation methods overlooked it. It formalizes this prior as the similarity between synthetic and real data points, measured by a Mercer kernel in feature space, and injects the resulting signal to steer the denoising steps. The method needs no extra training of the diffusion model, yet the paper reports higher-fidelity synthetic data and stronger generalization across neural-network architectures on large-scale datasets such as ImageNet-1K.

Core claim

DAP formalizes representativeness by quantifying the similarity between synthetic and real data in feature space using a Mercer kernel and introduces this prior as guidance to steer the reverse diffusion process, enhancing the representativeness of distilled samples without any retraining. This establishes a theoretical connection between diffusion priors and the objectives of dataset distillation while providing a practical, training-free framework for improving the quality of the distilled dataset.
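
One way to read this claim is as a Bayes-rule split of the guided score. The sketch below assumes that representativeness is modeled as an event R with an energy-based likelihood; the symbols p(R|x), γ, N, d and ϕ follow the paper passage quoted in the Lean section further down this page, while the exponential form and the indexing of the real samples are assumptions made here for illustration.

```latex
\begin{align}
\nabla_{x} \log p(x \mid R)
  &= \nabla_{x} \log p(x) + \nabla_{x} \log p(R \mid x), \\
p(R \mid x)
  &\propto \exp\!\Big(-\frac{\gamma}{N} \sum_{i=1}^{N}
     d\big(\phi(x), \phi(x^{\mathrm{train}}_{i})\big)\Big)
  \quad \text{(assumed energy form)}, \\
\nabla_{x} \log p(R \mid x)
  &= -\frac{\gamma}{N} \sum_{i=1}^{N}
     \nabla_{x}\, d\big(\phi(x), \phi(x^{\mathrm{train}}_{i})\big).
\end{align}
```

The first line holds for any event R because log p(R) does not depend on x. The first term is what a pre-trained diffusion model already supplies, which is why only the second term needs to be added at sampling time.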

What carries the argument

The DAP guidance term, which computes Mercer-kernel similarity in feature space and adds it as a steering signal during reverse diffusion to enforce representativeness in the generated samples.
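
A minimal numerical sketch of such a steering signal follows. It is not the paper's implementation: the feature map is a hypothetical linear map `W`, the kernel is linear (so the induced distance is a Euclidean norm in feature space), the score of a standard Gaussian stands in for the pre-trained diffusion model, and the update is a noise-free gradient step rather than a real reverse-diffusion sampler. All function names and numeric values are illustrative assumptions.

```python
import numpy as np

def mean_feature_distance(x, x_train, W):
    """Mean linear-kernel-induced distance ||W x - W x_i|| over the real samples."""
    diff = x @ W.T - x_train @ W.T                 # shape (N, F)
    return float(np.linalg.norm(diff, axis=1).mean())

def representativeness_grad(x, x_train, W, eps=1e-12):
    """Analytic gradient of mean_feature_distance with respect to x."""
    diff = x @ W.T - x_train @ W.T                 # shape (N, F)
    dist = np.linalg.norm(diff, axis=1, keepdims=True) + eps
    return ((diff / dist) @ W).mean(axis=0)        # shape (D,)

def guided_sampling(x0, x_train, W, gamma, score_fn, step=0.1, n_steps=100):
    """Iterate x <- x + step * (score(x) - gamma * grad of the mean distance)."""
    x = x0.copy()
    for _ in range(n_steps):
        guidance = representativeness_grad(x, x_train, W)
        x = x + step * (score_fn(x) - gamma * guidance)
    return x

# Toy setup: every value below is an illustrative assumption.
W = np.eye(3, 4)                                   # hypothetical feature map phi(x) = W x
x_train = 2.0 + 0.1 * np.arange(32).reshape(8, 4) / 32.0
toy_score = lambda x: -x                           # standard Gaussian score, stand-in prior
x0 = np.zeros(4)

x_plain = guided_sampling(x0, x_train, W, gamma=0.0, score_fn=toy_score)
x_guided = guided_sampling(x0, x_train, W, gamma=1.0, score_fn=toy_score)
print(mean_feature_distance(x_plain, x_train, W),
      mean_feature_distance(x_guided, x_train, W))
```

In this toy setting the guided sample ends up closer to the real samples in feature space than the unguided one, while the prior term keeps it from collapsing onto them. That balance is the trade-off the guidance scale γ controls.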

If this is right

Distilled datasets achieve higher fidelity on ImageNet-1K and its subsets than existing methods.
The synthetic data exhibits superior generalization when used to train models of different architectures.
No retraining of the underlying diffusion model is needed to apply the improvement.
A direct theoretical link is drawn between the generative prior in diffusion models and the goals of dataset distillation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same kernel-based prior extraction might be adapted to other generative models to boost their use in data synthesis tasks.
Because the method is training-free it could allow rapid testing of new distillation objectives on existing diffusion checkpoints.
The approach hints that diffusion sampling trajectories can be lightly modified to enforce additional dataset properties such as explicit diversity control.

Load-bearing premise

That similarity measured by a Mercer kernel in feature space correctly captures the representativeness prior of diffusion models and that adding this guidance during the reverse process improves distilled data quality without harming diversity or generalization.

What would settle it

Train models on datasets distilled with and without the Mercer-kernel guidance on an ImageNet subset, then measure both downstream accuracy on real test data and cross-architecture transfer; consistent gains only with the guidance would support the claim, while equal or worse results would falsify it.
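
The decision rule in that test can be sketched as a small helper. The function name, the `margin` parameter, the architecture label `arch_b`, and every number below are made-up placeholders for illustration, not results from the paper.

```python
from statistics import mean

def guidance_verdict(acc_with, acc_without, margin=0.0):
    """Compare mean test accuracy per architecture, with and without guidance.

    Both arguments map an architecture name to a list of accuracies
    (for example, one per random seed). The claim counts as supported
    only if the guided data wins on every architecture by more than margin.
    """
    gains = {}
    for arch in acc_with:
        gains[arch] = mean(acc_with[arch]) - mean(acc_without[arch])
    supported = all(gain > margin for gain in gains.values())
    return gains, supported

# Placeholder accuracies (NOT measurements): two seeds per architecture.
acc_with = {"resnet18": [0.50, 0.52], "arch_b": [0.40, 0.44]}
acc_without = {"resnet18": [0.45, 0.47], "arch_b": [0.41, 0.45]}

gains, supported = guidance_verdict(acc_with, acc_without)
print(gains, supported)
```

With these placeholder numbers the guided data wins on one architecture and loses on the other, so the helper reports the claim as unsupported. Requiring a gain on every architecture is one strict reading of "consistent gains"; a statistical test across seeds would be a reasonable alternative.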

Figures

Figures reproduced from arXiv: 2510.17421 by Duo Su, Huanran Chen, Huyu Wu, Jun Zhu, Xi Ye, Yiming Shi, Yuzhu Wang.

**Figure 1.** Our diffusion as priors (DAP) method is beneficial for the DD task. Diversity: 1 + FID_max − FID. Representativeness: 1/d(ϕ(x), ϕ(y)). Performance: classification results on ImageNet-1K. We propose Diffusion As Priors (DAP) and apply it to datasets of varying scales, including large-scale ImageNet-1K (Deng et al., 2009) and its small subsets. Both quantitative and qualitative results show that DAP significantl…

**Figure 2.** Visualization of average representativeness (∝ 1/d(ϕ(x), ϕ(y))) of distilled samples (IPC10). As γ increases, the representativeness (sector area) gets larger, yielding better DD performance. gradient field of diversity and generalization (∇_x log p(x)) is determined and fixed by pre-trained DMs. Therefore, the gradient field of representativeness cannot be increased indefinitely, otherwise the other priors…

**Figure 3.** The comparison results on Stable Diffusion. The results are evaluated with both hard-label (HL) and soft-label (SL) protocols based on ResNet-18. The results of SL protocol are marked with a light blue background, while those without background color are from HL protocol.

**Figure 4.** Visualization results of t-SNE. We compare the feature distribution of real (training and …

**Figure 5.** Ablation studies under ResNet-18. (a-b) Top-1 Accuracy under different backbone layer selection. (c-d) Top-1 Accuracy under varied guidance scale γ. 5 CONCLUSION This paper introduces Diffusion as Priors, a framework for dataset distillation that leverages the inherent priors of diffusion models. We identify diversity, generalization, and representativeness priors in diffusion models, and demonstrate how …

**Figure 6.** A sketch map of the relationship between …

**Figure 7.** Curves of different Mercer kernel-induced distances

**Figure 8.** Ablation study on t_stop selection. A.4.6 SAMPLING-TIME SCALING DAP does not introduce additional training costs, since no external pre-training or fine-tuning is required. The representativeness prior is directly derived from the pre-trained diffusion backbone. However, to inject this prior during sampling and improve data quality, we must extract features from the noisy training data x_t^train using the b…

**Figure 9.** Samples distilled by DiT (left three columns) and SD (right three columns). The excessive …

**Figure 10.** Visualization results of different DD methods. At the bottom of each group, we use the …

Original abstract

Dataset distillation aims to synthesize compact yet informative datasets from large ones. A significant challenge in this field is achieving a trifecta of diversity, generalization, and representativeness in a single distilled dataset. Although recent generative dataset distillation methods adopt powerful diffusion models as their foundation models, the inherent representativeness prior in diffusion models is overlooked. Consequently, these approaches often necessitate the integration of external constraints to enhance data quality. To address this, we propose Diffusion As Priors (DAP), which formalizes representativeness by quantifying the similarity between synthetic and real data in feature space using a Mercer kernel. We then introduce this prior as guidance to steer the reverse diffusion process, enhancing the representativeness of distilled samples without any retraining. Extensive experiments on large-scale datasets, such as ImageNet-1K and its subsets, demonstrate that DAP outperforms state-of-the-art methods in generating high-fidelity datasets while achieving superior cross-architecture generalization. Our work not only establishes a theoretical connection between diffusion priors and the objectives of dataset distillation but also provides a practical, training-free framework for improving the quality of the distilled dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DAP adds a Mercer kernel similarity term as guidance in diffusion sampling to boost representativeness in dataset distillation without retraining, with reported gains on ImageNet-1K.

The main point is that this work takes a pre-trained diffusion model and steers its reverse process with a guidance term based on Mercer kernel similarity between synthetic samples and real data in feature space. This is meant to capture representativeness directly and avoid the external constraints or retraining that other generative distillation methods often need. The formalization of that similarity measure and its injection as guidance looks like the concrete new piece relative to earlier diffusion-based distillation papers.

They test the approach on ImageNet-1K and subsets, claiming higher fidelity distilled sets and stronger cross-architecture generalization than prior methods. That practical, training-free framing is the part that could interest people building compact training data for vision models. The results section apparently shows outperformance, which gives the method some empirical weight even if the numbers are not in the abstract.

The softer spot is the claim that the kernel term reflects an inherent prior already inside the diffusion model. If the feature extractor is independent and the guidance is added heuristically rather than derived from the diffusion score or reverse SDE, it functions more like a standard external regularizer on top of black-box sampling. The paper would be tighter if it clarified that link or showed why this particular similarity measure emerges naturally from diffusion rather than being chosen for the distillation objective. Ablations on the kernel choice and failure cases would also help judge robustness.

This is for the dataset distillation and generative data synthesis community. Readers working on large-scale vision data reduction would get usable ideas from the method and the ImageNet-scale experiments. It has enough substance and a clear experimental setup to deserve peer review, though referees will likely press on the prior connection and ask for more controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes Diffusion As Priors (DAP) for dataset distillation. It formalizes representativeness by quantifying similarity between synthetic and real data in feature space via a Mercer kernel, then uses this quantity as guidance to steer the reverse diffusion process of a pre-trained model. The approach is training-free and is claimed to improve fidelity and cross-architecture generalization on ImageNet-1K and subsets relative to prior generative distillation methods, while establishing a theoretical link between diffusion priors and distillation objectives.

Significance. If the Mercer-kernel guidance is shown to encode an inherent diffusion prior (rather than an external regularizer) and the reported gains are robust, the work would supply a principled, retraining-free mechanism for enhancing representativeness in generative dataset distillation. This could influence downstream efficiency in large-scale training pipelines and strengthen connections between score-based generative models and data-synthesis objectives.

major comments (2)

[§3] §3 (guidance term derivation): the manuscript must demonstrate that the Mercer-kernel similarity is derived from the diffusion model's score function or reverse SDE rather than introduced as an independent external penalty. If the feature extractor is separate from the diffusion backbone, the construction risks reducing to heuristic classifier-free guidance plus a similarity term, undermining the claimed 'inherent representativeness prior' and the theoretical connection asserted in the abstract.
[§4] §4 (experiments): the central claim of outperformance and improved generalization rests on quantitative results that are not visible in the abstract; the full paper must include ablations on guidance strength, diversity metrics, and failure cases. Without these, it is impossible to verify whether the prior improves representativeness without trading off other desiderata.

minor comments (2)

[Abstract] Abstract: include at least one concrete performance number (e.g., top-1 accuracy or distillation ratio) to support the outperformance claim.
[Notation] Notation: define the Mercer kernel, feature extractor, and guidance coefficient explicitly with equation numbers to prevent ambiguity in how the similarity term is computed and scaled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and outline the revisions we will make to the manuscript.

Referee: [§3] §3 (guidance term derivation): the manuscript must demonstrate that the Mercer-kernel similarity is derived from the diffusion model's score function or reverse SDE rather than introduced as an independent external penalty. If the feature extractor is separate from the diffusion backbone, the construction risks reducing to heuristic classifier-free guidance plus a similarity term, undermining the claimed 'inherent representativeness prior' and the theoretical connection asserted in the abstract.

Authors: We appreciate this insightful comment. In the original manuscript, the guidance term is introduced by modifying the reverse diffusion process to incorporate the representativeness prior, which is quantified using the Mercer kernel on features extracted from real and synthetic data. This is not merely an external penalty but is integrated into the sampling trajectory of the pre-trained diffusion model, thereby leveraging its inherent prior. The feature extractor is a separate component used to define the similarity measure in a semantically meaningful space, similar to how classifier guidance uses an external classifier. To strengthen the theoretical connection, we will revise §3 to include a step-by-step derivation showing how this guidance arises from adjusting the score function in the reverse SDE to favor representative samples. This will clarify that it encodes the diffusion prior rather than acting as a standalone regularizer. revision: yes
Referee: [§4] §4 (experiments): the central claim of outperformance and improved generalization rests on quantitative results that are not visible in the abstract; the full paper must include ablations on guidance strength, diversity metrics, and failure cases. Without these, it is impossible to verify whether the prior improves representativeness without trading off other desiderata.

Authors: Thank you for pointing this out. While the abstract summarizes the main results, the full manuscript in §4 presents quantitative comparisons on ImageNet-1K and subsets, showing improvements in fidelity and cross-architecture generalization. To address the request for more comprehensive analysis, we will add ablations on the guidance strength parameter, including its effect on representativeness and other metrics. We will also report diversity metrics (e.g., pairwise similarity or coverage) and discuss potential failure cases, such as when the prior overly constrains diversity. These additions will be included in the revised §4 to provide a more complete verification of the method's benefits. revision: yes

Circularity Check

0 steps flagged

Representativeness prior is defined as Mercer-kernel feature similarity then re-introduced as guidance to improve that same similarity

The paper's core move is to define the 'inherent representativeness prior' of diffusion models explicitly as a Mercer-kernel similarity between synthetic and real features, then add that quantity as a guidance term during reverse diffusion. This construction is self-contained and does not reduce the final distilled dataset to a fitted parameter or self-citation chain; the guidance is an external regularizer applied to a pre-trained diffusion model. No equations are shown that make the guidance term mathematically identical to the distillation objective by definition, and no load-bearing self-citation is invoked to justify uniqueness. The derivation therefore remains non-circular, though the claim that the kernel similarity is 'inherent' to the diffusion model rather than an added heuristic is a separate correctness question.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that feature-space Mercer kernel similarity faithfully encodes the diffusion model's representativeness prior and that this quantity can be injected as guidance without side effects. No free parameters or invented entities are explicitly named in the abstract.

axioms (1)

domain assumption: Mercer kernel similarity in feature space accurately quantifies representativeness between synthetic and real data.
Invoked when the paper states it formalizes representativeness by quantifying similarity using a Mercer kernel.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We formalize representativeness by quantifying the similarity between synthetic and real data in feature space using a Mercer kernel... energy function based on Mercer kernel... ∇_x log p(R|x) ∝ −γ · (1/N) Σ ∇ d(ϕ(x_syn), ϕ(x_train))

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

Theorem 3.1: Let K be a PSD kernel. Then the K-induced distance D_K(x, y) = [K(x, x) + K(y, y) − 2K(x, y)]^{1/2} satisfies non-negativity, symmetry, triangle inequality.
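
The three properties in Theorem 3.1 can be spot-checked numerically. The sketch below uses a Gaussian (RBF) kernel, which is positive semi-definite; the choice of kernel, the bandwidth `sigma`, the random test points, and the function names are assumptions made here, and a finite check of this kind illustrates the theorem rather than proving it.

```python
import itertools
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel, a standard example of a PSD kernel."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def kernel_distance(x, y, kernel=rbf_kernel):
    """K-induced distance [K(x,x) + K(y,y) - 2K(x,y)]^(1/2)."""
    sq = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return float(np.sqrt(max(sq, 0.0)))            # clamp tiny negative round-off

def satisfies_metric_axioms(points, kernel=rbf_kernel, tol=1e-9):
    """Check non-negativity, symmetry and the triangle inequality on a point set."""
    n = len(points)
    D = np.array([[kernel_distance(p, q, kernel) for q in points] for p in points])
    if (D < 0).any() or not np.allclose(D, D.T):
        return False
    for i, j, k in itertools.product(range(n), repeat=3):
        if D[i, k] > D[i, j] + D[j, k] + tol:
            return False
    return True

rng = np.random.default_rng(0)
points = rng.normal(size=(12, 5))                  # arbitrary test points
print(satisfies_metric_axioms(points))
```

The check passes for any PSD kernel because the induced distance equals a norm of feature differences in the associated Hilbert space, which is the argument the paper's appendix proof follows.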

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[2] Jeffrey A Chan-Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, and Mubarak Shah. Mgd^3: Mode-guided dataset distillation using diffusion models. arXiv preprint arXiv:2505.18963, 2025.

[3] Available at https://arxiv.org/abs/2205.03257

Jeremy Howard. Imagenette: A smaller subset of 10 easily classified classes from imagenet, March 2019a. URL https://github.com/fastai/imagenette. Jeremy Howard. Imagewoof: a subset of 10 classes from imagenet that aren't so easy to classify, March 2019b. URL https://github.com/fastai/imagenette#imagewoof. James Jordon, Lukasz Szpr...

[4] Ping Liu and Jiawei Du. The evolution of dataset distillation: Toward scalable and generalizable solutions. arXiv preprint arXiv:2502.05673, 2025.

[5] Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R Fung, et al. Scaling laws of synthetic data for language models. arXiv preprint arXiv:2503.19551, 2025.

[6] Kai Wang, Jianyang Gu, Daquan Zhou, Zheng Zhu, Wei Jiang, and Yang You. DiM: Distilling dataset into generative model. arXiv preprint arXiv:2303.04707, 2023.

[7] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.

[8] Xinhao Zhong, Hao Fang, Bin Chen, Xulin Gu, Tao Dai, Meikang Qiu, and Shu-Tao Xia. Hierarchical features matter: A deep exploration of GAN priors for improved dataset distillation. arXiv preprint arXiv:2406.05704, 2024.

[9] imitation

A APPENDIX Appendix organization: Section A.1: Background A.1.1: Dataset distillation A.1.2: Generative dataset distillation Section A.2: Proofs A.2.1: Validity of kernel-induced distance A.2.2: Distance factorization Section A.3: Experimental Setup A.3.1: Datasets and benchmarks A.3.2: Models and evaluation protocols A.3.3: Oth...

[10] enhances cross-architecture generalization by distilling data into the latent space of pre-trained models like StyleGAN (Karras et al., 2019). H-PD (Zhong et al.,

[11] • Symmetry: ∥ϕ(x) − ϕ(y)∥ = ∥ϕ(y) − ϕ(x)∥

there exists a reproducing kernel Hilbert space H and a feature map ϕ: X → H such that
K(x, y) = ⟨ϕ(x), ϕ(y)⟩_H. (10)
Therefore,
D_K(x, y)^2 = K(x, x) + K(y, y) − 2K(x, y) (11)
= ⟨ϕ(x), ϕ(x)⟩_H + ⟨ϕ(y), ϕ(y)⟩_H − 2⟨ϕ(x), ϕ(y)⟩_H (12)
= ∥ϕ(x) − ϕ(y)∥_H^2. (13)
Thus,
D_K(x, y) = ∥ϕ(x) − ϕ(y)∥_H. (14)
Since the norm in Hilbert space ∥·∥_H is a valid metric, it satisfies: • Non-neg...

[12] to evaluate performance. A.3.2 MODELS AND EVALUATION PROTOCOLS For each dataset, we distill subsets of 10, 50, and 100 images per class (IPC) and assess their utility on downstream classification tasks. Two evaluation protocols are adopted: • Hard-label protocol: Following Chen et al. (2025), we directly train classifiers from scratch using the distilled ...
