Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models

Bincheng Peng; Guang Li; Miki Haseyama; Ping Liu; Takahiro Ogawa

arXiv: 2605.07194 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI · cs.LG

Pith reviewed 2026-05-11 02:30 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords dataset distillation · linear probing · pre-trained vision models · closed-form optimization · bilevel optimization · transfer learning · synthetic data · kernel ridge regression

The pith

A closed-form kernel ridge solver for linear probes enables dataset distillation that matches slower iterative methods at far lower cost for pre-trained vision models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that linear probing on frozen pre-trained features admits an exact closed-form solution, so dataset distillation need not rely on unrolling iterative updates or infinite-width kernel approximations. Instead, synthetic images are optimized in a bilevel setup where the inner problem solves for the optimal linear classifier induced by the synthetic set, and the outer problem uses that classifier's predictions on real features to drive a discriminative loss. A reader would care because most current visual transfer learning follows exactly this frozen-encoder plus linear-head pattern, making efficient creation of small synthetic sets directly relevant to practical pipelines. The outer loss must be chosen carefully; a standard regression loss falls short while the classification-style loss recovers most of the performance of prior methods.

Core claim

The central claim is that the linear probe induced by any candidate synthetic set can be obtained exactly via sample-space kernel ridge regression on the pre-trained features, after which the synthetic images are updated by evaluating the resulting classifier columns (treated as learned class anchors) on real data features through a temperature-scaled softmax cross-entropy objective. This formulation removes both the need for inner-loop trajectory simulation and the need for neural-tangent-kernel approximations, directly yielding synthetic sets whose induced probes perform well on the original feature distribution.
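
As a reading aid, the two levels described above can be written out explicitly. This is a sketch under assumed notation, not the paper's own equations: $F_S \in \mathbb{R}^{m \times d}$ holds the frozen-encoder features of the $m$ synthetic images, $Y_S \in \{0,1\}^{m \times C}$ their one-hot labels, $\lambda > 0$ a ridge weight, $\tau > 0$ the temperature, $(f_i, y_i)$ the real feature and label pairs, and $w^\star_c$ the $c$-th column of $W^\star$. A linear kernel on the features is assumed.

```latex
\begin{aligned}
W^\star(S) &= \arg\min_{W}\; \|F_S W - Y_S\|_F^2 + \lambda \|W\|_F^2
            \;=\; F_S^\top \bigl(F_S F_S^\top + \lambda I_m\bigr)^{-1} Y_S, \\
\mathcal{L}_{\mathrm{outer}}(S) &= -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\bigl(f_i^\top w^\star_{y_i} / \tau\bigr)}
            {\sum_{c=1}^{C} \exp\bigl(f_i^\top w^\star_{c} / \tau\bigr)}.
\end{aligned}
```

The first line only requires solving an $m \times m$ system, which is small when the synthetic set has few images per class; the second line depends on the synthetic images solely through $W^\star(S)$.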

What carries the argument

Bilevel optimization whose inner problem is solved exactly by a sample-space kernel ridge regression that produces the linear classifier induced by the synthetic set, with the outer loss then measuring how well those classifier columns separate real features.
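
The forward pass of this bilevel setup can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names `induced_probe` and `outer_loss`, the linear kernel, the ridge weight, the temperature value, and the toy Gaussian features standing in for encoder outputs are all assumptions.

```python
import numpy as np

def induced_probe(feat_syn, labels_onehot, ridge=1e-3):
    """Closed-form ridge probe solved in sample space (linear kernel assumed).

    feat_syn: (m, d) features of the m synthetic images
    labels_onehot: (m, c) one-hot labels
    returns: (d, c) weight matrix, one column per class
    """
    gram = feat_syn @ feat_syn.T                                   # (m, m) kernel matrix
    alpha = np.linalg.solve(gram + ridge * np.eye(len(gram)), labels_onehot)
    return feat_syn.T @ alpha

def outer_loss(weights, feat_real, labels_real, temperature=0.1):
    """Temperature-scaled softmax cross-entropy of the probe on real features."""
    logits = feat_real @ weights / temperature                     # (n, c)
    logits = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels_real)), labels_real].mean())

# Toy stand-ins for frozen-encoder features (hypothetical data).
rng = np.random.default_rng(0)
num_classes, dim = 3, 16
class_means = rng.normal(size=(num_classes, dim))
labels_real = np.repeat(np.arange(num_classes), 50)
feat_real = class_means[labels_real] + 0.1 * rng.normal(size=(150, dim))
feat_syn = class_means + 0.1 * rng.normal(size=(num_classes, dim))  # one per class
labels_syn = np.eye(num_classes)

W = induced_probe(feat_syn, labels_syn)
loss = outer_loss(W, feat_real, labels_real)
accuracy = float((np.argmax(feat_real @ W, axis=1) == labels_real).mean())
print(f"outer loss = {loss:.4f}, probe accuracy = {accuracy:.3f}")
```

In an actual distillation loop the synthetic features would come from the frozen encoder applied to learnable images, and an autodiff framework would backpropagate the outer loss through the linear solve and the encoder; the NumPy version only shows the quantities involved.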

If this is right

The method substantially raises linear-probe accuracy over non-trajectory baselines across multiple pre-trained encoders on ImageNet-100.
It reaches performance levels close to trajectory-based distillation while using only a small fraction of the computation.
On the full ImageNet-1K scale it matches or exceeds trajectory-based results on three of four tested backbones.
Overall runtime is reduced by a factor of roughly fourteen and peak GPU memory drops below one-eighth of prior requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same closed-form inner solver could be swapped into distillation pipelines for any downstream head whose optimal parameters admit an analytic expression, such as ridge regression. Heads without a closed-form minimizer, such as logistic regression, would not qualify directly.
Because the approach never simulates training trajectories, it may scale to distillation budgets that are currently intractable for iterative methods.
The emphasis on a discriminative outer loss suggests that future work could explore other outer objectives that directly optimize class separation in feature space rather than regression to one-hot targets.

Load-bearing premise

That the closed-form linear probe computed from synthetic features and evaluated on real features accurately reflects the downstream utility that would be obtained by actually training a linear head on the synthetic set.

What would settle it

Train linear probes from scratch on the distilled synthetic sets and measure their accuracy on held-out real data. If this accuracy falls substantially below that of the closed-form probe on the same data, or below the accuracy achieved by trajectory-based distilled sets on the same backbones, the central claim is falsified.
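
A minimal sketch of such a check, on toy data rather than distilled images: it fits a ridge probe in closed form and a softmax probe by plain gradient descent on the same synthetic features, then compares held-out accuracy. The function names, the step size, the number of steps, and the Gaussian toy features are assumptions made for illustration; a real test would use encoder features of the distilled sets and the paper's evaluation protocol.

```python
import numpy as np

def ridge_probe(feat, onehot, ridge=1e-3):
    """Closed-form ridge probe via the sample-space (m x m) system."""
    gram = feat @ feat.T
    return feat.T @ np.linalg.solve(gram + ridge * np.eye(len(gram)), onehot)

def softmax_probe(feat, onehot, steps=500, lr=0.1):
    """Linear probe trained by gradient descent on softmax cross-entropy (no bias)."""
    weights = np.zeros((feat.shape[1], onehot.shape[1]))
    for _ in range(steps):
        logits = feat @ weights
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        weights -= lr * feat.T @ (probs - onehot) / len(feat)  # cross-entropy gradient
    return weights

def probe_accuracy(weights, feat, labels):
    return float((np.argmax(feat @ weights, axis=1) == labels).mean())

# Hypothetical stand-ins: one "synthetic" feature per class, 100 held-out "real" per class.
rng = np.random.default_rng(1)
num_classes, dim = 3, 16
class_means = rng.normal(size=(num_classes, dim))
feat_syn = class_means + 0.05 * rng.normal(size=(num_classes, dim))
onehot_syn = np.eye(num_classes)
labels_test = np.repeat(np.arange(num_classes), 100)
feat_test = class_means[labels_test] + 0.05 * rng.normal(size=(300, dim))

acc_ridge = probe_accuracy(ridge_probe(feat_syn, onehot_syn), feat_test, labels_test)
acc_trained = probe_accuracy(softmax_probe(feat_syn, onehot_syn), feat_test, labels_test)
print(f"ridge probe: {acc_ridge:.3f}  gradient-trained probe: {acc_trained:.3f}")
```

A large gap between the two numbers on real distilled sets would indicate that the ridge probe is a poor proxy for conventional linear probing; a small gap would support the premise.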

Figures

Figures reproduced from arXiv: 2605.07194 by Bincheng Peng, Guang Li, Miki Haseyama, Ping Liu, Takahiro Ogawa.

**Figure 1.** Accuracy versus efficiency on ImageNet-100 with DINOv2 at IPC=1. CLP-DD attains accuracy close to LGM with DSA at the runtime and memory of LGM without DSA. Bubble area denotes peak GPU memory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel framework that turns this structural simplification into a concrete distillation algorithm. The inner adaptation problem in our setting…

**Figure 2.** Examples of IPC=1 images distilled by CLP-DD on ImageNet-100 using the CLIP, DINOv2, MoCo-v3, and EVA-02 backbones. By default, CLP-DD uses standard image augmentations, including random cropping and noise injection, and does not use Differentiable Siamese Augmentation (DSA) [44]. For controlled evaluation, we extract features from the final layer of each backbone and use the same linear-probe training proto…

**Figure 3.** Convergence comparison on ImageNet-100 with DINOv2 at IPC=1.

**Figure 4.** PCA visualization of real and distilled image embeddings on ImageNet-Fruits. To qualitatively examine how CLP-DD organizes distilled images in the frozen feature space, we visualize real and distilled embeddings using a 2D Principal Component Analysis (PCA) projection.

**Figure 5.** Additional examples of images distilled by CLP-DD on ImageNet-100 using DINOv2.

**Figure 6.** Additional examples of images distilled by CLP-DD on ImageNet-100 using CLIP.

**Figure 7.** Additional examples of images distilled by CLP-DD on ImageNet-100 using EVA-02.

**Figure 8.** Additional examples of images distilled by CLP-DD on ImageNet-100 using MoCo-v3.

Original abstract

Dataset distillation compresses a large training set into a small synthetic set that preserves downstream training utility. While most existing methods target training networks from scratch, modern visual transfer learning often uses frozen pre-trained encoders followed by lightweight linear probing. Existing distillation methods for this setting either unroll iterative linear-probe updates with trajectory-based gradient matching, or rely on closed-form formulations originally designed for from-scratch training with neural-tangent-kernel (NTK) approximations. Neither route exploits the fact that frozen-feature linear probing admits a closed-form solution determined directly by the pre-trained features themselves, with no infinite-width approximation and no inner-loop trajectory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel formulation that computes the linear probe induced by the synthetic set with a sample-space kernel ridge solver. The synthetic images are then updated by evaluating this induced classifier on real features through a temperature-scaled softmax cross-entropy, where the classifier columns act as learned class anchors in feature space. We further show that the choice of outer objective is decisive: pairing the closed-form inner solver with a standard MSE outer loss substantially underperforms trajectory-based methods, while the discriminative outer loss closes most of the gap. On ImageNet-100 with four pre-trained backbones, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four backbones while running roughly $14\times$ faster and using less than one-eighth of the GPU memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLP-DD gives a faster closed-form distillation route for linear probes on frozen encoders, with clear speed gains over prior work but an untested assumption that the ridge solver matches standard probing.

This paper introduces a bilevel setup for dataset distillation that targets frozen pre-trained vision models followed by linear probes. The inner step solves for the probe weights directly with a sample-space kernel ridge regression on the synthetic features and labels, then the outer step updates the synthetic images by feeding real features through a temperature-scaled cross-entropy loss that treats the learned weights as class anchors. That combination is the main novelty relative to earlier trajectory-matching or NTK-based distillation methods aimed at from-scratch training.

Referee Report

1 major / 2 minor

Summary. The paper introduces Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel method for distilling datasets targeted at frozen pre-trained vision encoders followed by linear probing. The inner level solves for the linear probe induced by synthetic images via a sample-space kernel-ridge regression on their features and (one-hot) labels; the outer level then updates the synthetic images by evaluating the resulting classifier columns as class anchors inside a temperature-scaled softmax cross-entropy loss computed on real features. The authors demonstrate that pairing the closed-form inner solver with an MSE outer objective underperforms trajectory-based baselines, while the discriminative outer objective recovers most of the performance gap. On ImageNet-100 and ImageNet-1K they report that CLP-DD substantially improves over LGM without DSA, matches or exceeds LGM with DSA on most backbones, and runs approximately 14× faster with less than one-eighth the GPU memory.

Significance. If the central empirical claims hold, the work supplies a practical, memory-efficient alternative to trajectory-matching distillation methods for the common transfer-learning setting of frozen encoders plus linear probes. The explicit separation of a parameter-free algebraic inner solver from a standard discriminative outer loss, together with the reported speed and memory gains, would be a useful contribution to the dataset-distillation literature.

major comments (1)

[§3 (method) and experimental sections] The manuscript does not report a direct side-by-side comparison of the closed-form kernel-ridge probe (used both for the inner solver and as class anchors in the outer loss) against a conventionally trained linear probe obtained by SGD/Adam minimization of cross-entropy on the same synthetic features. Because the outer objective and all reported downstream numbers rely on the ridge surrogate, any systematic discrepancy between ridge-induced weights and gradient-trained weights would mean the distilled sets are optimized for an unrepresentative proxy of standard linear-probe utility. This comparison is load-bearing for the ImageNet-100/1K claims.

minor comments (2)

[§3.2] The temperature scaling factor in the outer softmax is listed as a free hyper-parameter; its sensitivity and selection protocol should be stated explicitly.
[Figure 1] Figure captions and axis labels in the runtime/memory plots would benefit from explicit units (e.g., “GPU-hours” and “GB peak memory”).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and detailed review of our work on Closed-Form Linear-Probe Dataset Distillation. We address the major comment point by point below and outline the revisions we will make.

Referee: [§3 (method) and experimental sections] The manuscript does not report a direct side-by-side comparison of the closed-form kernel-ridge probe (used both for the inner solver and as class anchors in the outer loss) against a conventionally trained linear probe obtained by SGD/Adam minimization of cross-entropy on the same synthetic features. Because the outer objective and all reported downstream numbers rely on the ridge surrogate, any systematic discrepancy between ridge-induced weights and gradient-trained weights would mean the distilled sets are optimized for an unrepresentative proxy of standard linear-probe utility. This comparison is load-bearing for the ImageNet-100/1K claims.

Authors: We agree that explicitly comparing the closed-form kernel-ridge solution against a gradient-trained linear probe (SGD/Adam minimizing cross-entropy) on the same synthetic features would strengthen the validation of our approach. The kernel-ridge solver yields the exact minimizer of the regularized MSE objective with one-hot labels, which enables the efficient, non-iterative inner loop that is central to CLP-DD's speed and memory advantages. In the current experiments, this closed-form probe is used consistently for both optimization and reported downstream accuracies to maintain tractability. Nevertheless, because practical linear probing typically employs cross-entropy with iterative optimizers, we will add the requested side-by-side comparison in the revised manuscript. For each backbone and dataset, we will train linear probes on the distilled synthetic features using SGD/Adam, report the resulting test accuracies, and directly contrast them with the ridge-based numbers. This will quantify any discrepancy and confirm that the distilled sets remain effective under standard linear-probe protocols, thereby supporting the ImageNet-100 and ImageNet-1K claims.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper's bilevel formulation computes the inner linear probe via an algebraic sample-space kernel ridge solver on synthetic features and labels, which is a direct closed-form operation with no iterative trajectories, NTK approximations, or self-referential fitting. The outer loss is a standard temperature-scaled softmax cross-entropy evaluated on real features using the induced classifier columns as class anchors. This structure does not reduce any claimed prediction or result to a quantity defined by the paper's own inputs or fitted parameters by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the derivation; comparisons to external baselines such as LGM are independent. The central claims rest on empirical performance rather than circular reduction, making the method self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain fact that linear probing on fixed features admits an exact closed-form solution; no new entities are postulated and the only free parameters are standard hyperparameters such as temperature.

free parameters (1)

temperature scaling in outer softmax
Hyperparameter in the discriminative outer loss; its value affects the gap between MSE and cross-entropy variants.
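
To make the role of this hyperparameter concrete, the sketch below evaluates a temperature-scaled softmax cross-entropy on a fixed, hand-written score matrix at several temperatures. The scores, labels, and temperature grid are hypothetical and chosen only for illustration; they are not values from the paper.

```python
import numpy as np

def softmax_cross_entropy(scores, labels, temperature):
    """Mean cross-entropy of softmax(scores / temperature) against integer labels."""
    scaled = scores / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Hypothetical anchor scores for 4 real samples and 3 classes.
# The last sample is misclassified: its true class (1) does not have the top score.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.8],
                   [0.6, 0.3, 0.1]])
labels = np.array([0, 1, 2, 1])

losses = {t: softmax_cross_entropy(scores, labels, t) for t in (1.0, 0.5, 0.1, 0.02)}
for t, value in losses.items():
    print(f"temperature={t:<5} loss={value:.4f}")
```

At small temperatures the correctly classified samples contribute almost nothing and the misclassified one dominates, since its loss grows roughly like its score margin divided by the temperature. At very large temperatures the loss approaches log(3), the value for a uniform prediction over three classes.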

axioms (1)

Domain assumption: Frozen pre-trained features admit an exact closed-form linear probe via sample-space kernel ridge regression with no infinite-width approximation required.
Invoked to replace iterative inner-loop updates in the bilevel formulation.
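
The "sample-space" phrasing refers to a standard identity for ridge regression: the feature-space solution, which solves a d × d system, equals the sample-space solution, which solves an m × m system. The check below verifies this numerically on random data, assuming a linear kernel; the sizes and ridge weight are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, c = 5, 32, 3                      # synthetic samples, feature dims, classes
features = rng.normal(size=(m, d))
targets = np.eye(c)[rng.integers(0, c, size=m)]   # one-hot labels
ridge = 1e-2

# Feature-space (primal) form: (F^T F + ridge * I_d)^{-1} F^T Y
w_primal = np.linalg.solve(features.T @ features + ridge * np.eye(d),
                           features.T @ targets)

# Sample-space (dual) form: F^T (F F^T + ridge * I_m)^{-1} Y
w_dual = features.T @ np.linalg.solve(features @ features.T + ridge * np.eye(m),
                                      targets)

max_gap = float(np.abs(w_primal - w_dual).max())
print(f"max |primal - dual| = {max_gap:.2e}")
```

When the synthetic set is much smaller than the feature dimension, the sample-space form is the cheaper of the two, because the matrix being inverted grows with the number of synthetic images rather than with the feature width.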

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

[1] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, volume 32, pages 688–699, 2019.
[2] Ondrej Bohdal, Yongxin Yang, and Timothy Hospedales. Flexible dataset distillation: Learn labels instead of images. arXiv preprint arXiv:2006.08572, 2020.
[3] George Cazenavette, Antonio Torralba, and Vincent Sitzmann. Dataset distillation for pre-trained self-supervised vision models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2025.
[4] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4750–4759, 2022.
[5] Jeffrey A. Chan-Santiago and Mubarak Shah. Learnability-guided diffusion for dataset distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[6] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9640–9649, 2021.
[7] Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Scaling up dataset distillation to imagenet-1k with constant memory. In International Conference on Machine Learning (ICML), pages 6565–6590, 2023.
[8] Zhiwei Deng and Olga Russakovsky. Remember the past: Distilling datasets into addressable memories for neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022.
[9] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.
[10] Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang, and Yang You. Towards lossless dataset distillation via difficulty-aligned trajectory matching. In International Conference on Learning Representations (ICLR), 2024.
[11] Chenyang Jiang, Zhengcen Li, Hang Zhao, Qiben Shan, Shaocong Wu, and Jingyong Su. Beyond pixels: Efficient dataset distillation via sparse gaussian representation. arXiv preprint arXiv:2509.26219, 2025.
[12] Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data parameterization. In Proceedings of the International Conference on Machine Learning (ICML), pages 11102–11118, 2022.
[13] Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon. Dataset condensation with contrastive signals. In International Conference on Machine Learning, pages 12352–12364, 2022.
[14] Yongmin Lee and Hye Won Chung. SelMatch: Effectively scaling up dataset distillation via selection-based initialization and partial updates by trajectory matching. In Proceedings of the International Conference on Machine Learning (ICML), 2024.
[15] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Soft-label anonymous gastric x-ray image distillation. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 305–309, 2020.
[16] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Compressed gastric image generation based on soft-label dataset distillation for medical data sharing. Computer Methods and Programs in Biomedicine, 227:107189, 2022.
[17] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Importance-aware adaptive dataset distillation. Neural Networks, 2024.
[18] Guang Li, Bo Zhao, and Tongzhou Wang. Awesome dataset distillation. https://github.com/Guang000/Awesome-Dataset-Distillation, 2022.
[19] Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation: Balancing global structure and local details. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 7664–7671, 2024.
[20] Mingzhuo Li, Guang Li, Jiafeng Mao, Takahiro Ogawa, and Miki Haseyama. Diversity-driven generative dataset distillation based on diffusion model with self-adaptive memory. In IEEE International Conference on Image Processing (ICIP), 2024.
[21] Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Hyperbolic dataset distillation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2025.
[22] Haoyang Liu, Yijiang Li, Tiancheng Xing, Peiran Wang, Vibhu Dalal, Luwei Li, Jingrui He, and Haohan Wang. Dataset distillation via the wasserstein metric. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[23] Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Efficient dataset distillation using random feature approximation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022.
[24] Aengus Lynch, Gbètondji J-S Dovonon, Jean Kaddour, and Ricardo Silva. Spawrious: A benchmark for fine control of spurious correlation biases. arXiv preprint arXiv:2303.05470, 2023.
[25] Hongxu Ma, Guang Li, Shijie Wang, Dongzhan Zhou, Baoli Sun, Zhihui Wang, Takahiro Ogawa, and Miki Haseyama. Fd2: A dedicated framework for fine-grained dataset distillation. arXiv preprint arXiv:2603.25144, 2026.
[26] Dmitry Medvedev and Alexander D’yakonov. Learning to generate synthetic training data using gradient matching and implicit differentiation. In International Conference on Analysis of Images, Social Networks and Texts, pages 138–150, 2021.
[27] Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression. In International Conference on Learning Representations (ICLR), 2021.
[28] Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. Dataset distillation with infinitely wide convolutional networks. Advances in Neural Information Processing Systems, 34:5186–5198, 2021.
[29] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick La... DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.
[31] Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pages 113–124, 2019.
[32] Fengli Ran, Xiao Pu, Bo Liu, Xiuli Bi, and Bin Xiao. TGDD: Trajectory guided dataset distillation with balanced distribution. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026.
[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[34] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020.
[35] Ahmad Sajedi, Samir Khaki, Ehsan Amjadian, Lucy Z Liu, Yuri A Lawryshyn, and Konstantinos N Plataniotis. Datadam: Efficient dataset distillation with attention matching. In IEEE/CVF International Conference on Computer Vision, pages 17097–17107, 2023.
[36] Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, and Bowen Tang. D^4: Dataset distillation via disentangled diffusion model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5809–5818, 2024.
[37] Duo Su, Junjie Hou, Guang Li, Ren Togo, Rui Song, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation based on diffusion model. In European Conference on Computer Vision Workshops, 2024.
[38] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In European Conference on Computer Vision (ECCV), pages 776–794. Springer, 2020.
[39] Linh-Tam Tran and Sung-Ho Bae. Post training quantization for efficient dataset condensation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026.
[40] Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. Cafe: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12196–12205, 2022.
[41] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
[42] Linfeng Ye, Shayan Mohajer Hamidi, Guang Li, Takahiro Ogawa, Miki Haseyama, and Konstantinos N. Plataniotis. Information-guided diffusion sampling for dataset distillation. In Advances in Neural Information Processing Systems Workshops, 2025.
[43] Ruonan Yu, Songhua Liu, and Xinchao Wang. A comprehensive survey to dataset distillation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):150–170, 2023.
[44] Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pages 12674–12685, 2021.
[45] Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6514–6523, 2023.
[46] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. In International Conference on Learning Representations (ICLR), 2021.
[47] Yongchao Zhou, Ehsan Nezhadarya, and Jimmy Ba. Dataset distillation using neural feature regression. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 9813–9827, 2022.

[1] [1]

Zico Kolter, and Vladlen Koltun

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. InAdvances in Neural Information Processing Systems, volume 32, pages 688–699, 2019

work page 2019

[2] [2]

Flexible dataset distillation: Learn labels instead of images.arXiv preprint arXiv:2006.08572, 2020

Ondrej Bohdal, Yongxin Yang, and Timothy Hospedales. Flexible dataset distillation: Learn labels instead of images.arXiv preprint arXiv:2006.08572, 2020

work page arXiv 2006

[3] George Cazenavette, Antonio Torralba, and Vincent Sitzmann. Dataset distillation for pre-trained self-supervised vision models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2025.

[4] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4750–4759, 2022.

[5] Jeffrey A. Chan-Santiago and Mubarak Shah. Learnability-guided diffusion for dataset distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

[6] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9640–9649, 2021.

[7] Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Scaling up dataset distillation to imagenet-1k with constant memory. In International Conference on Machine Learning (ICML), pages 6565–6590, 2023.

[8] Zhiwei Deng and Olga Russakovsky. Remember the past: Distilling datasets into addressable memories for neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022.

[9] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.

[10] Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang, and Yang You. Towards lossless dataset distillation via difficulty-aligned trajectory matching. In International Conference on Learning Representations (ICLR), 2024.

[11] Chenyang Jiang, Zhengcen Li, Hang Zhao, Qiben Shan, Shaocong Wu, and Jingyong Su. Beyond pixels: Efficient dataset distillation via sparse gaussian representation. arXiv preprint arXiv:2509.26219, 2025.

[12] Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data parameterization. In Proceedings of the International Conference on Machine Learning (ICML), pages 11102–11118, 2022.

[13] Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon. Dataset condensation with contrastive signals. In International Conference on Machine Learning, pages 12352–12364, 2022.

[14] Yongmin Lee and Hye Won Chung. SelMatch: Effectively scaling up dataset distillation via selection-based initialization and partial updates by trajectory matching. In Proceedings of the International Conference on Machine Learning (ICML), 2024.

[15] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Soft-label anonymous gastric x-ray image distillation. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 305–309, 2020.

[16] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Compressed gastric image generation based on soft-label dataset distillation for medical data sharing. Computer Methods and Programs in Biomedicine, 227:107189, 2022.

[17] Guang Li, Ren Togo, Takahiro Ogawa, and Miki Haseyama. Importance-aware adaptive dataset distillation. Neural Networks, 2024.

[18] Guang Li, Bo Zhao, and Tongzhou Wang. Awesome dataset distillation. https://github.com/Guang000/Awesome-Dataset-Distillation, 2022.

[19] Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation: Balancing global structure and local details. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 7664–7671, 2024.

[20] Mingzhuo Li, Guang Li, Jiafeng Mao, Takahiro Ogawa, and Miki Haseyama. Diversity-driven generative dataset distillation based on diffusion model with self-adaptive memory. In IEEE International Conference on Image Processing (ICIP), 2024.

[21] Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Hyperbolic dataset distillation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2025.

[22] Haoyang Liu, Yijiang Li, Tiancheng Xing, Peiran Wang, Vibhu Dalal, Luwei Li, Jingrui He, and Haohan Wang. Dataset distillation via the wasserstein metric. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

[23] Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Efficient dataset distillation using random feature approximation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022.

[24] Aengus Lynch, Gbètondji J-S Dovonon, Jean Kaddour, and Ricardo Silva. Spawrious: A benchmark for fine control of spurious correlation biases. arXiv preprint arXiv:2303.05470, 2023.

[25] Hongxu Ma, Guang Li, Shijie Wang, Dongzhan Zhou, Baoli Sun, Zhihui Wang, Takahiro Ogawa, and Miki Haseyama. Fd2: A dedicated framework for fine-grained dataset distillation. arXiv preprint arXiv:2603.25144, 2026.

[26] Dmitry Medvedev and Alexander D’yakonov. Learning to generate synthetic training data using gradient matching and implicit differentiation. In International Conference on Analysis of Images, Social Networks and Texts, pages 138–150, 2021.

[27] Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression. In International Conference on Learning Representations (ICLR), 2021.

[28] Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. Dataset distillation with infinitely wide convolutional networks. Advances in Neural Information Processing Systems, 34:5186–5198, 2021.

[29] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick La... DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.

[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.

[31] Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pages 113–124, 2019.

[32] Fengli Ran, Xiao Pu, Bo Liu, Xiuli Bi, and Bin Xiao. TGDD: Trajectory guided dataset distillation with balanced distribution. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026.

[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[34] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020.

[35] Ahmad Sajedi, Samir Khaki, Ehsan Amjadian, Lucy Z Liu, Yuri A Lawryshyn, and Konstantinos N Plataniotis. Datadam: Efficient dataset distillation with attention matching. In IEEE/CVF International Conference on Computer Vision, pages 17097–17107, 2023.

[36] Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, and Bowen Tang. D^4: Dataset distillation via disentangled diffusion model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5809–5818, 2024.

[37] Duo Su, Junjie Hou, Guang Li, Ren Togo, Rui Song, Takahiro Ogawa, and Miki Haseyama. Generative dataset distillation based on diffusion model. In European Conference on Computer Vision Workshops, 2024.

[38] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In European Conference on Computer Vision (ECCV), pages 776–794. Springer, 2020.

[39] Linh-Tam Tran and Sung-Ho Bae. Post training quantization for efficient dataset condensation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026.

[40] Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. Cafe: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12196–12205, 2022.

[41] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.

[42] Linfeng Ye, Shayan Mohajer Hamidi, Guang Li, Takahiro Ogawa, Miki Haseyama, and Konstantinos N. Plataniotis. Information-guided diffusion sampling for dataset distillation. In Advances in Neural Information Processing Systems Workshops, 2025.

[43] Ruonan Yu, Songhua Liu, and Xinchao Wang. A comprehensive survey to dataset distillation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):150–170, 2023.

[44] Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pages 12674–12685, 2021.

[45] Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6514–6523, 2023.

[46] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. In International Conference on Learning Representations (ICLR), 2021.

[47] Yongchao Zhou, Ehsan Nezhadarya, and Jimmy Ba. Dataset distillation using neural feature regression. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 9813–9827, 2022.

A Detailed Mathematical Derivations

This appendix provides the derivations used in Closed-Form Linear-Probe Dataset Distillation (CLP-DD). We first derive ...

Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

Guidelines:
• The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects.