Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models
Pith reviewed 2026-05-11 02:30 UTC · model grok-4.3
The pith
A closed-form kernel ridge solver for linear probes enables dataset distillation that matches slower iterative methods at far lower cost for pre-trained vision models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the linear probe induced by any candidate synthetic set can be obtained exactly via sample-space kernel ridge regression on the pre-trained features, after which the synthetic images are updated by evaluating the resulting classifier columns (treated as learned class anchors) on real data features through a temperature-scaled softmax cross-entropy objective. This formulation removes both the need for inner-loop trajectory simulation and the need for neural-tangent-kernel approximations, directly yielding synthetic sets whose induced probes perform well on the original feature distribution.
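In symbols, the formulation described above can be sketched as follows. The notation is ours rather than the paper's: $\phi$ is the frozen encoder, $Z_s = \phi(X_s) \in \mathbb{R}^{m \times d}$ holds the features of the $m$ synthetic images, $Y_s \in \{0,1\}^{m \times C}$ their one-hot labels, $(z_i, y_i)$ are real features and labels, $\tau$ is the temperature, and $\lambda > 0$ is a ridge coefficient that we assume here because the summary does not say how the regularizer is set.

```latex
\begin{align}
  W(X_s) &= Z_s^{\top}\bigl(Z_s Z_s^{\top} + \lambda I_m\bigr)^{-1} Y_s
           \in \mathbb{R}^{d \times C}, \\
  \mathcal{L}_{\text{outer}}(X_s) &= -\frac{1}{n}\sum_{i=1}^{n}
      \log \frac{\exp\bigl(z_i^{\top} w_{y_i}(X_s)/\tau\bigr)}
                {\sum_{c=1}^{C} \exp\bigl(z_i^{\top} w_c(X_s)/\tau\bigr)},
\end{align}
```

Here $w_c(X_s)$ is the $c$-th column of $W(X_s)$, the class anchor for class $c$. The first line is the sample-space form of ridge regression; by the standard push-through identity it equals the feature-space form $(Z_s^{\top} Z_s + \lambda I_d)^{-1} Z_s^{\top} Y_s$, but it only requires solving an $m \times m$ system, which is small when the synthetic set is small.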
What carries the argument
Bilevel optimization whose inner problem is solved exactly by a sample-space kernel ridge regression that produces the linear classifier induced by the synthetic set, with the outer loss then measuring how well those classifier columns separate real features.
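A minimal NumPy sketch of these two pieces, under stated assumptions: random vectors stand in for frozen-encoder features, the function names `closed_form_probe` and `softmax_ce` and the values of `lam` and `tau` are illustrative choices of ours, and the gradient step on the synthetic images (which would need an autodiff framework) is omitted. It is not the authors' implementation.

```python
import numpy as np

def closed_form_probe(Z_syn, Y_syn, lam=1e-2):
    """Sample-space ridge solve: W = Z^T (Z Z^T + lam I)^{-1} Y.

    lam is an assumed ridge coefficient; the source does not specify it.
    """
    m = Z_syn.shape[0]
    K = Z_syn @ Z_syn.T                                   # (m, m) linear-kernel Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(m), Y_syn)   # (m, C) dual coefficients
    return Z_syn.T @ alpha                                # (d, C); columns act as class anchors

def softmax_ce(Z_real, y_real, W, tau=0.1):
    """Temperature-scaled softmax cross-entropy of real features against the anchors."""
    logits = (Z_real @ W) / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(y_real)), y_real].mean())

rng = np.random.default_rng(0)
d, C, ipc, n_real = 16, 3, 2, 50                          # toy sizes, purely illustrative

# Stand-ins for encoder features of synthetic and real images.
y_syn = np.repeat(np.arange(C), ipc)
Z_syn = rng.normal(size=(C * ipc, d))
Y_syn = np.eye(C)[y_syn]
Z_real = rng.normal(size=(n_real, d))
y_real = rng.integers(0, C, size=n_real)

lam = 1e-2
W = closed_form_probe(Z_syn, Y_syn, lam=lam)
loss = softmax_ce(Z_real, y_real, W, tau=0.1)

# Sanity check: the sample-space solve agrees with the feature-space ridge solution.
W_primal = np.linalg.solve(Z_syn.T @ Z_syn + lam * np.eye(d), Z_syn.T @ Y_syn)
print("outer loss:", loss)
print("max |W - W_primal|:", np.abs(W - W_primal).max())
```

In a full pipeline, `Z_syn` would be a differentiable function of the synthetic pixels, so the outer loss could be backpropagated through the linear solve to update the images.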
If this is right
- On ImageNet-100 with four pre-trained backbones, the method substantially improves linear-probe accuracy over LGM without DSA.
- On the same benchmark it approaches LGM with DSA at a fraction of the computational cost.
- On ImageNet-1K it matches or surpasses LGM with DSA on three of four tested backbones.
- On ImageNet-1K it runs roughly 14× faster than LGM with DSA and uses less than one-eighth of the GPU memory.
Where Pith is reading between the lines
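- The same closed-form inner solver could be swapped into distillation pipelines for any downstream head whose optimal parameters admit an analytic expression, such as ridge or kernel ridge regression. Logistic regression has no closed-form minimizer, so a cross-entropy-trained head would still need an iterative or implicit-gradient inner solver.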
- Because the approach never simulates training trajectories, it may scale to distillation budgets that are currently intractable for iterative methods.
- The emphasis on a discriminative outer loss suggests that future work could explore other outer objectives that directly optimize class separation in feature space rather than regression to one-hot targets.
Load-bearing premise
That the closed-form linear probe computed from synthetic features and evaluated on real features accurately reflects the downstream utility that would be obtained by actually training a linear head on the synthetic set.
What would settle it
Training linear probes with a standard iterative optimizer on the distilled synthetic sets and measuring their accuracy on held-out real data; if this accuracy falls substantially below the accuracy of the closed-form probe on the same data, or below the accuracy achieved by trajectory-based distilled sets on the same backbones, the central claim is falsified.
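The check described above can be sketched on toy data as follows. Everything here is an assumption made for illustration: Gaussian clusters stand in for real encoder features, a few noisy cluster samples stand in for a distilled set, the gradient-trained probe uses plain full-batch gradient descent without a bias term, and the names `ridge_probe` and `gd_softmax_probe` are ours. A real test would use the actual distilled images, the actual backbones, and SGD or Adam.

```python
import numpy as np

def ridge_probe(Z, Y, lam=1e-2):
    """Closed-form probe via a sample-space ridge solve (lam is an assumed value)."""
    K = Z @ Z.T
    return Z.T @ np.linalg.solve(K + lam * np.eye(len(Z)), Y)

def gd_softmax_probe(Z, Y, lr=0.05, steps=300):
    """Linear probe trained by gradient descent on softmax cross-entropy."""
    W = np.zeros((Z.shape[1], Y.shape[1]))
    for _ in range(steps):
        logits = Z @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * Z.T @ (P - Y) / len(Z)              # gradient of the mean cross-entropy
    return W

def accuracy(W, Z, y):
    return float(((Z @ W).argmax(axis=1) == y).mean())

rng = np.random.default_rng(0)
d, C, ipc, n_test = 16, 3, 2, 300
means = 5.0 * np.eye(C, d)                            # well-separated toy class centers

def sample(labels):
    """Toy stand-in for encoder features: class center plus Gaussian noise."""
    return means[labels] + 0.5 * rng.normal(size=(len(labels), d))

y_syn = np.repeat(np.arange(C), ipc)                  # stand-in for a distilled set
Z_syn, Y_syn = sample(y_syn), np.eye(C)[y_syn]
y_test = rng.integers(0, C, size=n_test)              # stand-in for held-out real data
Z_test = sample(y_test)

acc_ridge = accuracy(ridge_probe(Z_syn, Y_syn), Z_test, y_test)
acc_gd = accuracy(gd_softmax_probe(Z_syn, Y_syn), Z_test, y_test)
print(f"closed-form ridge probe accuracy: {acc_ridge:.3f}")
print(f"gradient-trained probe accuracy:  {acc_gd:.3f}")
print(f"gap (ridge - gd):                 {acc_ridge - acc_gd:+.3f}")
```

On this easy toy problem both probes should classify almost everything correctly, so the gap says nothing about the paper; the point is only the shape of the comparison, where a large gap on real distilled sets would indicate that the ridge surrogate is an unrepresentative proxy.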
Original abstract
Dataset distillation compresses a large training set into a small synthetic set that preserves downstream training utility. While most existing methods target training networks from scratch, modern visual transfer learning often uses frozen pre-trained encoders followed by lightweight linear probing. Existing distillation methods for this setting either unroll iterative linear-probe updates with trajectory-based gradient matching, or rely on closed-form formulations originally designed for from-scratch training with neural-tangent-kernel (NTK) approximations. Neither route exploits the fact that frozen-feature linear probing admits a closed-form solution determined directly by the pre-trained features themselves, with no infinite-width approximation and no inner-loop trajectory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel formulation that computes the linear probe induced by the synthetic set with a sample-space kernel ridge solver. The synthetic images are then updated by evaluating this induced classifier on real features through a temperature-scaled softmax cross-entropy, where the classifier columns act as learned class anchors in feature space. We further show that the choice of outer objective is decisive: pairing the closed-form inner solver with a standard MSE outer loss substantially underperforms trajectory-based methods, while the discriminative outer loss closes most of the gap. On ImageNet-100 with four pre-trained backbones, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four backbones while running roughly $14\times$ faster and using less than one-eighth of the GPU memory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel method for distilling datasets targeted at frozen pre-trained vision encoders followed by linear probing. The inner level solves for the linear probe induced by synthetic images via a sample-space kernel ridge regression on their features and (one-hot) labels; the outer level then updates the synthetic images by using the resulting classifier columns as class anchors inside a temperature-scaled softmax cross-entropy loss computed on real features. The authors show that pairing the closed-form inner solver with an MSE outer objective underperforms trajectory-based baselines, while the discriminative outer objective recovers most of the performance gap. On ImageNet-100 they report that CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K they report that it matches or surpasses LGM with DSA on three of four backbones while running approximately 14× faster and using less than one-eighth of the GPU memory.
Significance. If the central empirical claims hold, the work supplies a practical, memory-efficient alternative to trajectory-matching distillation methods for the common transfer-learning setting of frozen encoders plus linear probes. The explicit separation of a parameter-free algebraic inner solver from a standard discriminative outer loss, together with the reported speed and memory gains, would be a useful contribution to the dataset-distillation literature.
major comments (1)
- [§3 (method) and experimental sections] The manuscript does not report a direct side-by-side comparison of the closed-form kernel-ridge probe (used both for the inner solver and as class anchors in the outer loss) against a conventionally trained linear probe obtained by SGD/Adam minimization of cross-entropy on the same synthetic features. Because the outer objective and all reported downstream numbers rely on the ridge surrogate, any systematic discrepancy between ridge-induced weights and gradient-trained weights would mean the distilled sets are optimized for an unrepresentative proxy of standard linear-probe utility. This comparison is load-bearing for the ImageNet-100/1K claims.
minor comments (2)
- [§3.2] The temperature scaling factor in the outer softmax is listed as a free hyper-parameter; its sensitivity and selection protocol should be stated explicitly.
- [Figure 4] Figure captions and axis labels in the runtime/memory plots would benefit from explicit units (e.g., “GPU-hours” and “GB peak memory”).
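For the temperature comment above, a sensitivity sweep could look like the following sketch. The anchor scores, labels, and temperature grid are hypothetical values chosen by us for illustration; they are not taken from the paper, and the helper `softmax_ce` is our own.

```python
import numpy as np

def softmax_ce(scores, labels, tau):
    """Mean softmax cross-entropy of anchor scores z_i^T w_c divided by temperature tau."""
    logits = scores / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Hypothetical scores for four real samples against three class anchors.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.8],
                   [0.4, 0.5, 0.1]])
labels = np.array([0, 1, 2, 0])    # the last sample is misclassified by argmax

taus = [1.0, 0.5, 0.1, 0.05]       # hypothetical sweep grid
losses = {tau: softmax_ce(scores, labels, tau) for tau in taus}
for tau, loss in losses.items():
    print(f"tau={tau:<5} outer loss={loss:.4f}")

# Reference point: with identical scores for every class the loss is log(C).
uniform_loss = softmax_ce(np.zeros((2, 3)), np.array([0, 1]), tau=0.1)
print("uniform-score loss:", uniform_loss)
```

Smaller temperatures sharpen the softmax, so correctly classified samples contribute less and misclassified ones contribute more; reporting such a sweep alongside the chosen value would make the selection protocol explicit.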
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed review of our work on Closed-Form Linear-Probe Dataset Distillation. We address the major comment point by point below and outline the revisions we will make.
-
Referee: [§3 (method) and experimental sections] The manuscript does not report a direct side-by-side comparison of the closed-form kernel-ridge probe (used both for the inner solver and as class anchors in the outer loss) against a conventionally trained linear probe obtained by SGD/Adam minimization of cross-entropy on the same synthetic features. Because the outer objective and all reported downstream numbers rely on the ridge surrogate, any systematic discrepancy between ridge-induced weights and gradient-trained weights would mean the distilled sets are optimized for an unrepresentative proxy of standard linear-probe utility. This comparison is load-bearing for the ImageNet-100/1K claims.
Authors: We agree that explicitly comparing the closed-form kernel-ridge solution against a gradient-trained linear probe (SGD/Adam minimizing cross-entropy) on the same synthetic features would strengthen the validation of our approach. The kernel-ridge solver yields the exact minimizer of the regularized MSE objective with one-hot labels, which enables the efficient, non-iterative inner loop that is central to CLP-DD's speed and memory advantages. In the current experiments, this closed-form probe is used consistently for both optimization and reported downstream accuracies to maintain tractability. Nevertheless, because practical linear probing typically employs cross-entropy with iterative optimizers, we will add the requested side-by-side comparison in the revised manuscript. For each backbone and dataset, we will train linear probes on the distilled synthetic features using SGD/Adam, report the resulting test accuracies, and directly contrast them with the ridge-based numbers. This will quantify any discrepancy and confirm that the distilled sets remain effective under standard linear-probe protocols, thereby supporting the ImageNet-100 and ImageNet-1K claims.
Revision: yes
Circularity Check
No significant circularity in the derivation chain
The paper's bilevel formulation computes the inner linear probe via an algebraic sample-space kernel ridge solver on synthetic features and labels, which is a direct closed-form operation with no iterative trajectories, NTK approximations, or self-referential fitting. The outer loss is a standard temperature-scaled softmax cross-entropy evaluated on real features using the induced classifier columns as class anchors. This structure does not reduce any claimed prediction or result to a quantity defined by the paper's own inputs or fitted parameters by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the derivation; comparisons to external baselines such as LGM are independent. The central claims rest on empirical performance rather than circular reduction, making the method self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- temperature scaling in outer softmax
axioms (1)
- domain assumption: Frozen pre-trained features admit an exact closed-form linear probe via sample-space kernel ridge regression with no infinite-width approximation required.