arxiv: 2605.23482 · v1 · pith:IG5MMYZPnew · submitted 2026-05-22 · 💻 cs.CV · cs.AI

Multimodal Distribution Matching for Vision-Language Dataset Distillation

Pith reviewed 2026-05-25 04:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: dataset distillation · vision-language · multimodal distillation · image-text retrieval · distribution matching · synthetic datasets

The pith

MDM produces compact synthetic image-text datasets that preserve multimodal semantics and retrieval performance with reduced computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Multimodal Distribution Matching as a way to distill large vision-language datasets into smaller synthetic versions. It combines sampling from joint embedding clusters for data initialization, interpolating models by angular deviation for a mixed teacher, and a geometry-aware loss on the hypersphere for distribution matching. This setup aims to keep cross-modal alignments intact while using less compute than previous methods. If successful, it would allow efficient creation of training data for multimodal systems that works even when tested on different model architectures.
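The data-level step can be illustrated with a minimal sketch. The Figure 2 caption mentions k-means clustering; everything else here is an assumption: the joint embedding is taken to be the concatenation of L2-normalized image and text embeddings, each synthetic pair is seeded from the real pair nearest a centroid, and the function names `kmeans` and `init_synthetic_indices` are hypothetical. The plain NumPy Lloyd iteration stands in for whatever clustering routine the authors actually use.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (rows with near-zero norm stay near zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns the (k, d) centroid matrix."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid: (n, k).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def init_synthetic_indices(img_emb, txt_emb, k, seed=0):
    """Pick one real image-text pair per cluster as a synthetic starting point.

    Assumption: the joint space is the concatenation of normalized image and
    text embeddings. The source page does not specify this construction.
    """
    joint = np.concatenate([l2_normalize(img_emb), l2_normalize(txt_emb)], axis=1)
    centroids = kmeans(joint, k, seed=seed)
    d2 = ((joint[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=0)  # index of the real pair closest to each centroid
```

The returned indices would select the real pairs used to initialize the synthetic set; a variant that samples randomly within each cluster instead of taking the nearest pair is equally consistent with the description above.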

Core claim

The central discovery is that integrating cluster sampling in the joint embedding space, angular interpolation of fine-tuned models to form a mixed teacher, and matching joint distributions on the unit hypersphere with a geometry-aware objective that uses cross-modal agreement, discrepancy, and symmetric contrastive learning produces synthetic image-text pairs that maintain performance on retrieval tasks.

What carries the argument

The geometry-aware matching objective on the unit hypersphere, which matches distributions by exploiting features in agreement and discrepancy directions along with symmetric contrastive learning.
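Since this page gives no equation for the objective, the following is only a sketch of what a loss of this shape could look like, under stated assumptions. The Figure 2 caption speaks of a "geodesic kernel energy" on the unit hypersphere; here that is modeled as an MMD-style energy with the kernel exp(-arccos(<x, y>) / sigma). The "agreement" and "discrepancy" directions are assumed to be the normalized sum and difference of the image and text embeddings, and the three terms are summed with equal weight. The kernel form, the direction definitions, the weights, and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def geodesic_kernel(a, b, sigma=0.5):
    """exp(-geodesic distance / sigma) between unit-norm rows of a and b."""
    cos = np.clip(a @ b.T, -1.0, 1.0)  # clip guards arccos against rounding
    return np.exp(-np.arccos(cos) / sigma)


def kernel_energy(real, syn, sigma=0.5):
    """MMD-style energy between two point sets on the unit hypersphere."""
    real, syn = l2_normalize(real), l2_normalize(syn)
    return (geodesic_kernel(real, real, sigma).mean()
            + geodesic_kernel(syn, syn, sigma).mean()
            - 2.0 * geodesic_kernel(real, syn, sigma).mean())


def agreement_discrepancy(img, txt):
    """Assumed definitions: normalized sum (agreement) and difference (discrepancy).

    If an image and text embedding coincide, the difference is the zero vector
    and stays zero after normalization.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    return l2_normalize(img + txt), l2_normalize(img - txt)


def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def symmetric_contrastive(img, txt, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE over matched rows."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    i2t = -np.diag(log_softmax(logits, axis=1)).mean()
    t2i = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (i2t + t2i)


def matching_loss(real_img, real_txt, syn_img, syn_txt, sigma=0.5, temperature=0.07):
    """Equal-weight sum of the three terms (the weighting is an assumption)."""
    real_agree, real_disc = agreement_discrepancy(real_img, real_txt)
    syn_agree, syn_disc = agreement_discrepancy(syn_img, syn_txt)
    return (kernel_energy(real_agree, syn_agree, sigma)
            + kernel_energy(real_disc, syn_disc, sigma)
            + symmetric_contrastive(syn_img, syn_txt, temperature))
```

In an actual distillation loop the synthetic images and captions would be the optimization variables and the embeddings would come from the teacher model; an autodiff framework would replace NumPy for that purpose.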

If this is right

  • MDM yields compact synthetic sets that preserve multimodal semantics on image-text retrieval benchmarks.
  • Distillation cost is substantially reduced compared to prior methods.
  • Performance remains robust across different architectures in cross-architecture evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on other multimodal tasks such as captioning or visual reasoning to see if alignment preservation transfers.
  • Lower compute requirements might make dataset distillation feasible for smaller research groups without access to large clusters.
  • Future work could explore whether these synthetic sets improve generalization when used in combination with real data.

Load-bearing premise

The three components of cluster sampling, angular model interpolation, and hyperspherical geometry-aware matching will together preserve cross-modal alignment without the heavy compute of earlier approaches.
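To make the model-level premise concrete, here is one plausible instantiation of angle-dependent weight interpolation. It borrows the interpolation ratio used in Model Stock-style merging, t = N cos(theta) / (1 + (N - 1) cos(theta)), where theta is the angle between fine-tuned weight offsets measured from the pretrained anchor. Whether MDM uses this exact rule is an assumption; this page gives no formula. The function names are hypothetical, and weights are treated as flat vectors rather than per-layer tensors for brevity.

```python
from itertools import combinations

import numpy as np


def mean_pairwise_cosine(deltas):
    """Average cosine between every pair of fine-tuning offsets."""
    cosines = []
    for a, b in combinations(deltas, 2):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        cosines.append(float(a @ b / denom) if denom > 0 else 0.0)
    return float(np.mean(cosines))


def mixed_teacher(w_pre, w_finetuned):
    """Interpolate between the pretrained anchor and the fine-tuned average.

    Assumed rule (Model Stock-style): t = N*cos / (1 + (N-1)*cos). Offsets that
    agree (cos near 1) pull the result toward the fine-tuned average; offsets
    that are orthogonal (cos near 0) leave it at the pretrained anchor.
    """
    n = len(w_finetuned)
    if n < 2:
        raise ValueError("need at least two fine-tuned models to measure an angle")
    deltas = [w - w_pre for w in w_finetuned]
    cos = min(max(mean_pairwise_cosine(deltas), 0.0), 1.0)  # clamp to [0, 1]
    t = n * cos / (1.0 + (n - 1) * cos)
    w_avg = np.mean(w_finetuned, axis=0)
    return t * w_avg + (1.0 - t) * w_pre, t
```

The referee's first major comment applies directly to a rule like this: nothing in the formula itself guarantees that a small parameter-space angle implies preserved cross-modal alignment, which is why an ablation on this step matters.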

What would settle it

A clear falsifier would be if the synthetic datasets generated by MDM show significantly degraded image-text retrieval accuracy compared to real data when evaluated using a model architecture different from those used in distillation.
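The falsifier above is phrased in terms of retrieval accuracy, which is usually reported as Recall@K in both directions. A minimal sketch of that metric follows, assuming one caption per image with matched pairs sharing a row index (benchmarks with several captions per image need a mapping from captions to images instead). The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose matching gallery item (same row) is in the top k."""
    sim = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    topk = np.argsort(-sim, axis=1)[:, :k]          # highest cosine first
    targets = np.arange(sim.shape[0])[:, None]
    return float((topk == targets).any(axis=1).mean())


def retrieval_report(img_emb, txt_emb, ks=(1, 5, 10)):
    """Image-to-text and text-to-image Recall@K for each K in ks."""
    report = {}
    for k in ks:
        report[f"i2t_R@{k}"] = recall_at_k(img_emb, txt_emb, k)
        report[f"t2i_R@{k}"] = recall_at_k(txt_emb, img_emb, k)
    return report
```

A cross-architecture check would compute this report twice on the same test split, once for an unseen architecture trained on the distilled set and once for the same architecture trained on real data, and compare the two.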

Figures

Figures reproduced from arXiv: 2605.23482 by Hoyong Kwon, Jongoh Jeong, Kuk-Jin Yoon, Minseok Kim.

Figure 1: Comparison between prior multimodal dataset distilla…
Figure 2: Overview of MDM. Our MDM method consists of (i) synthetic data initialization using k-means clustering, (ii) image-text model initialization using weight-space interpolation between a pretrained and N finetuned models, and (iii) multimodal distribution matching that minimizes geodesic kernel energy between real and synthetic pairs on the unit hypersphere.
Figure 3: Qualitative results of synthesized data. We compare the initial (left) and distilled samples (right).
Figure 4: Performance curve across datasets and data pairs. Ours…
Original abstract

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computes and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Multimodal Distribution Matching (MDM), a geometry-aware framework for vision-language dataset distillation. It initializes synthetic image-text pairs by sampling from clusters in the joint embedding space, forms a mixed teacher via weight-space interpolation of independently fine-tuned models according to angular deviation from the pretrained anchor, and applies a geometry-aware matching objective on the unit hypersphere that exploits cross-modal agreement/discrepancy directions together with symmetric contrastive learning. On image-text retrieval benchmarks with cross-architecture evaluation, the method is claimed to yield compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

Significance. If the claims hold, MDM could meaningfully lower the compute barrier for VL dataset distillation while maintaining cross-modal fidelity, which would be useful for resource-constrained settings. The three-level integration (data, model, loss) is a coherent design choice, but the significance is tempered by the absence of any reported quantitative cost reductions, ablation results on the interpolation step, or direct comparisons showing superiority over prior multimodal distillation baselines.

major comments (2)
  1. [Model-level component (abstract and §3)] Model-level component (abstract and §3 description): the claim that interpolating independently fine-tuned models by angular deviation from the pretrained anchor produces a mixed teacher whose joint vision-language representations remain aligned is load-bearing for the entire pipeline, yet no derivation, correlation analysis, or ablation is provided showing that angular deviation in parameter space correlates with cross-modal agreement on retrieval metrics. Without this, the downstream geometry-aware matching on the hypersphere cannot be guaranteed to distill faithful pairs.
  2. [Evaluation section] Evaluation section: the abstract asserts 'substantially reduce distillation cost' and 'remain robust across architectures,' but the provided text contains no tables, figures, or numerical results quantifying cost (e.g., GPU-hours or memory) or cross-architecture retrieval metrics (e.g., R@1 deltas), making it impossible to verify whether the central efficiency and robustness claims are supported.
minor comments (2)
  1. [Loss level (abstract and §4)] Notation for the geometry-aware objective (loss level) is introduced without an explicit equation or pseudocode, which hinders reproducibility.
  2. [Introduction / abstract] The abstract refers to 'prior methods often require heavy computes' without citing specific multimodal distillation baselines or their reported costs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of the model-level component and the supporting evaluation results.

Point-by-point responses
  1. Referee: Model-level component (abstract and §3 description): the claim that interpolating independently fine-tuned models by angular deviation from the pretrained anchor produces a mixed teacher whose joint vision-language representations remain aligned is load-bearing for the entire pipeline, yet no derivation, correlation analysis, or ablation is provided showing that angular deviation in parameter space correlates with cross-modal agreement on retrieval metrics. Without this, the downstream geometry-aware matching on the hypersphere cannot be guaranteed to distill faithful pairs.

    Authors: We agree that the manuscript currently lacks an explicit derivation or empirical analysis linking angular deviation in parameter space to cross-modal agreement on retrieval metrics. This is a valid observation. In the revised version we will add a dedicated subsection containing (i) a short geometric argument relating angular deviation to representation drift and (ii) a correlation study plus ablation that quantifies how the interpolation step affects downstream retrieval performance. These additions will directly support the load-bearing claim.

    revision: yes

  2. Referee: Evaluation section: the abstract asserts 'substantially reduce distillation cost' and 'remain robust across architectures,' but the provided text contains no tables, figures, or numerical results quantifying cost (e.g., GPU-hours or memory) or cross-architecture retrieval metrics (e.g., R@1 deltas), making it impossible to verify whether the central efficiency and robustness claims are supported.

    Authors: The referee is correct that the submitted manuscript text does not contain the requested quantitative tables or figures for distillation cost (GPU-hours, memory) or cross-architecture R@1 deltas. We will insert new tables and figures reporting these metrics, including direct comparisons against prior multimodal distillation baselines, to substantiate the efficiency and robustness claims made in the abstract.

    revision: yes

Circularity Check

0 steps flagged

No circularity; high-level method description contains no equations or self-referential reductions.

full rationale

The provided abstract and method summary describe MDM via three complementary components (data-level cluster sampling, model-level angular interpolation of fine-tuned models, loss-level geometry-aware matching on the hypersphere) but supply no equations, no fitted parameters renamed as predictions, and no self-citations that bear the central claim. The interpolation step is presented as an independent modeling choice rather than a definitional tautology, and the overall framework is not shown to reduce to its inputs by construction. Absent any load-bearing derivation chain that collapses, the paper is self-contained at the level of description given.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5728 in / 1054 out tokens · 20004 ms · 2026-05-25T04:27:59.090817+00:00 · methodology

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

  2. [2]

    Contextual diversity for active learning

    Sharat Agarwal, Himanshu Arora, Saket Anand, and Chetan Arora. Contextual diversity for active learning. InEuropean Conference on Computer Vision, pages 137–153. Springer,

  3. [3]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

  5. [5]

    Dataset distillation as data compression: A rate-utility perspective

    Youneng Bao, Yiping Liu, Zhuo Chen, Yongsheng Liang, Mu Li, and Kede Ma. Dataset distillation as data compression: A rate-utility perspective. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 519– 529, 2025. 1

  6. [6]

    High-performance large-scale image recognition without normalization

    Andy Brock, Soham De, Samuel L Smith, and Karen Si- monyan. High-performance large-scale image recognition without normalization. InInternational conference on ma- chine learning, pages 1059–1071. PMLR, 2021. 5

  7. [7]

    Coyo-700m: Image-text pair dataset

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/ coyo-dataset, 2022. 1

  8. [8]

    Dataset distillation by matching training trajectories

    George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. InCVPR, 2022. 2, 4

  9. [9]

    Generalizing dataset distillation via deep generative prior

    George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Generalizing dataset distillation via deep generative prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3739–3748, 2023. 2

  10. [10]

    Selection via proxy: Efficient data se- lection for deep learning.arXiv preprint arXiv:1906.11829,

    Cody Coleman, Christopher Yeh, Stephen Mussmann, Baha- ran Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data se- lection for deep learning.arXiv preprint arXiv:1906.11829,

  11. [11]

    Dc- bench: Dataset condensation benchmark.Advances in Neural Information Processing Systems, 35:810–822, 2022

    Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Dc- bench: Dataset condensation benchmark.Advances in Neural Information Processing Systems, 35:810–822, 2022. 8

  12. [12]

    Scaling up dataset distillation to imagenet-1k with constant memory

    Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Scaling up dataset distillation to imagenet-1k with constant memory. InInternational Conference on Machine Learning, pages 6565–6590. PMLR, 2023. 2, 5, 6

  13. [13]

    Optical: Leveraging optimal transport for con- tribution allocation in dataset distillation

    Xiao Cui, Yulei Qin, Wengang Zhou, Hongsheng Li, and Houqiang Li. Optical: Leveraging optimal transport for con- tribution allocation in dataset distillation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15245–15254, 2025. 1

  14. [14]

    Ex- ploiting inter-sample and inter-feature relations in dataset distillation

    Wenxiao Deng, Wenbin Li, Tianyu Ding, Lei Wang, Hong- guang Zhang, Kuihua Huang, Jing Huo, and Yang Gao. Ex- ploiting inter-sample and inter-feature relations in dataset distillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17057– 17066, 2024. 1

  15. [15]

    Remember the past: Dis- tilling datasets into addressable memories for neural networks

    Zhiwei Deng and Olga Russakovsky. Remember the past: Dis- tilling datasets into addressable memories for neural networks. InNeurIPS, 2022. 2

  16. [16]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding.arXiv preprint arXiv:1810.04805, 2018. 5, 7

  17. [17]

    Minimizing the accumulated trajectory error to improve dataset distillation

    Jiawei Du, Yidi Jiang, Vincent YF Tan, Joey Tianyi Zhou, and Haizhou Li. Minimizing the accumulated trajectory error to improve dataset distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749–3758, 2023. 2

  18. [18]

    Adversarial Active Learning for Deep Networks: a Margin Based Approach

    Melanie Ducoffe and Frederic Precioso. Adversarial active learning for deep networks: a margin based approach.arXiv preprint arXiv:1802.09841, 2018. 5

  19. [19]

    Springer Science & Business Media, 2009

    Reza Zanjirani Farahani and Masoud Hekmatfar.Facility loca- tion: concepts, models, algorithms and case studies. Springer Science & Business Media, 2009. 2, 5, 6, 7

  20. [20]

    Deepcore: A comprehensive library for coreset selection in deep learning

    Chengcheng Guo, Bo Zhao, and Yanbing Bai. Deepcore: A comprehensive library for coreset selection in deep learning. InInternational Conference on Database and Expert Systems Applications, pages 181–195. Springer, 2022. 5

  21. [21]

    Algorithm as 136: A k-means clustering algorithm.Journal of the royal statistical society

    John A Hartigan and Manchek A Wong. Algorithm as 136: A k-means clustering algorithm.Journal of the royal statistical society. series c (applied statistics), 28(1):100–108, 1979. 4

  22. [22]

    You only condense once: Two rules for pruning condensed datasets

    Yang He, Lingao Xiao, and Joey Tianyi Zhou. You only condense once: Two rules for pruning condensed datasets. arXiv preprint arXiv:2310.14019, 2023. 2

  23. [23]

    Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Research, 47:853–899, 2013

    Micah Hodosh, Peter Young, and Julia Hockenmaier. Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Research, 47:853–899, 2013. 1, 5, 6, 7, 3, 4

  24. [24]

    Submodular combinatorial information measures with applications in machine learning

    Rishabh Iyer, Ninad Khargoankar, Jeff Bilmes, and Himanshu Asanani. Submodular combinatorial information measures with applications in machine learning. InAlgorithmic Learn- ing Theory, pages 722–754. PMLR, 2021. 5

  25. [25]

    Model stock: All we need is just a few fine-tuned models

    Dong-Hwan Jang, Sangdoo Yun, and Dongyoon Han. Model stock: All we need is just a few fine-tuned models. In European Conference on Computer Vision, pages 207–223. Springer, 2024. 4, 2, 6

  26. [26]

    Grad-match: Gra- dient matching based data subset selection for efficient deep model training

    Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ra- makrishnan, Abir De, and Rishabh Iyer. Grad-match: Gra- dient matching based data subset selection for efficient deep model training. InInternational Conference on Machine Learning, pages 5464–5474. PMLR, 2021. 5

  27. [27]

    Glister: Generalization based data subset selection for efficient and robust learning

    Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ra- makrishnan, and Rishabh Iyer. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI conference on artificial intelligence, pages 8110–8118, 2021. 5

  28. [28]

    On divergence measures for bayesian pseudocoresets.arXiv preprint arXiv:2210.06205, 2022

    Balhae Kim, Jungwon Choi, Seanie Lee, Yoonho Lee, Jung- Woo Ha, and Juho Lee. On divergence measures for bayesian pseudocoresets.arXiv preprint arXiv:2210.06205, 2022. 2

  29. [29]

    Dataset condensation via efficient synthetic-data pa- rameterization

    Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data pa- rameterization. InICML, 2022. 2

  30. [30]

    Computing geodesic paths on manifolds.Proceedings of the national academy of Sciences, 95(15):8431–8435, 1998

    Ron Kimmel and James A Sethian. Computing geodesic paths on manifolds.Proceedings of the national academy of Sciences, 95(15):8431–8435, 1998. 4

  31. [31]

    Dataset condensation with latent space knowledge factorization and sharing.arXiv preprint arXiv:2208.10494, 2022

    Hae Beom Lee, Dong Bok Lee, and Sung Ju Hwang. Dataset condensation with latent space knowledge factorization and sharing.arXiv preprint arXiv:2208.10494, 2022. 2

  32. [32]

    A comprehensive survey of dataset distillation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):17–32, 2023

    Shiye Lei and Dacheng Tao. A comprehensive survey of dataset distillation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):17–32, 2023. 1

  33. [33]

    Diversity-enhanced distribution alignment for dataset distillation

    Hongcheng Li, Yucan Zhou, Xiaoyan Gu, Bo Li, and Weiping Wang. Diversity-enhanced distribution alignment for dataset distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3747–3756, 2025. 1

  34. [34]

    Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InInternational Con- ference on Machine Learning, pages 12888–12900. PMLR,

  35. [35]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 1, 3, 5, 6, 7

  36. [36]

    Dataset distillation by automatic training trajectories

    Dai Liu, Jindong Gu, Hu Cao, Carsten Trinitis, and Martin Schulz. Dataset distillation by automatic training trajectories. InEuropean Conference on Computer Vision, pages 334–351. Springer, 2024. 2

  37. [37]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

  38. [38]

    Dataset distillation via the wasserstein metric

    Haoyang Liu, Yijiang Li, Tiancheng Xing, Peiran Wang, Vibhu Dalal, Luwei Li, Jingrui He, and Haohan Wang. Dataset distillation via the wasserstein metric. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1205–1215, 2025. 1

  39. [39]

    The evolution of dataset distillation: Toward scalable and generalizable solutions.arXiv preprint arXiv:2502.05673, 2025

    Ping Liu and Jiawei Du. The evolution of dataset distillation: Toward scalable and generalizable solutions.arXiv preprint arXiv:2502.05673, 2025. 1

  40. [40]

    Dataset distillation via factorization

    Songhua Liu, Kai Wang, Xingyi Yang, Jingwen Ye, and Xin- chao Wang. Dataset distillation via factorization. InNeurIPS,

  41. [41]

    Slimmable dataset condensation

    Songhua Liu, Jingwen Ye, Runpeng Yu, and Xinchao Wang. Slimmable dataset condensation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3759–3768, 2023. 2

  42. [42]

    Dream: Efficient dataset distillation by repre- sentative matching.arXiv preprint arXiv:2302.14416, 2023

    Yanqing Liu, Jianyang Gu, Kai Wang, Zheng Zhu, Wei Jiang, and Yang You. Dream: Efficient dataset distillation by repre- sentative matching.arXiv preprint arXiv:2302.14416, 2023. 2

  43. [43]

    Efficient dataset distillation using random feature approxima- tion

    Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Efficient dataset distillation using random feature approxima- tion. InNeurIPS, 2022. 2

  44. [44]

    Dataset distillation with convexified implicit gradients.arXiv preprint arXiv:2302.06755, 2023

    Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. Dataset distillation with convexified implicit gradients.arXiv preprint arXiv:2302.06755, 2023. 2

  45. [45]

    Bayesian pseudocoresets

    Dionysis Manousakas, Zuheng Xu, Cecilia Mascolo, and Trevor Campbell. Bayesian pseudocoresets. InNeurIPS,

  46. [46]

    Active learning by acquiring contrastive examples.arXiv preprint arXiv:2109.03764, 2021

    Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. Active learning by acquiring contrastive examples.arXiv preprint arXiv:2109.03764, 2021. 5

  47. [47]

    Geomm: On geodesic perspective for multi-modal learning

    Shibin Mei, Hang Wang, and Bingbing Ni. Geomm: On geodesic perspective for multi-modal learning. InProceed- ings of the Computer Vision and Pattern Recognition Confer- ence, pages 4776–4786, 2025. 4, 2

  48. [48]

    Coresets for data-efficient training of machine learning mod- els

    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning mod- els. InInternational Conference on Machine Learning, pages 6950–6960. PMLR, 2020. 5

  49. [49]

    Dataset meta-learning from kernel ridge-regression.arXiv preprint arXiv:2011.00050, 2020

    Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression.arXiv preprint arXiv:2011.00050, 2020. 2

  50. [50]

    Dataset distillation with infinitely wide convolutional networks

    Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. Dataset distillation with infinitely wide convolutional networks. InNeurIPS, 2021. 2

  51. [51]

    Deep learning on a data diet: Finding important examples early in training.Advances in neural information processing systems, 34:20596–20607, 2021

    Mansheej Paul, Surya Ganguli, and Gintare Karolina Dz- iugaite. Deep learning on a data diet: Finding important examples early in training.Advances in neural information processing systems, 34:20596–20607, 2021. 5

  52. [52]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3

  53. [53]

    Datadam: Efficient dataset distillation with attention matching

    Ahmad Sajedi, Samir Khaki, Ehsan Amjadian, Lucy Z Liu, Yuri A Lawryshyn, and Konstantinos N Plataniotis. Datadam: Efficient dataset distillation with attention matching. InPro- ceedings of the IEEE/CVF International Conference on Com- puter Vision, pages 17097–17107, 2023. 8

  54. [54]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019. 7

  55. [55]

    Laion-5b: An open large-scale dataset for training next gen- eration image-text models.Advances in Neural Information Processing Systems, 35:25278–25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next gen- eration image-text models.Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 1

  56. [56]

    Active Learning for Convolutional Neural Networks: A Core-Set Approach

    Ozan Sener and Silvio Savarese. Active learning for convolu- tional neural networks: A core-set approach.arXiv preprint arXiv:1708.00489, 2017. 5

  57. [57]

    Fre- quency domain-based dataset distillation.Advances in Neural Information Processing Systems, 36:70033–70044, 2023

    Donghyeok Shin, Seungjae Shin, and Il-Chul Moon. Fre- quency domain-based dataset distillation.Advances in Neural Information Processing Systems, 36:70033–70044, 2023. 1, 2

  58. [58]

    Fyi: Flip your images for dataset distillation

    Byunggwan Son, Youngmin Oh, Donghyeon Baek, and Bum- sub Ham. Fyi: Flip your images for dataset distillation. In European Conference on Computer Vision, pages 214–230. Springer, 2024

  59. [59]

    D^4m: Dataset distillation via disentangled diffusion model

    Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, and Bowen Tang. D^4m: Dataset distillation via disentangled diffusion model. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 5809–5818, 2024. 1

  60. [60]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking mul- timodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 1

  61. [61]

    Yfcc100m: The new data in multimedia research

    Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li- Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 1

  62. [62]

    Con- structing bayesian pseudo-coresets using contrastive diver- gence.arXiv preprint arXiv:2303.11278, 2023

    Piyush Tiwary, Kumar Shubham, Vivek Kashyap, et al. Con- structing bayesian pseudo-coresets using contrastive diver- gence.arXiv preprint arXiv:2303.11278, 2023. 2

  63. [63]

    Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159,

  64. [64]

    Haoxuan Wang, Zhenghao Zhao, Junyi Wu, Yuzhang Shang, Gaowen Liu, and Yan Yan. Cao2: Rectifying inconsistencies in diffusion-based dataset distillation, 2025. 1

  65. [65]

    Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. Cafe: Learning to condense dataset by aligning features. In CVPR, 2022. 2

  66. [66]

    Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018. 1, 8

  67. [67]

    Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128, 2009. 2, 5, 6, 7

  68. [68]

    Xindi Wu, Byron Zhang, Zhiwei Deng, and Olga Russakovsky. Vision-language dataset distillation, 2024. TMLR

  69. [69]

    Yue Xu, Zhilin Lin, Yusong Qiu, Cewu Lu, and Yong-Lu Li. Low-rank similarity mining for multimodal dataset distillation. In Proceedings of the 41st International Conference on Machine Learning, pages 55144–55161. PMLR, 2024. 2, 3, 4, 5, 6, 7

  70. [70]

    Zeyuan Yin and Zhiqiang Shen. Dataset distillation via curriculum data synthesis in large data era. Transactions on Machine Learning Research, 2024. 8

  71. [71]

    Zeyuan Yin, Eric Xing, and Zhiqiang Shen. Squeeze, recover and relabel: Dataset condensation at imagenet scale from a new perspective. arXiv preprint arXiv:2306.13092, 2023. 2

  72. [72]

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014. 1, 3, 5, 6, 7

  73. [73]

    Ruonan Yu, Songhua Liu, and Xinchao Wang. Dataset distillation: A comprehensive review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):150–170, 2023. 1

  74. [74]

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 1

  75. [75]

    Hansong Zhang, Shikun Li, Fanzhao Lin, Weiping Wang, Zhenxing Qian, and Shiming Ge. Dance: Dual-view distribution alignment for dataset condensation. arXiv preprint arXiv:2406.01063, 2024. 2, 8

  76. [76]

    Hansong Zhang, Shikun Li, Pengju Wang, Dan Zeng, and Shiming Ge. M3d: Dataset condensation by minimizing maximum mean discrepancy. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9314–9322, 2024. 1

  77. [77]

    Lei Zhang, Jie Zhang, Bowen Lei, Subhabrata Mukherjee, Xiang Pan, Bo Zhao, Caiwen Ding, Yao Li, and Dongkuan Xu. Accelerating dataset distillation via model augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11950–11959, 2023. 2

  78. [78]

    Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In ICML, 2021. 2

  79. [79]

    Bo Zhao and Hakan Bilen. Synthesizing informative training samples with gan. arXiv preprint arXiv:2204.07513, 2022. 2

  80. [80]

    Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In WACV, 2023. 2, 3

Showing first 80 references.