Multimodal Distribution Matching for Vision-Language Dataset Distillation

Hoyong Kwon; Jongoh Jeong; Kuk-Jin Yoon; Minseok Kim

arxiv: 2605.23482 · v1 · pith:IG5MMYZPnew · submitted 2026-05-22 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-25 04:27 UTC · model grok-4.3

keywords: dataset distillation, vision-language, multimodal distillation, image-text retrieval, distribution matching, synthetic datasets

The pith

MDM produces compact synthetic image-text datasets that preserve multimodal semantics and retrieval performance with reduced computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Multimodal Distribution Matching (MDM) as a way to distill large vision-language datasets into smaller synthetic versions. It combines three components: sampling from clusters in the joint embedding space to initialize the synthetic data, interpolating fine-tuned models in weight space according to their angular deviation from the pretrained anchor to form a mixed teacher, and matching distributions on the unit hypersphere with a geometry-aware loss. This setup aims to keep cross-modal alignments intact while using less compute than previous methods. If successful, it would allow efficient creation of multimodal training data that remains effective when the model trained on it has a different architecture from the one used for distillation.
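The data-level step can be illustrated with a minimal sketch. It assumes that the joint embedding of a pair is the concatenation of its L2-normalized image and text embeddings, and that each synthetic pair is initialized from the real pair closest to a k-means centroid. The source only says that pairs are sampled from clusters in the joint embedding space, so both choices, along with the helper names `kmeans` and `init_indices`, are hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for each point.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def init_indices(image_emb, text_emb, k, seed=0):
    """Pick one real pair per non-empty cluster: the member closest to its centroid."""
    # Assumption: joint embedding = concatenation of the normalized modalities.
    joint = [l2_normalize(i) + l2_normalize(t) for i, t in zip(image_emb, text_emb)]
    centroids, assign = kmeans(joint, k, seed=seed)
    chosen = []
    for c in range(k):
        members = [n for n, a in enumerate(assign) if a == c]
        if members:
            chosen.append(min(members, key=lambda n: sq_dist(joint[n], centroids[c])))
    return sorted(chosen)
```

The returned indices point at real image-text pairs that would seed the synthetic set; how MDM actually samples within a cluster is not specified in the text above.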

Core claim

The central discovery is that integrating cluster sampling in the joint embedding space, angular interpolation of fine-tuned models to form a mixed teacher, and matching joint distributions on the unit hypersphere with a geometry-aware objective that uses cross-modal agreement, discrepancy, and symmetric contrastive learning produces synthetic image-text pairs that maintain performance on retrieval tasks.
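A rough sketch of the model-level step follows. It assumes a rule modeled on the Model Stock interpolation ratio: average the N fine-tuned weight vectors, measure the mean pairwise angle θ between their displacements from the pretrained anchor, and set t = N·cos θ / (1 + (N − 1)·cos θ). Whether MDM uses this exact ratio, or applies it per layer, is not confirmed by the source; the function name `mixed_teacher` and the flat-list weight format are illustrative only.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mixed_teacher(pretrained, finetuned):
    """Interpolate between the anchor and the fine-tuned average.

    pretrained: flat list of weights (the anchor w0).
    finetuned:  list of N >= 2 flat weight lists (w1..wN).
    Returns (weights, t), where t is the interpolation ratio.
    """
    n = len(finetuned)
    deltas = [[w - w0 for w, w0 in zip(ft, pretrained)] for ft in finetuned]
    # Mean pairwise cosine between fine-tuned displacements from the anchor.
    pairs = list(combinations(range(n), 2))
    cos_theta = sum(cosine(deltas[i], deltas[j]) for i, j in pairs) / len(pairs)
    cos_theta = max(cos_theta, 0.0)  # assumption: clamp so that t stays in [0, 1]
    # Assumed ratio: a smaller angle puts more trust in the fine-tuned average.
    t = n * cos_theta / (1.0 + (n - 1) * cos_theta)
    avg = [sum(col) / n for col in zip(*finetuned)]
    mixed = [t * a + (1.0 - t) * w0 for a, w0 in zip(avg, pretrained)]
    return mixed, t
```

Under this assumed rule, fine-tuned models that moved in orthogonal directions give t = 0 (the teacher falls back to the anchor), while models that moved in the same direction give t = 1 (the teacher is the plain average).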

What carries the argument

The geometry-aware matching objective on the unit hypersphere, which matches distributions by exploiting features in agreement and discrepancy directions along with symmetric contrastive learning.
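To make the shape of such an objective concrete, here is a minimal sketch under stated assumptions. For a unit image embedding u and unit text embedding v, it takes the agreement direction to be the normalized u + v and the discrepancy direction to be the normalized u − v, and it compares real and synthetic sets with an MMD-style energy under a Gaussian kernel of the geodesic distance arccos(x·y). None of these definitions, nor the bandwidth `sigma`, come from the source, and a Gaussian kernel of geodesic distance is not guaranteed to be positive definite on the sphere, so this is an illustration rather than the paper's loss.

```python
import math


def unit(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def geodesic(a, b):
    """Great-circle distance between unit vectors (dot product clamped to [-1, 1])."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))


def kernel(a, b, sigma=0.5):
    """Gaussian kernel of the geodesic distance (an assumed choice)."""
    return math.exp(-geodesic(a, b) ** 2 / (2.0 * sigma ** 2))


def kernel_energy(xs, ys, sigma=0.5):
    """MMD-style discrepancy: E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]."""
    def mean_k(p, q):
        return sum(kernel(a, b, sigma) for a in p for b in q) / (len(p) * len(q))
    return mean_k(xs, xs) - 2.0 * mean_k(xs, ys) + mean_k(ys, ys)


def directions(image_emb, text_emb):
    """Assumed agreement (u + v) and discrepancy (u - v) directions, renormalized."""
    agree, differ = [], []
    for i, t in zip(image_emb, text_emb):
        u, v = unit(i), unit(t)
        agree.append(unit([a + b for a, b in zip(u, v)]))
        differ.append(unit([a - b for a, b in zip(u, v)]))
    return agree, differ


def matching_loss(real_img, real_txt, syn_img, syn_txt, sigma=0.5):
    """Sum of kernel energies over the agreement and discrepancy directions."""
    ra, rd = directions(real_img, real_txt)
    sa, sd = directions(syn_img, syn_txt)
    return kernel_energy(ra, sa, sigma) + kernel_energy(rd, sd, sigma)
```

In this toy form, swapping the image and text embeddings of a pair leaves the agreement direction unchanged but flips the discrepancy direction, which is why the second term can detect a modality mix-up that the first term misses.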

If this is right

MDM yields compact synthetic sets that preserve multimodal semantics on image-text retrieval benchmarks.
Distillation cost is substantially reduced compared to prior methods.
Performance remains robust across different architectures in cross-architecture evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on other multimodal tasks such as captioning or visual reasoning to see if alignment preservation transfers.
Lower compute requirements might make dataset distillation feasible for smaller research groups without access to large clusters.
Future work could explore whether these synthetic sets improve generalization when used in combination with real data.

Load-bearing premise

The three components of cluster sampling, angular model interpolation, and hyperspherical geometry-aware matching will together preserve cross-modal alignment without the heavy compute of earlier approaches.

What would settle it

A clear falsifier would be if the synthetic datasets generated by MDM show significantly degraded image-text retrieval accuracy compared to real data when evaluated using a model architecture different from those used in distillation.

Figures

Figures reproduced from arXiv: 2605.23482 by Hoyong Kwon, Jongoh Jeong, Kuk-Jin Yoon, Minseok Kim.

**Figure 2.** Overview of MDM. Our MDM method consists of (i) synthetic data initialization using k-means clustering, (ii) image-text model initialization using weight-space interpolation between a pretrained and N finetuned models, and (iii) multimodal distribution matching that minimizes geodesic kernel energy between real and synthetic pairs on the unit hypersphere.

**Figure 3.** Qualitative results of synthesized data. We compare the initial (left) and distilled samples (right).

**Figure 4.** Performance curve across datasets and data pairs.

Original abstract

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computes and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
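The abstract's "symmetric contrastive learning" term is not written out in the text available here. A common reading is a CLIP-style symmetric InfoNCE loss, sketched below as an assumption: matched image-text pairs share an index, similarities are cosine scores divided by a temperature, and the image-to-text and text-to-image cross-entropies are averaged. The temperature value and function names are illustrative.

```python
import math


def unit(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def diag_nll(logits):
    """Mean negative log-probability of the diagonal entry in each row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE losses."""
    imgs = [unit(v) for v in image_emb]
    txts = [unit(v) for v in text_emb]
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diag_nll(logits) + diag_nll(transposed))
```

The loss is small when each image is most similar to its own caption and large when captions are shuffled, which is the alignment signal a distilled set would need to preserve.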

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MDM's three-level framework for multimodal distillation is new on paper but the angular-deviation weight interpolation lacks any shown link to preserved cross-modal alignment.

The main point for you is that this paper puts forward MDM as an integrated method that clusters in the joint embedding space, mixes fine-tuned models by their angular deviation from a pretrained anchor, and matches distributions on the hypersphere with agreement and discrepancy terms plus a symmetric contrastive loss. Judging from the abstract alone, that combination across the data, model, and loss levels does not appear in prior distillation work, so the framework itself is the novelty claim.

Referee Report

2 major / 2 minor

Summary. The paper proposes Multimodal Distribution Matching (MDM), a geometry-aware framework for vision-language dataset distillation. It initializes synthetic image-text pairs by sampling from clusters in the joint embedding space, forms a mixed teacher via weight-space interpolation of independently fine-tuned models according to angular deviation from the pretrained anchor, and applies a geometry-aware matching objective on the unit hypersphere that exploits cross-modal agreement/discrepancy directions together with symmetric contrastive learning. On image-text retrieval benchmarks with cross-architecture evaluation, the method is claimed to yield compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

Significance. If the claims hold, MDM could meaningfully lower the compute barrier for VL dataset distillation while maintaining cross-modal fidelity, which would be useful for resource-constrained settings. The three-level integration (data, model, loss) is a coherent design choice, but the significance is tempered by the absence of any reported quantitative cost reductions, ablation results on the interpolation step, or direct comparisons showing superiority over prior multimodal distillation baselines.

major comments (2)

[Model-level component (abstract and §3)] The claim that interpolating independently fine-tuned models by angular deviation from the pretrained anchor produces a mixed teacher whose joint vision-language representations remain aligned is load-bearing for the entire pipeline, yet no derivation, correlation analysis, or ablation is provided showing that angular deviation in parameter space correlates with cross-modal agreement on retrieval metrics. Without this, the downstream geometry-aware matching on the hypersphere cannot be guaranteed to distill faithful pairs.
[Evaluation section] The abstract asserts 'substantially reduce distillation cost' and 'remain robust across architectures,' but the provided text contains no tables, figures, or numerical results quantifying cost (e.g., GPU-hours or memory) or cross-architecture retrieval metrics (e.g., R@1 deltas), making it impossible to verify whether the central efficiency and robustness claims are supported.
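For readers unfamiliar with the retrieval metric named in the second comment, a minimal sketch of Recall@K and an R@1 delta follows. It assumes one ground-truth candidate per query at the same index; real benchmarks often pair several captions with each image, so this is a simplification, and the function names are illustrative.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of candidates scoring strictly higher than the true match.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


def r_at_1_delta(sim_real, sim_synthetic):
    """R@1 from a model trained on synthetic data minus R@1 from one trained on real data."""
    return recall_at_k(sim_synthetic, 1) - recall_at_k(sim_real, 1)
```

A cross-architecture table of the kind the referee asks for would report such deltas for each evaluation model, alongside the distillation cost.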

minor comments (2)

[Loss level (abstract and §4)] Notation for the geometry-aware objective (loss level) is introduced without an explicit equation or pseudocode, which hinders reproducibility.
[Introduction / abstract] The abstract refers to 'prior methods often require heavy computes' without citing specific multimodal distillation baselines or their reported costs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of the model-level component and the supporting evaluation results.

Referee: Model-level component (abstract and §3 description): the claim that interpolating independently fine-tuned models by angular deviation from the pretrained anchor produces a mixed teacher whose joint vision-language representations remain aligned is load-bearing for the entire pipeline, yet no derivation, correlation analysis, or ablation is provided showing that angular deviation in parameter space correlates with cross-modal agreement on retrieval metrics. Without this, the downstream geometry-aware matching on the hypersphere cannot be guaranteed to distill faithful pairs.

Authors: We agree that the manuscript currently lacks an explicit derivation or empirical analysis linking angular deviation in parameter space to cross-modal agreement on retrieval metrics. This is a valid observation. In the revised version we will add a dedicated subsection containing (i) a short geometric argument relating angular deviation to representation drift and (ii) a correlation study plus ablation that quantifies how the interpolation step affects downstream retrieval performance. These additions will directly support the load-bearing claim. revision: yes
Referee: Evaluation section: the abstract asserts 'substantially reduce distillation cost' and 'remain robust across architectures,' but the provided text contains no tables, figures, or numerical results quantifying cost (e.g., GPU-hours or memory) or cross-architecture retrieval metrics (e.g., R@1 deltas), making it impossible to verify whether the central efficiency and robustness claims are supported.

Authors: The referee is correct that the submitted manuscript text does not contain the requested quantitative tables or figures for distillation cost (GPU-hours, memory) or cross-architecture R@1 deltas. We will insert new tables and figures reporting these metrics, including direct comparisons against prior multimodal distillation baselines, to substantiate the efficiency and robustness claims made in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity; high-level method description contains no equations or self-referential reductions.

The provided abstract and method summary describe MDM via three complementary components (data-level cluster sampling, model-level angular interpolation of fine-tuned models, loss-level geometry-aware matching on the hypersphere) but supply no equations, no fitted parameters renamed as predictions, and no self-citations that bear the central claim. The interpolation step is presented as an independent modeling choice rather than a definitional tautology, and the overall framework is not shown to reduce to its inputs by construction. Absent any load-bearing derivation chain that collapses, the paper is self-contained at the level of description given.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities can be identified from the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

matches joint distributions on the unit hypersphere using a geometry-aware matching objective... geodesic kernel energies over cross-modal agreement and discrepancy directions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 8 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Contextual diversity for active learning

Sharat Agarwal, Himanshu Arora, Saket Anand, and Chetan Arora. Contextual diversity for active learning. InEuropean Conference on Computer Vision, pages 137–153. Springer,

work page
[3]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

work page
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Dataset distillation as data compression: A rate-utility perspective

Youneng Bao, Yiping Liu, Zhuo Chen, Yongsheng Liang, Mu Li, and Kede Ma. Dataset distillation as data compression: A rate-utility perspective. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 519– 529, 2025. 1

work page 2025
[6]

High-performance large-scale image recognition without normalization

Andy Brock, Soham De, Samuel L Smith, and Karen Si- monyan. High-performance large-scale image recognition without normalization. InInternational conference on ma- chine learning, pages 1059–1071. PMLR, 2021. 5

work page 2021
[7]

Coyo-700m: Image-text pair dataset

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/ coyo-dataset, 2022. 1

work page 2022
[8]

Dataset distillation by matching training trajectories

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. InCVPR, 2022. 2, 4

work page 2022
[9]

Generalizing dataset distillation via deep generative prior

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Generalizing dataset distillation via deep generative prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3739–3748, 2023. 2

work page 2023
[10]

Selection via proxy: Efficient data se- lection for deep learning.arXiv preprint arXiv:1906.11829,

Cody Coleman, Christopher Yeh, Stephen Mussmann, Baha- ran Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data se- lection for deep learning.arXiv preprint arXiv:1906.11829,

work page arXiv 1906
[11]

Dc- bench: Dataset condensation benchmark.Advances in Neural Information Processing Systems, 35:810–822, 2022

Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Dc- bench: Dataset condensation benchmark.Advances in Neural Information Processing Systems, 35:810–822, 2022. 8

work page 2022
[12]

Scaling up dataset distillation to imagenet-1k with constant memory

Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Scaling up dataset distillation to imagenet-1k with constant memory. InInternational Conference on Machine Learning, pages 6565–6590. PMLR, 2023. 2, 5, 6

work page 2023
[13]

Optical: Leveraging optimal transport for con- tribution allocation in dataset distillation

Xiao Cui, Yulei Qin, Wengang Zhou, Hongsheng Li, and Houqiang Li. Optical: Leveraging optimal transport for con- tribution allocation in dataset distillation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15245–15254, 2025. 1

work page 2025
[14]

Ex- ploiting inter-sample and inter-feature relations in dataset distillation

Wenxiao Deng, Wenbin Li, Tianyu Ding, Lei Wang, Hong- guang Zhang, Kuihua Huang, Jing Huo, and Yang Gao. Ex- ploiting inter-sample and inter-feature relations in dataset distillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17057– 17066, 2024. 1

work page 2024
[15]

Remember the past: Dis- tilling datasets into addressable memories for neural networks

Zhiwei Deng and Olga Russakovsky. Remember the past: Dis- tilling datasets into addressable memories for neural networks. InNeurIPS, 2022. 2

work page 2022
[16]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding.arXiv preprint arXiv:1810.04805, 2018. 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2018
[17]

Minimizing the accumulated trajectory error to improve dataset distillation

Jiawei Du, Yidi Jiang, Vincent YF Tan, Joey Tianyi Zhou, and Haizhou Li. Minimizing the accumulated trajectory error to improve dataset distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749–3758, 2023. 2

work page 2023
[18]

Adversarial Active Learning for Deep Networks: a Margin Based Approach

Melanie Ducoffe and Frederic Precioso. Adversarial active learning for deep networks: a margin based approach.arXiv preprint arXiv:1802.09841, 2018. 5

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

Springer Science & Business Media, 2009

Reza Zanjirani Farahani and Masoud Hekmatfar.Facility loca- tion: concepts, models, algorithms and case studies. Springer Science & Business Media, 2009. 2, 5, 6, 7

work page 2009
[20]

Deepcore: A comprehensive library for coreset selection in deep learning

Chengcheng Guo, Bo Zhao, and Yanbing Bai. Deepcore: A comprehensive library for coreset selection in deep learning. InInternational Conference on Database and Expert Systems Applications, pages 181–195. Springer, 2022. 5

work page 2022
[21]

Algorithm as 136: A k-means clustering algorithm.Journal of the royal statistical society

John A Hartigan and Manchek A Wong. Algorithm as 136: A k-means clustering algorithm.Journal of the royal statistical society. series c (applied statistics), 28(1):100–108, 1979. 4

work page 1979
[22]

You only condense once: Two rules for pruning condensed datasets

Yang He, Lingao Xiao, and Joey Tianyi Zhou. You only condense once: Two rules for pruning condensed datasets. arXiv preprint arXiv:2310.14019, 2023. 2

work page arXiv 2023
[23]

Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Research, 47:853–899, 2013

Micah Hodosh, Peter Young, and Julia Hockenmaier. Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Research, 47:853–899, 2013. 1, 5, 6, 7, 3, 4

work page 2013
[24]

Submodular combinatorial information measures with applications in machine learning

Rishabh Iyer, Ninad Khargoankar, Jeff Bilmes, and Himanshu Asanani. Submodular combinatorial information measures with applications in machine learning. InAlgorithmic Learn- ing Theory, pages 722–754. PMLR, 2021. 5

work page 2021
[25]

Model stock: All we need is just a few fine-tuned models

Dong-Hwan Jang, Sangdoo Yun, and Dongyoon Han. Model stock: All we need is just a few fine-tuned models. In European Conference on Computer Vision, pages 207–223. Springer, 2024. 4, 2, 6

work page 2024
[26]

Grad-match: Gra- dient matching based data subset selection for efficient deep model training

Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ra- makrishnan, Abir De, and Rishabh Iyer. Grad-match: Gra- dient matching based data subset selection for efficient deep model training. InInternational Conference on Machine Learning, pages 5464–5474. PMLR, 2021. 5

work page 2021
[27]

Glister: Generalization based data subset selection for efficient and robust learning

Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ra- makrishnan, and Rishabh Iyer. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI conference on artificial intelligence, pages 8110–8118, 2021. 5

work page 2021
[28]

On divergence measures for bayesian pseudocoresets.arXiv preprint arXiv:2210.06205, 2022

Balhae Kim, Jungwon Choi, Seanie Lee, Yoonho Lee, Jung- Woo Ha, and Juho Lee. On divergence measures for bayesian pseudocoresets.arXiv preprint arXiv:2210.06205, 2022. 2

work page arXiv 2022
[29]

Dataset condensation via efficient synthetic-data pa- rameterization

Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data pa- rameterization. InICML, 2022. 2

work page 2022
[30]

Computing geodesic paths on manifolds.Proceedings of the national academy of Sciences, 95(15):8431–8435, 1998

Ron Kimmel and James A Sethian. Computing geodesic paths on manifolds.Proceedings of the national academy of Sciences, 95(15):8431–8435, 1998. 4

work page 1998
[31]

Dataset condensation with latent space knowledge factorization and sharing.arXiv preprint arXiv:2208.10494, 2022

Hae Beom Lee, Dong Bok Lee, and Sung Ju Hwang. Dataset condensation with latent space knowledge factorization and sharing.arXiv preprint arXiv:2208.10494, 2022. 2

work page arXiv 2022
[32]

A comprehensive survey of dataset distillation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):17–32, 2023

Shiye Lei and Dacheng Tao. A comprehensive survey of dataset distillation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):17–32, 2023. 1

work page 2023
[33]

Diversity-enhanced distribution alignment for dataset distillation

Hongcheng Li, Yucan Zhou, Xiaoyan Gu, Bo Li, and Weiping Wang. Diversity-enhanced distribution alignment for dataset distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3747–3756, 2025. 1

work page 2025
[34]

Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InInternational Con- ference on Machine Learning, pages 12888–12900. PMLR,

work page
[35]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 1, 3, 5, 6, 7

work page 2014
[36]

Dataset distillation by automatic training trajectories

Dai Liu, Jindong Gu, Hu Cao, Carsten Trinitis, and Martin Schulz. Dataset distillation by automatic training trajectories. InEuropean Conference on Computer Vision, pages 334–351. Springer, 2024. 2

work page 2024
[37]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

work page 2023
[38]

Dataset distillation via the wasserstein metric

Haoyang Liu, Yijiang Li, Tiancheng Xing, Peiran Wang, Vibhu Dalal, Luwei Li, Jingrui He, and Haohan Wang. Dataset distillation via the wasserstein metric. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1205–1215, 2025. 1

work page 2025
[39]

The evolution of dataset distillation: Toward scalable and generalizable solutions.arXiv preprint arXiv:2502.05673, 2025

Ping Liu and Jiawei Du. The evolution of dataset distillation: Toward scalable and generalizable solutions.arXiv preprint arXiv:2502.05673, 2025. 1

work page arXiv 2025
[40]

Dataset distillation via factorization

Songhua Liu, Kai Wang, Xingyi Yang, Jingwen Ye, and Xin- chao Wang. Dataset distillation via factorization. InNeurIPS,

work page
[41]

Slimmable dataset condensation

Songhua Liu, Jingwen Ye, Runpeng Yu, and Xinchao Wang. Slimmable dataset condensation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3759–3768, 2023. 2

work page 2023
[42]

Dream: Efficient dataset distillation by repre- sentative matching.arXiv preprint arXiv:2302.14416, 2023

Yanqing Liu, Jianyang Gu, Kai Wang, Zheng Zhu, Wei Jiang, and Yang You. Dream: Efficient dataset distillation by repre- sentative matching.arXiv preprint arXiv:2302.14416, 2023. 2

work page arXiv 2023
[43]

Efficient dataset distillation using random feature approxima- tion

Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Efficient dataset distillation using random feature approxima- tion. InNeurIPS, 2022. 2

work page 2022
[44]

Dataset distillation with convexified implicit gradients.arXiv preprint arXiv:2302.06755, 2023

Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. Dataset distillation with convexified implicit gradients.arXiv preprint arXiv:2302.06755, 2023. 2

work page arXiv 2023
[45]

Bayesian pseudocoresets

Dionysis Manousakas, Zuheng Xu, Cecilia Mascolo, and Trevor Campbell. Bayesian pseudocoresets. InNeurIPS,

work page
[46]

Active learning by acquiring contrastive examples.arXiv preprint arXiv:2109.03764, 2021

Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. Active learning by acquiring contrastive examples.arXiv preprint arXiv:2109.03764, 2021. 5

work page arXiv 2021
[47]

Geomm: On geodesic perspective for multi-modal learning

Shibin Mei, Hang Wang, and Bingbing Ni. Geomm: On geodesic perspective for multi-modal learning. InProceed- ings of the Computer Vision and Pattern Recognition Confer- ence, pages 4776–4786, 2025. 4, 2

work page 2025
[48]

Coresets for data-efficient training of machine learning mod- els

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning mod- els. InInternational Conference on Machine Learning, pages 6950–6960. PMLR, 2020. 5

work page 2020
[49]

Dataset meta-learning from kernel ridge-regression.arXiv preprint arXiv:2011.00050, 2020

Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression.arXiv preprint arXiv:2011.00050, 2020. 2

work page arXiv 2011
[50]

Dataset distillation with infinitely wide convolutional networks

Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. Dataset distillation with infinitely wide convolutional networks. InNeurIPS, 2021. 2

work page 2021
[51]

Deep learning on a data diet: Finding important examples early in training.Advances in neural information processing systems, 34:20596–20607, 2021

Mansheej Paul, Surya Ganguli, and Gintare Karolina Dz- iugaite. Deep learning on a data diet: Finding important examples early in training.Advances in neural information processing systems, 34:20596–20607, 2021. 5

work page 2021
[52]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3

work page 2021
[53]

Datadam: Efficient dataset distillation with attention matching

Ahmad Sajedi, Samir Khaki, Ehsan Amjadian, Lucy Z Liu, Yuri A Lawryshyn, and Konstantinos N Plataniotis. Datadam: Efficient dataset distillation with attention matching. InPro- ceedings of the IEEE/CVF International Conference on Com- puter Vision, pages 17097–17107, 2023. 8

work page 2023
[54]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019. 7

work page internal anchor Pith review Pith/arXiv arXiv 1910
[55]

Laion-5b: An open large-scale dataset for training next gen- eration image-text models.Advances in Neural Information Processing Systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next gen- eration image-text models.Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 1

work page 2022
[56]

Active Learning for Convolutional Neural Networks: A Core-Set Approach

Ozan Sener and Silvio Savarese. Active learning for convolu- tional neural networks: A core-set approach.arXiv preprint arXiv:1708.00489, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017
[57]

Fre- quency domain-based dataset distillation.Advances in Neural Information Processing Systems, 36:70033–70044, 2023

Donghyeok Shin, Seungjae Shin, and Il-Chul Moon. Fre- quency domain-based dataset distillation.Advances in Neural Information Processing Systems, 36:70033–70044, 2023. 1, 2

work page 2023
[58]

Byunggwan Son, Youngmin Oh, Donghyeon Baek, and Bumsub Ham. FYI: Flip your images for dataset distillation. In European Conference on Computer Vision, pages 214–230. Springer, 2024.

[59]

Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, and Bowen Tang. D^4M: Dataset distillation via disentangled diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5809–5818, 2024.

[60]

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[61]

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

[62]

Piyush Tiwary, Kumar Shubham, Vivek Kashyap, et al. Constructing Bayesian pseudo-coresets using contrastive divergence. arXiv preprint arXiv:2303.11278, 2023.

[63]

Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159.

[64]

Haoxuan Wang, Zhenghao Zhao, Junyi Wu, Yuzhang Shang, Gaowen Liu, and Yan Yan. CaO2: Rectifying inconsistencies in diffusion-based dataset distillation, 2025.

[65]

Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. CAFE: Learning to condense dataset by aligning features. In CVPR, 2022.

[66]

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.

[67]

Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128, 2009.

[68]

Xindi Wu, Byron Zhang, Zhiwei Deng, and Olga Russakovsky. Vision-language dataset distillation, 2024. TMLR.

[69]

Yue Xu, Zhilin Lin, Yusong Qiu, Cewu Lu, and Yong-Lu Li. Low-rank similarity mining for multimodal dataset distillation. In Proceedings of the 41st International Conference on Machine Learning, pages 55144–55161. PMLR, 2024.

[70]

Zeyuan Yin and Zhiqiang Shen. Dataset distillation via curriculum data synthesis in large data era. Transactions on Machine Learning Research, 2024.

[71]

Zeyuan Yin, Eric Xing, and Zhiqiang Shen. Squeeze, recover and relabel: Dataset condensation at ImageNet scale from a new perspective. arXiv preprint arXiv:2306.13092, 2023.

[72]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

[73]

Ruonan Yu, Songhua Liu, and Xinchao Wang. Dataset distillation: A comprehensive review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(1):150–170, 2023.

[74]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.

[75]

Hansong Zhang, Shikun Li, Fanzhao Lin, Weiping Wang, Zhenxing Qian, and Shiming Ge. DANCE: Dual-view distribution alignment for dataset condensation. arXiv preprint arXiv:2406.01063, 2024.

[76]

Hansong Zhang, Shikun Li, Pengju Wang, Dan Zeng, and Shiming Ge. M3D: Dataset condensation by minimizing maximum mean discrepancy. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9314–9322, 2024.

[77]

Lei Zhang, Jie Zhang, Bowen Lei, Subhabrata Mukherjee, Xiang Pan, Bo Zhao, Caiwen Ding, Yao Li, and Dongkuan Xu. Accelerating dataset distillation via model augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11950–11959, 2023.

[78]

Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In ICML, 2021.

[79]

Bo Zhao and Hakan Bilen. Synthesizing informative training samples with GAN. arXiv preprint arXiv:2204.07513, 2022.

[80]

Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In WACV, 2023.

