Multimodal Distribution Matching for Vision-Language Dataset Distillation
Pith reviewed 2026-05-25 04:27 UTC · model grok-4.3
The pith
MDM produces compact synthetic image-text datasets that preserve multimodal semantics and retrieval performance with reduced computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that three components, used together, produce synthetic image-text pairs that maintain performance on retrieval tasks: cluster sampling in the joint embedding space; angular interpolation of fine-tuned models to form a mixed teacher; and joint-distribution matching on the unit hypersphere with a geometry-aware objective that combines cross-modal agreement directions, discrepancy directions, and symmetric contrastive learning.
What carries the argument
The geometry-aware matching objective on the unit hypersphere, which matches distributions by exploiting features in agreement and discrepancy directions along with symmetric contrastive learning.
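The symmetric contrastive part of that objective can be illustrated with a minimal sketch. It is the widely used CLIP-style image-to-text plus text-to-image loss on unit-normalized embeddings, not the paper's full objective, which is not given in the text above. The function names and the temperature value of 0.07 are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row i of img_emb is assumed to be paired with row i of txt_emb.
    The temperature is a hypothetical default, not a value from the paper.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature          # cosine similarities, scaled
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its image
    return 0.5 * (loss_i2t + loss_t2i)
```

Correctly paired embeddings should score a lower loss than the same embeddings with the pairing shuffled, which is a quick sanity check when wiring this term into a larger objective.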
If this is right
- MDM yields compact synthetic sets that preserve multimodal semantics on image-text retrieval benchmarks.
- Distillation cost is substantially reduced compared to prior methods.
- Performance remains robust across different architectures in cross-architecture evaluations.
Where Pith is reading between the lines
- The approach could be tested on other multimodal tasks such as captioning or visual reasoning to see if alignment preservation transfers.
- Lower compute requirements might make dataset distillation feasible for smaller research groups without access to large clusters.
- Future work could explore whether these synthetic sets improve generalization when used in combination with real data.
Load-bearing premise
The three components of cluster sampling, angular model interpolation, and hyperspherical geometry-aware matching will together preserve cross-modal alignment without the heavy compute of earlier approaches.
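To make the model-level component concrete, here is a sketch of one angle-based interpolation rule. It follows the Model Stock-style ratio t = k cos θ / (1 + (k − 1) cos θ), where θ is the angle between the fine-tuned models' weight deltas measured from the pretrained anchor. Whether MDM uses this exact rule is not stated in the text above, so treat the formula, the function name, and the use of a single flattened weight vector (rather than per-layer weights) as assumptions.

```python
import numpy as np

def angular_mixed_teacher(anchor, finetuned):
    """Blend fine-tuned weights with the pretrained anchor by their angle.

    anchor:    1-D array of flattened pretrained weights.
    finetuned: list of at least two 1-D arrays, same shape as anchor.
    Returns (mixed_weights, t), where t in [0, 1] is the weight given to
    the average of the fine-tuned models.
    """
    k = len(finetuned)
    if k < 2:
        raise ValueError("need at least two fine-tuned models")
    deltas = [w - anchor for w in finetuned]
    # Mean pairwise cosine between deltas, i.e. directions away from the anchor.
    cosines = []
    for i in range(k):
        for j in range(i + 1, k):
            denom = np.linalg.norm(deltas[i]) * np.linalg.norm(deltas[j])
            cosines.append(float(deltas[i] @ deltas[j]) / max(denom, 1e-12))
    # Clip to [0, 1] so the interpolation ratio stays in [0, 1].
    cos_theta = float(np.clip(np.mean(cosines), 0.0, 1.0))
    t = k * cos_theta / (1.0 + (k - 1) * cos_theta)
    average = np.mean(finetuned, axis=0)
    return t * average + (1.0 - t) * anchor, t
```

Under this rule, fine-tuned models that moved in the same direction (cos θ near 1) are trusted and their average is used almost unchanged, while models that moved in unrelated directions (cos θ near 0) pull the teacher back toward the pretrained anchor.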
What would settle it
A clear falsifier would be a significant drop in image-text retrieval accuracy for models trained on MDM's synthetic datasets, relative to training on real data, when the evaluation uses a model architecture different from those used in distillation.
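Such a test would typically be scored with Recall@K. The sketch below computes it from paired embeddings under a simplifying assumption: query i has exactly one correct match, gallery item i. Real retrieval benchmarks often pair several captions with each image, so this is an illustration of the metric rather than a drop-in evaluation script. The function name is hypothetical.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose true match ranks in the top k by cosine similarity.

    Assumes a one-to-one pairing: query i matches gallery item i.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    idx = np.arange(len(q))
    true_scores = sims[idx, idx][:, None]
    # Rank of the true match = number of gallery items scoring strictly higher.
    ranks = (sims > true_scores).sum(axis=1)
    return float((ranks < k).mean())
```

Calling it once with image embeddings as queries and once with text embeddings as queries gives the image-to-text and text-to-image numbers that retrieval papers usually report side by side.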
Original abstract
Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computes and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
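The data-level step in the abstract can be sketched as follows. This version runs plain k-means on concatenated, normalized image and text embeddings and picks the real pair nearest each centroid as the starting point for one synthetic pair. The abstract does not say how the joint embedding is formed or how samples are drawn from each cluster, so the concatenation, the nearest-to-centroid rule, and all function names are assumptions.

```python
import numpy as np

def kmeans(x, n_clusters, n_iters=50, seed=0):
    """Plain Lloyd's k-means; returns the centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Squared distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            members = x[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids

def cluster_sample_indices(img_emb, txt_emb, n_synthetic, seed=0):
    """Indices of real pairs used to initialize n_synthetic synthetic pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Assumed joint representation: concatenation, renormalized to unit length.
    joint = np.concatenate([img, txt], axis=1)
    joint = joint / np.linalg.norm(joint, axis=1, keepdims=True)
    centroids = kmeans(joint, n_synthetic, seed=seed)
    d2 = ((joint[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=0)  # nearest real pair for each centroid
```

Initializing from cluster representatives rather than uniformly random pairs spreads the synthetic set across the modes of the real data before any optimization starts.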
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Multimodal Distribution Matching (MDM), a geometry-aware framework for vision-language dataset distillation. It initializes synthetic image-text pairs by sampling from clusters in the joint embedding space, forms a mixed teacher via weight-space interpolation of independently fine-tuned models according to angular deviation from the pretrained anchor, and applies a geometry-aware matching objective on the unit hypersphere that exploits cross-modal agreement/discrepancy directions together with symmetric contrastive learning. On image-text retrieval benchmarks with cross-architecture evaluation, the method is claimed to yield compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
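One way to read the loss-level description is as a kernel two-sample statistic computed on the sphere. The sketch below is an illustration under several assumptions that do not come from the paper: the agreement direction is taken as the normalized sum of image and text embeddings, the discrepancy direction as their normalized difference, the kernel as exp(−geodesic distance / bandwidth), and the matching term as a biased MMD² estimate. All names and the bandwidth default are hypothetical.

```python
import numpy as np

def unit(x, eps=1e-12):
    """Normalize rows to unit length (a zero row stays zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def agreement_discrepancy(img_emb, txt_emb):
    """Assumed definitions: normalized sum and normalized difference."""
    img, txt = unit(img_emb), unit(txt_emb)
    return unit(img + txt), unit(img - txt)

def geodesic_kernel(a, b, bandwidth=1.0):
    """exp(-arc length / bandwidth) between rows of a and rows of b."""
    cos = np.clip(a @ b.T, -1.0, 1.0)
    return np.exp(-np.arccos(cos) / bandwidth)

def spherical_mmd2(real, synth, bandwidth=1.0):
    """Biased MMD^2 estimate between two sets of points on the sphere."""
    k_rr = geodesic_kernel(real, real, bandwidth).mean()
    k_ss = geodesic_kernel(synth, synth, bandwidth).mean()
    k_rs = geodesic_kernel(real, synth, bandwidth).mean()
    return k_rr + k_ss - 2.0 * k_rs

def matching_loss(real_img, real_txt, syn_img, syn_txt, bandwidth=1.0):
    """Sum of the matching terms on agreement and discrepancy directions."""
    real_agree, real_disc = agreement_discrepancy(real_img, real_txt)
    syn_agree, syn_disc = agreement_discrepancy(syn_img, syn_txt)
    return (spherical_mmd2(real_agree, syn_agree, bandwidth)
            + spherical_mmd2(real_disc, syn_disc, bandwidth))
```

The loss is zero when the synthetic pairs equal the real pairs and grows as the two sets separate on the sphere. In a real distillation loop the synthetic embeddings would come from a differentiable encoder so that gradients reach the synthetic images and text.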
Significance. If the claims hold, MDM could meaningfully lower the compute barrier for VL dataset distillation while maintaining cross-modal fidelity, which would be useful for resource-constrained settings. The three-level integration (data, model, loss) is a coherent design choice, but the significance is tempered by the absence of any reported quantitative cost reductions, ablation results on the interpolation step, or direct comparisons showing superiority over prior multimodal distillation baselines.
major comments (2)
- [Model-level component (abstract and §3)] The claim that interpolating independently fine-tuned models by angular deviation from the pretrained anchor produces a mixed teacher whose joint vision-language representations remain aligned is load-bearing for the entire pipeline. Yet no derivation, correlation analysis, or ablation is provided showing that angular deviation in parameter space correlates with cross-modal agreement on retrieval metrics. Without this, the downstream geometry-aware matching on the hypersphere cannot be guaranteed to distill faithful pairs.
- [Evaluation section] The abstract asserts 'substantially reduce distillation cost' and 'remain robust across architectures,' but the provided text contains no tables, figures, or numerical results quantifying cost (e.g., GPU-hours or memory) or cross-architecture retrieval metrics (e.g., R@1 deltas). This makes it impossible to verify whether the central efficiency and robustness claims are supported.
minor comments (2)
- [Loss level (abstract and §4)] Notation for the geometry-aware objective (loss level) is introduced without an explicit equation or pseudocode, which hinders reproducibility.
- [Introduction / abstract] The abstract refers to 'prior methods often require heavy computes' without citing specific multimodal distillation baselines or their reported costs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of the model-level component and the supporting evaluation results.
Point-by-point responses
- Referee: Model-level component (abstract and §3 description): the claim that interpolating independently fine-tuned models by angular deviation from the pretrained anchor produces a mixed teacher whose joint vision-language representations remain aligned is load-bearing for the entire pipeline, yet no derivation, correlation analysis, or ablation is provided showing that angular deviation in parameter space correlates with cross-modal agreement on retrieval metrics. Without this, the downstream geometry-aware matching on the hypersphere cannot be guaranteed to distill faithful pairs.
  Authors: We agree that the manuscript currently lacks an explicit derivation or empirical analysis linking angular deviation in parameter space to cross-modal agreement on retrieval metrics. This is a valid observation. In the revised version we will add a dedicated subsection containing (i) a short geometric argument relating angular deviation to representation drift and (ii) a correlation study plus ablation that quantifies how the interpolation step affects downstream retrieval performance. These additions will directly support the load-bearing claim.
  Revision: yes
- Referee: Evaluation section: the abstract asserts 'substantially reduce distillation cost' and 'remain robust across architectures,' but the provided text contains no tables, figures, or numerical results quantifying cost (e.g., GPU-hours or memory) or cross-architecture retrieval metrics (e.g., R@1 deltas), making it impossible to verify whether the central efficiency and robustness claims are supported.
  Authors: The referee is correct that the submitted manuscript text does not contain the requested quantitative tables or figures for distillation cost (GPU-hours, memory) or cross-architecture R@1 deltas. We will insert new tables and figures reporting these metrics, including direct comparisons against prior multimodal distillation baselines, to substantiate the efficiency and robustness claims made in the abstract.
  Revision: yes
Circularity Check
No circularity; high-level method description contains no equations or self-referential reductions.
The provided abstract and method summary describe MDM via three complementary components (data-level cluster sampling, model-level angular interpolation of fine-tuned models, loss-level geometry-aware matching on the hypersphere) but supply no equations, no fitted parameters renamed as predictions, and no self-citations that bear the central claim. The interpolation step is presented as an independent modeling choice rather than a definitional tautology, and the overall framework is not shown to reduce to its inputs by construction. Absent any load-bearing derivation chain that collapses, the paper is self-contained at the level of description given.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "matches joint distributions on the unit hypersphere using a geometry-aware matching objective... geodesic kernel energies over cross-modal agreement and discrepancy directions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.