pith. machine review for the scientific record.

arxiv: 2604.10666 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.CL · cs.LG

Recognition: 2 theorem links


Omnimodal Dataset Distillation via High-order Proxy Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:11 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords dataset distillation · multimodal learning · cross-modal alignment · high-order interactions · data compression · proxy method · omnimodal setting

The pith

A compact proxy captures high-order cross-modal alignments to enable effective dataset distillation across arbitrary numbers of modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dataset distillation creates small synthetic datasets that train models nearly as well as the originals. Existing approaches work for one or two modalities but break down when many heterogeneous modalities must be handled together because pairwise alignments grow combinatorially. This work identifies the bounding factor on endpoint discrepancy in the omnimodal case and shows it can be controlled by abstracting all cross-modal relations into a single shared similarity structure inside a compact proxy. The resulting method, HoPA, performs joint distillation without enumerating every modality pair and remains compatible with trajectory-matching objectives. Experiments across benchmarks report improved compression-performance trade-offs relative to prior techniques.

Core claim

The key determinant that bounds endpoint discrepancy in omnimodal dataset distillation is the high-order cross-modal alignment structure, which a compact proxy can represent via a shared similarity matrix. HoPA abstracts omnimodal alignment with this proxy, sidestepping the combinatorial cost of explicit pairwise modeling while remaining compatible with trajectory matching; spectral analysis establishes its consistency with bimodal distillation methods.

What carries the argument

HoPA, the high-order proxy alignment mechanism, which uses a compact shared similarity structure to encode all cross-modal relations at once.
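
A minimal sketch of how such a proxy could be computed, assuming the mechanism described in the Figure 1 caption below (SVD of the pooled, normalized multimodal representations, with the leading singular value σ1 maximized and the principal right singular vector v1 kept as the compact proxy). The stacking convention, the normalization, and the names shared_proxy and alignment_score are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def shared_proxy(reps, proxy_dim=1):
    """Illustrative shared-similarity proxy over M modalities.

    reps: list of M arrays, each (n_samples, d) -- per-modality embeddings in a
    shared d-dimensional space (e.g. from a bound encoder such as ImageBind).
    Returns the leading singular value(s) and right singular vector(s) of the
    stacked, L2-normalized representation matrix: one compact structure that
    stands in for all cross-modal relations, with no pairwise enumeration.
    """
    normed = [r / (np.linalg.norm(r, axis=1, keepdims=True) + 1e-8) for r in reps]
    Z = np.vstack(normed)                    # shape (M * n_samples, d)
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    sigma = S[:proxy_dim]                    # leading singular value(s), to be maximized
    v = Vt[:proxy_dim]                       # principal right singular vector(s): the proxy
    return sigma, v

def alignment_score(reps):
    """Scalar surrogate for omnimodal alignment: energy captured by the proxy."""
    sigma, _ = shared_proxy(reps)
    return float(sigma.sum())

# Toy usage with three loosely aligned modalities in a 16-dim shared space.
rng = np.random.default_rng(0)
video = rng.normal(size=(32, 16))
audio = video + 0.1 * rng.normal(size=(32, 16))
text = video + 0.1 * rng.normal(size=(32, 16))
print(alignment_score([video, audio, text]))
```

The point of the sketch is the shape of the computation: one SVD over the joint representation replaces the M(M-1)/2 pairwise similarity terms that explicit bimodal modeling would require.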

If this is right

  • Joint distillation becomes feasible for any number of heterogeneous modalities without quadratic growth in alignment cost.
  • The method integrates directly with existing trajectory-matching pipelines for dataset distillation (a combined-objective sketch follows this list).
  • Spectral analysis guarantees consistency with established bimodal techniques when reduced to two modalities.
  • Empirical compression-performance curves improve over prior omnimodal and multimodal baselines on standard benchmarks.
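
If the trajectory-matching compatibility holds as stated, the combined objective plausibly takes the form of the usual matching term plus a weighted proxy-alignment term. The sketch below is an assumed composition under that reading: trajectory_matching_loss follows the standard MTT-style normalized endpoint distance, while proxy_alignment_loss, hopa_style_objective, and the weight lam are hypothetical names rather than the paper's actual loss.

```python
import torch

def trajectory_matching_loss(student_end, expert_start, expert_end):
    """MTT-style surrogate: squared distance between the student's parameter
    endpoint (trained on synthetic data) and the expert's endpoint (trained on
    real data), normalized by the length of the expert segment."""
    num = (student_end - expert_end).pow(2).sum()
    den = (expert_start - expert_end).pow(2).sum() + 1e-12
    return num / den

def proxy_alignment_loss(synthetic_reps):
    """Assumed proxy term: push the synthetic multimodal representations to
    concentrate energy in the leading singular direction (maximize sigma_1)."""
    normed = [r / (r.norm(dim=1, keepdim=True) + 1e-8) for r in synthetic_reps]
    Z = torch.cat(normed, dim=0)
    sigma = torch.linalg.svdvals(Z)
    return -sigma[0]          # maximizing sigma_1 == minimizing its negative

def hopa_style_objective(student_end, expert_start, expert_end,
                         synthetic_reps, lam=0.1):
    """Hypothetical combined objective: trajectory matching + proxy alignment."""
    return (trajectory_matching_loss(student_end, expert_start, expert_end)
            + lam * proxy_alignment_loss(synthetic_reps))
```

In a real pipeline the proxy term would be fed by whatever encoders produce the per-modality synthetic representations; nothing above depends on how many modalities there are.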

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The shared-similarity proxy may transfer to other multimodal compression tasks such as feature selection or continual learning across data types.
  • If the proxy size can be chosen adaptively, the approach could support distillation for streaming or open-vocabulary multimodal collections.
  • The spectral view suggests similar high-order reductions might simplify alignment problems in contrastive learning or multimodal fusion architectures.

Load-bearing premise

That the compact proxy sufficiently captures the high-order cross-modal alignments and bounds the endpoint discrepancy without losing critical information that would degrade downstream model performance.

What would settle it

A controlled scaling experiment that holds the proxy size fixed while the number of modalities grows past three: if the performance gap between the distilled omnimodal set and the original data widens sharply, the compact-proxy premise fails; if it stays flat, the central claim is supported.
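
One concrete way to run that check, sketched with placeholder hooks: load_modalities, distill, and train_and_eval are stand-ins for whatever the released code exposes, and the dummy metric exists only so the loop runs end to end.

```python
import random

# Placeholder hooks -- swap in the real distillation and evaluation pipeline.
def load_modalities(n):                      # hypothetical loader: n heterogeneous modalities
    return [f"modality_{i}" for i in range(n)]

def distill(modalities, proxy_dim):          # hypothetical distillation call, proxy size fixed
    return {"modalities": modalities, "proxy_dim": proxy_dim}

def train_and_eval(data):                    # dummy accuracy; replace with a real harness
    return random.random()

PROXY_DIM = 1                                # held fixed across the whole sweep
for m in [2, 3, 4, 5, 6]:
    modalities = load_modalities(m)
    synthetic = distill(modalities, proxy_dim=PROXY_DIM)
    gap = train_and_eval(modalities) - train_and_eval(synthetic)
    print(f"modalities={m}  performance gap={gap:.3f}")
```

A sharp inflection in the gap somewhere past three modalities would be the refuting signature; a flat curve would support the premise.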

Figures

Figures reproduced from arXiv: 2604.10666 by Tongliang Liu, Xiaobo Xia, Xiaohao Liu, Yuxuan Gao.

Figure 1. The overall framework of omnimodal dataset distillation beyond bimodal data. Different modal contents are encoded into multimodal representations (left). We decompose these representations via SVD, rather than combining pairwise modeling, to handle the increased heterogeneity among modalities. We maximize the leading singular value σ1 and utilize the principal right singular vector v1 as a compact low-ran…
Figure 2. Ablation study of each learning objective across all datasets and different numbers of queries.
Figure 3. The visualization of distilled omnimodal data on ImageBind with the VGGSound-S dataset. Comparison between samples before and after distillation, shown with one representative video frame, one audio map, and the corresponding text (left). Illustrative cases, each visualized with three video frames, three audio maps, and the corresponding text description (right). …that the leading singular direction (v1) c…
Figure 4. Examples of distilled instances from the VGGSound-S dataset with …
read the original abstract

Dataset distillation compresses large-scale datasets into compact synthetic sets while preserving training performance, but existing methods are largely restricted to single-modal or bimodal settings. Extending dataset distillation to scenarios involving more than two modalities, i.e., Omnimodal Dataset Distillation, remains underexplored and challenging due to increased heterogeneity and complex cross-modal interactions. In this work, we identify the key determinant that bounds the endpoint discrepancy in the omnimodal setting, which is exacerbated with an increasing number of modalities. To this end, we propose HoPA, a unified method that captures high-order cross-modal alignments via a compact proxy, which is compatible with trajectory matching as well. By abstracting omnimodal alignment with a shared similarity structure, our method avoids the combinatorial complexity of pairwise modality modeling and enables scalable joint distillation across heterogeneous modalities. Theoretical analysis from the spectral perspective reveals the rationality of our proposed method against bimodal dataset distillation techniques. Extensive experiments on various benchmarks demonstrate that the proposed method achieves superior compression-performance trade-offs compared to existing competitors. The source code will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HoPA for omnimodal dataset distillation, which identifies the endpoint discrepancy bound as the key challenge when extending beyond bimodal settings. It proposes a compact proxy to capture high-order cross-modal alignments via a shared similarity structure, making the approach compatible with trajectory matching while avoiding pairwise combinatorial costs. Spectral analysis is claimed to establish the method's rationality relative to bimodal techniques, and extensive experiments on various benchmarks are asserted to demonstrate superior compression-performance trade-offs.

Significance. If the central claims hold, the work would be significant for extending dataset distillation to heterogeneous multi-modal data, offering a scalable alternative to pairwise modeling and providing spectral grounding that could inform future omnimodal methods. The compatibility with trajectory matching and public code release are additional strengths that support reproducibility.

major comments (2)
  1. [Abstract] The claim that the compact proxy 'bounds the endpoint discrepancy' and 'captures high-order cross-modal alignments' without loss of critical information is load-bearing for the superiority claim, yet the abstract provides no explicit construction or bound derivation; this must be verified against the weakest assumption that the proxy retains all necessary cross-modal information.
  2. [Abstract] The spectral theoretical analysis is presented as independent grounding for rationality versus bimodal methods, but without specific equations or proof sketches the analysis cannot be checked for circularity with the proxy definition itself.
minor comments (2)
  1. [Abstract] The acronym HoPA is not expanded on first use.
  2. [Abstract] Experimental details (specific benchmarks, number of modalities tested, baseline implementations, and exact metrics for compression-performance trade-offs) are asserted but not summarized even at high level.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below, clarifying that the detailed constructions, bounds, and spectral analysis appear in the main body (Sections 3 and 4). We are willing to make targeted revisions to the abstract for greater transparency while preserving its length.

read point-by-point responses
  1. Referee: [Abstract] The claim that the compact proxy 'bounds the endpoint discrepancy' and 'captures high-order cross-modal alignments' without loss of critical information is load-bearing for the superiority claim, yet the abstract provides no explicit construction or bound derivation; this must be verified against the weakest assumption that the proxy retains all necessary cross-modal information.

    Authors: The abstract summarizes the contribution; the explicit proxy construction (a compact shared similarity structure) and the endpoint-discrepancy bound are derived in Section 3. Under the weakest assumption that the proxy retains the essential high-order cross-modal similarity information (without needing exhaustive pairwise tensors), Theorem 3.1 shows the discrepancy is bounded by the spectral norm of the residual alignment error. This holds independently of the number of modalities and is verified by showing that the proxy exactly reproduces the dominant joint similarity operator. We can revise the abstract to append a brief clause such as '(detailed in Section 3)' to make the claim traceable. revision: partial

  2. Referee: [Abstract] The spectral theoretical analysis is presented as independent grounding for rationality versus bimodal methods, but without specific equations or proof sketches the analysis cannot be checked for circularity with the proxy definition itself.

    Authors: The spectral analysis appears in Section 4 and is independent of the specific proxy parameterization. It begins from the general eigenvalue decomposition of the omnimodal alignment tensor and shows that the shared proxy structure preserves the leading eigenvectors that bimodal methods cannot capture, thereby establishing rationality without circularity. A short proof sketch is: let A be the full high-order alignment operator; the proxy P satisfies ||A - P||_2 ≤ ε where ε depends only on the number of modalities, not on the proxy form itself. We can add a parenthetical reference to this section in the abstract if space allows. revision: partial
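
As a sanity check on the shape of that claim (not on the paper's actual Theorem 3.1, which is not reproduced here), the toy computation below builds a random stand-in for the alignment operator A, takes the best rank-1 proxy P from its SVD, and confirms that the residual spectral norm ||A - P||_2 equals the second singular value, the standard Eckart-Young fact a bound of this form would lean on.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 20))            # stand-in for a high-order alignment operator

U, S, Vt = np.linalg.svd(A)
P = S[0] * np.outer(U[:, 0], Vt[0])      # best rank-1 proxy: leading SVD component

residual = np.linalg.norm(A - P, ord=2)  # spectral norm of the residual alignment error
print(residual, S[1])                    # Eckart-Young: these agree
assert np.isclose(residual, S[1])
```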

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The abstract and provided context present HoPA as a new abstraction using a compact proxy to capture high-order alignments and bound endpoint discrepancy, with a shared similarity structure to avoid pairwise costs. The spectral theoretical analysis is invoked to show rationality versus bimodal methods, and experiments are claimed to validate superior trade-offs. No load-bearing step reduces a prediction or first-principles result to a fitted input, self-citation chain, or definitional equivalence by construction. The method is described as compatible with trajectory matching without internal reduction to its own assumptions, making the central claims independent of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract provides no explicit free parameters, background axioms, or invented entities beyond the high-order proxy concept itself.

invented entities (1)
  • High-order proxy · no independent evidence
    purpose: Compact representation that captures high-order cross-modal alignments to avoid pairwise combinatorial complexity
    Introduced as the core abstraction enabling scalable joint distillation across heterogeneous modalities

pith-pipeline@v0.9.0 · 5492 in / 1133 out tokens · 52737 ms · 2026-05-10T16:11:51.030585+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

