arxiv: 2606.29464 · v1 · pith:2SJPXOM6new · submitted 2026-06-28 · 💻 cs.CV · cs.AI

Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation

Pith reviewed 2026-06-30 07:19 UTC · model grok-4.3

keywords vision-language dataset distillation · hyperbolic embeddings · rank-aware alignment · cross-modal retrieval · contrastive learning · multimodal distillation · asymmetric objectives

The pith

Rank-aware hyperbolic alignment separates shared image-text semantics from modality-private residuals to improve vision-language dataset distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that full-dimensional Euclidean alignment wastes capacity on weakly correlated variations because image-text correlations are rank-deficient. RAHA lifts representations into hyperbolic space and applies asymmetric geodesic objectives to align only the dominant shared range while regularizing the residual subspace. This produces synthetic pairs that train contrastive models more efficiently under tight data and compute limits. A sympathetic reader would expect competitive retrieval accuracy plus stronger transfer to downstream tasks compared with Euclidean or low-rank baselines. The central mechanism is explicit control of alignment capacity through hyperbolic geometry rather than post-hoc factorization.

Core claim

RAHA lifts multimodal representations to hyperbolic space and optimizes distilled pairs with asymmetric objectives that enforce geodesic alignment in the shared range while regularizing the residual subspace to preserve modality-private diversity and improve transfer robustness.
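
To make the lifting step concrete, here is a minimal NumPy sketch of mapping Euclidean embeddings into hyperbolic space and measuring geodesic distance between them. The abstract does not say which hyperbolic model or curvature the paper uses, so the Poincaré ball, the curvature parameter `c`, and the function names `expmap0` and `poincare_dist` are all assumptions made for illustration. The asymmetric objectives themselves are not reproduced here.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-7):
    """Exponential map at the origin of the Poincare ball with curvature -c.

    Assumed model for illustration; the paper may use a different one.
    """
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_dist(x, y, c=1.0, eps=1e-7):
    """Geodesic distance between points inside the Poincare ball."""
    sq_diff = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x ** 2, axis=-1)) * (1.0 - c * np.sum(y ** 2, axis=-1))
    arg = 1.0 + 2.0 * c * sq_diff / np.maximum(denom, eps)
    return np.arccosh(arg) / np.sqrt(c)

# Toy usage: lift hypothetical image/text embeddings and compare all pairs.
rng = np.random.default_rng(0)
img_emb = 0.1 * rng.normal(size=(4, 16))
txt_emb = img_emb + 0.01 * rng.normal(size=(4, 16))
img_h, txt_h = expmap0(img_emb), expmap0(txt_emb)
dist = poincare_dist(img_h[:, None, :], txt_h[None, :, :])  # shape (4, 4)
print(dist.shape)
```

A geodesic contrastive loss would typically use the negated distance matrix as logits; whether RAHA does exactly this is not stated in the abstract.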

What carries the argument

rank-aware hyperbolic alignment (RAHA), which uses hierarchical hyperbolic geometry together with asymmetric geodesic objectives to enforce alignment only in the dominant shared subspace

If this is right

  • Synthetic pairs distilled with RAHA achieve competitive cross-modal retrieval under fixed budgets.
  • Transfer performance on downstream tasks improves relative to Euclidean and low-rank factorization methods.
  • Modality-private diversity is preserved in the residual subspace, reducing overfitting to shared semantics.
  • Contrastive vision-language models can be trained more robustly with smaller synthetic datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hyperbolic capacity-control idea could be tested on other multimodal tasks where one modality carries hierarchical structure.
  • If the rank deficiency assumption holds across datasets, the method might reduce the number of required synthetic pairs further without loss of performance.
  • Combining RAHA with trajectory-matching distillation techniques could compound the efficiency gains.
  • The approach implies that geometry choice matters more than raw dimensionality reduction when alignment capacity must be explicitly budgeted.

Load-bearing premise

Image-text correlation is rank-deficient, with shared semantics concentrated in a low-dimensional range that hyperbolic lifting and asymmetric objectives can isolate and control more effectively than Euclidean or low-rank baselines.
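
The premise can be probed empirically. The sketch below estimates how many directions of the image-text cross-covariance are needed to reach an energy threshold ρ. It assumes paired embedding matrices `img` and `txt` (hypothetical names) and defines energy through squared singular values, which is one common convention rather than something the abstract specifies. The default of 0.95 mirrors the threshold mentioned under Figure 3; how the paper itself computes the shared range may differ.

```python
import numpy as np

def shared_rank(img, txt, rho=0.95):
    """Estimate the dominant shared range of paired embeddings.

    img: (n, d_img) array, txt: (n, d_txt) array, rows are matched pairs.
    Returns the rank r, the top-r image and text directions, and all
    singular values of the cross-covariance.
    """
    img_c = img - img.mean(axis=0)
    txt_c = txt - txt.mean(axis=0)
    cross_cov = img_c.T @ txt_c / (len(img) - 1)
    U, s, Vt = np.linalg.svd(cross_cov, full_matrices=False)
    # Cumulative fraction of squared singular values (assumed energy measure).
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = min(int(np.searchsorted(energy, rho)) + 1, len(s))
    return r, U[:, :r], Vt[:r].T, s

# Toy usage: both modalities are driven by the same 2-D latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
img = latent @ rng.normal(size=(2, 8))
txt = latent @ rng.normal(size=(2, 6))
r, U_r, V_r, s = shared_rank(img, txt)
print(r, np.round(s, 3))
```

If the singular values decay slowly on real embeddings, r approaches the full dimension and the rank-deficiency premise would be weakened.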

What would settle it

Demonstrating that a Euclidean low-rank baseline or full-dimensional alignment matches or exceeds RAHA on cross-modal retrieval and transfer metrics under identical budgets and architectures would falsify the claimed advantage.
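
One way such a comparison might be scored is Recall@K. The sketch below assumes a square similarity matrix in which query i's correct match is gallery item i. This one-to-one pairing is a simplification (retrieval benchmarks often pair each image with several captions), and `recall_at_k` is an illustrative helper, not the paper's evaluation code.

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose true match ranks in the top k.

    sim[i, j] is the similarity between query i and gallery item j;
    the ground-truth match for query i is assumed to be item i.
    """
    true_scores = np.diag(sim)[:, None]
    # Rank of the true match = number of items scored strictly higher.
    ranks = (sim > true_scores).sum(axis=1)
    return float((ranks < k).mean())

# Toy usage with a hand-written 3x3 similarity matrix.
sim = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.3, 0.2],
])
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```

Running the same function on similarity matrices produced by models trained on each method's distilled set, with identical budgets and architectures, gives the kind of like-for-like comparison described above.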

Figures

Figures reproduced from arXiv: 2606.29464 by Jongoh Jeong, Kuk-Jin Yoon, Sun-Kyung Lee.

Figure 1: Qualitative synthesized pairs. Representative samples at initialization (left), after CovMatch (middle), and after RAHA distillation (right). Please zoom in for details and view in color.
… that relevance distillation benefits from additional synthetic capacity on more diverse datasets. See comparison with EDGE [81] in Appendix. Relative to the strongest distribution/statistics matching baseline, CovMatch […
Figure 2: Ablation study for Flickr8k N=100 setting with each component added, demonstrating the synergy of the two subspace losses.
5 Auxiliary Discussion. Ablation study. Using only hyperbolic contrast L_hITC provides a strong baseline, confirming that geodesic InfoNCE on synthetic pairs already yields retrieval-relevant alignment (…
Figure 3: (b) shows that ρ=0.95 is the best energy threshold, while 1.0 degrades by absorbing the low-energy residual tail.

Original abstract

Vision-language dataset distillation (VLDD) compresses a large image-text paired dataset into a small set of synthetic pairs that can efficiently train contrastive vision-language models under strict data and compute budgets. Most existing methods match expert trajectories or cross-modal statistics, yet still enforce full-dimensional alignment in a Euclidean embedding space. This is often overly restrictive due to rank-deficient image-text correlation, with shared semantics concentrated in a low-dimensional range and remaining variation spread across a weakly correlated residual subspace. LoRS relaxes alignment at the similarity level by low-rank factorization, but does not explicitly control dominant alignment capacity and structure in the representation space. We thus propose a rank-aware hyperbolic alignment (RAHA) that combines hierarchical geometry with explicit alignment-capacity control. RAHA lifts multimodal representations to hyperbolic space and optimizes distilled pairs with asymmetric objectives that enforce geodesic alignment in the shared range while regularizing the residual subspace to preserve modality-private diversity and improve transfer robustness. Experiments on benchmarks show that RAHA demonstrates competitive cross-modal retrieval and improved transfer indicators under fixed budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Rank-Aware Hyperbolic Alignment (RAHA) for vision-language dataset distillation (VLDD). It argues that image-text correlations are rank-deficient, with shared semantics in a low-dimensional range and residual variation weakly correlated. Existing methods enforce full-dimensional Euclidean alignment or use low-rank factorization (LoRS) without explicit capacity control. RAHA lifts representations to hyperbolic space and optimizes distilled pairs via asymmetric geodesic objectives that enforce alignment in the shared range while regularizing the residual subspace for modality-private diversity. Experiments claim competitive cross-modal retrieval and improved transfer indicators under fixed budgets.

Significance. If the empirical results and derivations hold, the work offers a geometrically motivated way to relax over-constrained alignment in VLDD while preserving transfer robustness. The combination of hyperbolic lifting with explicit rank-aware regularization could influence dataset distillation and cross-modal representation learning by providing a principled alternative to Euclidean or low-rank baselines, particularly under strict data budgets.

major comments (2)
  1. [Abstract] The central premise that 'image-text correlation is rank-deficient with shared semantics concentrated in a low-dimensional range' is stated without supporting analysis or citation to prior rank analyses of vision-language embeddings. This assumption is load-bearing for the motivation of hyperbolic lifting and asymmetric objectives, yet no evidence or derivation is visible to substantiate it.
  2. [Abstract] The claim that RAHA 'demonstrates competitive cross-modal retrieval and improved transfer indicators' is presented without reference to specific baselines, metrics, datasets, or quantitative deltas. Without the experimental section, it is impossible to assess whether the hyperbolic components deliver gains beyond what Euclidean low-rank methods already achieve.
minor comments (2)
  1. [Abstract] The term 'asymmetric objectives' is introduced without a brief definition or contrast to symmetric contrastive losses; a one-sentence clarification would improve readability.
  2. [Abstract] 'Hierarchical geometry' is invoked but not linked to any concrete hyperbolic model (e.g., Poincaré ball, Lorentz model) or curvature parameter; specifying the model would help readers anticipate the technical approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address the two major points on the abstract below.

Point-by-point responses
  1. Referee: [Abstract] The central premise that 'image-text correlation is rank-deficient with shared semantics concentrated in a low-dimensional range' is stated without supporting analysis or citation to prior rank analyses of vision-language embeddings. This assumption is load-bearing for the motivation of hyperbolic lifting and asymmetric objectives, yet no evidence or derivation is visible to substantiate it.

    Authors: The abstract states the premise concisely as motivation. The full manuscript contains an SVD-based rank analysis of cross-modal similarity matrices in Section 3.1 demonstrating rapid singular-value decay. We will add a citation to prior rank analyses of VL embeddings and a one-sentence reference to this analysis in the revised abstract. revision: yes

  2. Referee: [Abstract] The claim that RAHA 'demonstrates competitive cross-modal retrieval and improved transfer indicators' is presented without reference to specific baselines, metrics, datasets, or quantitative deltas. Without the experimental section, it is impossible to assess whether the hyperbolic components deliver gains beyond what Euclidean low-rank methods already achieve.

    Authors: Abstracts are space-constrained summaries. Section 4 and Tables 2–4 report the full comparisons (baselines: Euclidean full-alignment and LoRS; metrics: Recall@K and transfer accuracy; datasets: COCO, Flickr30K) with quantitative deltas. We will revise the abstract to name the primary datasets and one key improvement figure. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The abstract and description introduce RAHA as a new method combining hyperbolic geometry with asymmetric geodesic objectives for rank-aware alignment in VLDD. No equations, fitting procedures, self-citations, or derivation steps are visible that would reduce any claimed prediction or result to its own inputs by construction. The central premise (rank-deficient correlation addressed via hyperbolic lifting) is presented as a technical choice rather than derived from prior self-referential results. This matches the expectation for a score of 0 when the provided text is self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5711 in / 975 out tokens · 37338 ms · 2026-06-30T07:19:13.427563+00:00 · methodology

Reference graph

Works this paper leans on

83 extracted references · 25 canonical work pages · 8 internal anchors

  1. LLaVA-CC3M-Pretrain-595K dataset. Hugging Face Datasets (2023), https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K
  2. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  3. Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35, 23716–23736 (2022)
  4. Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
  5. Birhane, A., Prabhu, V., Han, S., Boddeti, V.N., Luccioni, A.S.: Into the LAIONs den: Investigating hate in multimodal datasets. In: Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track (2023)
  6. Bottou, L.: Stochastic gradient descent tricks. In: Neural networks: tricks of the trade: second edition, pp. 421–436. Springer (2012)
  7. Brock, A., De, S., Smith, S.L., Simonyan, K.: High-performance large-scale image recognition without normalization. In: International conference on machine learning. pp. 1059–1071. PMLR (2021)
  8. Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: COYO-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset (2022)
  9. Carlini, N., Jagielski, M., Choquette-Choo, C.A., Paleka, D., Pearce, W., Anderson, H., Terzis, A., Thomas, K., Tramèr, F.: Poisoning web-scale training datasets is practical. In: IEEE Symposium on Security and Privacy (SP) (2024)
  10. Cazenavette, G., Wang, T., Torralba, A., Efros, A.A., Zhu, J.Y.: Dataset distillation by matching training trajectories. In: CVPR (2022)
  11. Cazenavette, G., Wang, T., Torralba, A., Efros, A.A., Zhu, J.Y.: Generalizing dataset distillation via deep generative prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3739–3748 (2023)
  12. Cui, J., Wang, R., Si, S., Hsieh, C.J.: DC-BENCH: Dataset condensation benchmark. Advances in Neural Information Processing Systems 35, 810–822 (2022)
  13. Cui, J., Wang, R., Si, S., Hsieh, C.J.: Scaling up dataset distillation to ImageNet-1K with constant memory. In: International Conference on Machine Learning. pp. 6565–6590. PMLR (2023)
  14. Cui, X., Qin, Y., Zhou, W., Li, H., Li, H.: Optical: Leveraging optimal transport for contribution allocation in dataset distillation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15245–15254 (2025)
  15. Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems 26 (2013)
  16. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  17. Desai, K., Nickel, M., Rajpurohit, T., Johnson, J., Vedantam, R.: Hyperbolic image-text representations. In: ICML (2023)
  18. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)
  19. Du, J., Jiang, Y., Tan, V.Y., Zhou, J.T., Li, H.: Minimizing the accumulated trajectory error to improve dataset distillation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3749–3758 (2023)
  20. Farahani, R.Z., Hekmatfar, M.: Facility location: concepts, models, algorithms and case studies. Springer Science & Business Media (2009)
  21. Ganea, O., Bécigneul, G., Hofmann, T.: Hyperbolic neural networks. Advances in neural information processing systems 31 (2018)
  22. Guo, C., Rana, M., Cisse, M., Van Der Maaten, L.: Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117 (2017)
  23. Guo, Z., Wang, K., Cazenavette, G., Li, H., Zhang, K., You, Y.: Towards lossless dataset distillation via difficulty-aligned trajectory matching. arXiv preprint arXiv:2310.05773 (2023)
  24. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, 853–899 (2013)
  25. Jeong, J., Kwon, H., Kim, M., Yoon, K.J.: Multimodal distribution matching for vision-language dataset distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)
  26. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3128–3137 (2015)
  27. Kim, J.H., Kim, J., Oh, S.J., Yun, S., Song, H., Jeong, J., Ha, J.W., Song, H.O.: Dataset condensation via efficient synthetic-data parameterization. In: ICML (2022)
  28. Kim, W., Chun, S., Kim, T., Han, D., Yun, S.: HYPE: Hyperbolic entailment filtering for underspecified images and texts. In: European Conference on Computer Vision (ECCV) (2024). https://doi.org/10.48550/arXiv.2404.17507
  29. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: Proceedings of the IEEE international conference on computer vision workshops. pp. 554–561 (2013)
  30. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
  31. Lee, H.B., Lee, D.B., Hwang, S.J.: Dataset condensation with latent space knowledge factorization and sharing. arXiv preprint arXiv:2208.10494 (2022)
  32. Lee, Y., Chung, H.W.: CovMatch: Cross-covariance guided multimodal dataset distillation with trainable text encoder. arXiv preprint arXiv:2510.18583 (2025)
  33. Lei, S., Tao, D.: A comprehensive survey of dataset distillation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(1), 17–32 (2023)
  34. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
  35. Li, W., Li, G., Maeda, K., Ogawa, T., Haseyama, M.: Hyperbolic dataset distillation. In: Advances in Neural Information Processing Systems (NeurIPS) (2025), poster
  36. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
  37. Liu, D., Gu, J., Cao, H., Trinitis, C., Schulz, M.: Dataset distillation by automatic training trajectories. In: European Conference on Computer Vision. pp. 334–351. Springer (2024)
  38. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36, 34892–34916 (2023)
  39. Liu, H., Li, Y., Xing, T., Wang, P., Dalal, V., Li, L., He, J., Wang, H.: Dataset distillation via the wasserstein metric. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1205–1215 (2025)
  40. Liu, P., Du, J.: The evolution of dataset distillation: Toward scalable and generalizable solutions. arXiv preprint arXiv:2502.05673 (2025)
  41. Liu, S., Wang, K., Yang, X., Ye, J., Wang, X.: Dataset distillation via factorization. In: NeurIPS (2022)
  42. Longpre, S., Mahari, R., Chen, A., Obeng-Marnu, N., Sileo, D., Brannon, W., Muennighoff, N., Khazam, N., Kabbara, J., Perisetla, K., et al.: The data provenance initiative: A large scale audit of dataset licensing & attribution in AI. arXiv preprint arXiv:2310.16787 (2023)
  43. Longpre, S., Mahari, R., Lee, A., Lund, C., Oderinwale, H., Brannon, W., Saxena, N., Obeng-Marnu, N., South, T., Hunter, C., Klyman, K., et al.: Consent in crisis: The rapid decline of the AI data commons. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
  44. Loo, N., Hasani, R., Amini, A., Rus, D.: Efficient dataset distillation using random feature approximation. In: NeurIPS (2022)
  45. Loo, N., Hasani, R., Lechner, M., Rus, D.: Dataset distillation with convexified implicit gradients. arXiv preprint arXiv:2302.06755 (2023)
  46. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
  47. Nguyen, T., Chen, Z., Lee, J.: Dataset meta-learning from kernel ridge-regression. arXiv preprint arXiv:2011.00050 (2020)
  48. Nguyen, T., Novak, R., Xiao, L., Lee, J.: Dataset distillation with infinitely wide convolutional networks. In: NeurIPS (2021)
  49. Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. arXiv preprint arXiv:1705.08039 (2017)
  50. Nickel, M., Kiela, D.: Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In: Proceedings of the 35th International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research (PMLR), vol. 80, pp. 3779–3788 (2018)
  51. Pal, A., van Spengler, M., D’Amely di Melendugno, G.M., Flaborea, A., Galasso, F., Mettes, P.: Compositional entailment learning for hyperbolic vision-language models. In: International Conference on Learning Representations (ICLR) (2025), oral
  52. Peng, W., Varanka, T., Mostafa, A., Shi, H., Zhao, G.: Hyperbolic deep neural networks: A survey. arXiv preprint arXiv:2101.04562 (2021)
  53. Poppi, T., Kasarla, T., Mettes, P., Baraldi, L., Cucchiara, R.: Hyperbolic safety-aware vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025). https://doi.org/10.48550/arXiv.2503.12127
  54. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  55. Ramasinghe, S., Shevchenko, V., Avraham, G., Thalaiyasingam, A.: Accept the modality gap: An exploration in the hyperbolic space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 27263–27272 (June 2024)
  56. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  57. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
  58. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)
  59. Shang, Y., Yuan, Z., Yan, Y.: MIM4DD: Mutual information maximization for dataset distillation. arXiv preprint arXiv:2312.16627 (2023)
  60. Su, D., Hou, J., Gao, W., Tian, Y., Tang, B.: D^4M: Dataset distillation via disentangled diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5809–5818 (2024)
  61. Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al.: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)
  62. Thiel, D.: Identifying and eliminating CSAM in generative ML training data and models. Tech. rep., Stanford Internet Observatory (2023). https://doi.org/10.25740/kh752sm9123, https://purl.stanford.edu/kh752sm9123
  63. Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: YFCC100M: The new data in multimedia research. Communications of the ACM 59(2), 64–73 (2016)
  64. Toneva, M., Sordoni, A., Combes, R.T.d., Trischler, A., Bengio, Y., Gordon, G.J.: An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159 (2018)
  65. Wang, H., Zhao, Z., Wu, J., Shang, Y., Liu, G., Yan, Y.: Cao2: Rectifying inconsistencies in diffusion-based dataset distillation (2025), https://arxiv.org/abs/2506.22637
  66. Wang, K., Zhao, B., Peng, X., Zhu, Z., Yang, S., Wang, S., Huang, G., Bilen, H., Wang, X., You, Y.: CAFE: Learning to condense dataset by aligning features. In: CVPR (2022)
  67. Wang, T., Zhu, J.Y., Torralba, A., Efros, A.A.: Dataset distillation. arXiv preprint arXiv:1811.10959 (2018)
  68. Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD birds 200 (2010)
  69. Welling, M.: Herding dynamical weights to learn. In: Proceedings of the 26th Annual International Conference on Machine Learning. pp. 1121–1128 (2009)
  70. Wu, X., Zhang, B., Deng, Z., Russakovsky, O.: Vision-language dataset distillation (2024), https://openreview.net/forum?id=2y8XnaIiB8, TMLR 2024
  71. Xu, W., Evans, D., Qi, Y.: Feature squeezing: Detecting adversarial examples in deep neural networks. In: NDSS (2018). https://doi.org/10.14722/ndss.2018.23295, https://www.ndss-symposium.org/ndss-paper/feature-squeezing-detecting-adversarial-examples-in-deep-neural-networks/
  72. Xu, Y., Lin, Z., Qiu, Y., Lu, C., Li, Y.L.: Low-rank similarity mining for multimodal dataset distillation. In: Proceedings of the 41st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 235, pp. 55144–55161. PMLR (2024), https://proceedings.mlr.press/v235/xu24q.html
  73. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the association for computational linguistics 2, 67–78 (2014)
  74. Yu, R., Liu, S., Wang, X.: Dataset distillation: A comprehensive review. IEEE transactions on pattern analysis and machine intelligence 46(1), 150–170 (2023)
  75. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)
  76. Zhang, X., Zhang, Z., Du, J., Liu, Z., Zhou, J.T.: Beyond modality collapse: Representations blending for multimodal dataset distillation. arXiv preprint arXiv:2505.14705 (2025)
  77. Zhao, B., Bilen, H.: Dataset condensation with differentiable siamese augmentation. In: ICML (2021)
  78. Zhao, B., Bilen, H.: Dataset condensation with distribution matching. In: WACV (2023)
  79. Zhao, B., Mopuri, K.R., Bilen, H.: Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929 (2020)
  80. Zhao, G., Li, G., Qin, Y., Yu, Y.: Improved distribution matching for dataset condensation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7856–7865 (2023)

Showing first 80 references.