
arxiv: 2605.22098 · v1 · pith:Y6F4J4KGnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI · cs.LG

TextTeacher: What Can Language Teach About Images?

Pith reviewed 2026-05-22 07:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords TextTeacher · vision-language · image classification · knowledge distillation · ImageNet · ViT · auxiliary objective · semantic anchors

The pith

TextTeacher uses image captions and a frozen text encoder to improve vision model accuracy on ImageNet by up to 2.7 percentage points without altering the inference model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether semantic knowledge from language models can efficiently guide vision models by leveraging the platonic representation hypothesis. It introduces TextTeacher as an auxiliary objective that injects projected text embeddings from readily available captions into image classification training. This matters to a sympathetic reader because it promises higher accuracy and better transfer with negligible added cost and no change to the final deployed vision model. The approach avoids costly multimodal training of the target model itself while supplying complementary semantic cues that precondition deeper layers early in training.

Core claim

TextTeacher is a simple auxiliary objective that injects text embeddings as additional information into image classification training. It uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged.

What carries the argument

The TextTeacher auxiliary objective projects frozen text-encoder outputs into the vision feature space, where they act as semantic anchors and precondition deeper layers.
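
One way to picture the objective is as a standard classification loss plus a weighted alignment term. The sketch below is an illustration under stated assumptions, not the authors' implementation: the cosine-distance alignment term, the plain linear projection, and the names text_teacher_loss, project, and lam are all hypothetical, since the paper's exact loss and projection are not given on this page. The weight lam plays the role of the λ swept in Figure 2.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one sample, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def project(text_emb, weight):
    """Linear projection (assumed form); weight has shape (d_vision, d_text)."""
    return [sum(w * t for w, t in zip(row, text_emb)) for row in weight]


def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity; eps guards against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def text_teacher_loss(logits, label, image_feat, text_emb, weight, lam=0.5):
    """Classification loss plus lam times an alignment term (assumed: cosine distance).

    text_emb comes from a frozen text encoder, so only `weight` and the
    vision model would receive gradients in a real training loop.
    """
    ce = softmax_cross_entropy(logits, label)
    anchor = project(text_emb, weight)  # the "semantic anchor" in vision space
    return ce + lam * cosine_distance(image_feat, anchor)
```

Because the text branch enters only through the loss, dropping it after training leaves the vision model as it was, which is consistent with the claim that the inference-time model is unchanged.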

If this is right

  • Yields up to 2.7 percentage point accuracy gains on ImageNet with standard ViT backbones.
  • Produces average 1.0 percentage point gains on transfer tasks under the same training recipe.
  • Outperforms vision-only knowledge distillation at constant compute or achieves similar accuracy 33 percent faster.
  • Shapes deeper layers in the first stages of training to aid generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests vision models can gain from abundant text data without requiring joint multimodal pretraining of the vision backbone.
  • The preconditioning effect could extend to other vision tasks such as object detection if similar caption-based anchors are used.
  • It raises the possibility that future scaling of vision models might benefit more from simple text guidance than from increased vision-only data alone.

Load-bearing premise

Readily available image captions supply complementary semantic cues that a frozen text encoder and lightweight projection can turn into useful guidance for vision features.

What would settle it

Training the same ViT backbone on ImageNet with and without the TextTeacher auxiliary loss and finding no accuracy gain or transfer improvement under matched compute and recipe.
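
A matched comparison of this kind can be summarized per seed. The following is a minimal sketch, assuming each configuration is trained several times with the same seeds and recipe; the function name paired_gain and the choice of a paired mean difference with its standard error are illustrative assumptions, not the paper's evaluation protocol.

```python
import math
import statistics


def paired_gain(baseline_acc, method_acc):
    """Mean per-seed accuracy difference (method - baseline) and its standard error.

    Both lists hold accuracies from runs matched by seed and training recipe.
    """
    if len(baseline_acc) != len(method_acc) or len(baseline_acc) < 2:
        raise ValueError("need at least two matched runs")
    diffs = [m - b for b, m in zip(baseline_acc, method_acc)]
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, sem
```

As a rough reading, a mean gain that is small relative to its standard error would count as no detectable improvement under this test.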

Figures

Figures reproduced from arXiv: 2605.22098 by Ahmed Anwar, Andreas Dengel, Brian Bernhard Moser, Federico Raue, Stanislav Frolov, Tobias Christian Nauen.

Figure 1. Setup of text-guided image-classification using …
Figure 2. λ sweep for ViT-S on ImageNet (100 epochs) with λt ≡ λ ∈ [0.0, 0.8]. TextTeacher improves over the baseline especially at λ = 0.5 with αadapt. Without adaption, accuracy drops sharply for λ > 0.3; with adaption it is stable up to λ = 0.6.
Figure 3. Jump schedules for ViT-S on ImageNet. At epoch …
Figure 4. 2D UMAP of the embedding space of ViT-S after training without and with …
Figure 6. Label-noise robustness for ViT-S. Accuracy declines with higher noise level ρ for both methods, but TextTeacher enlarges its margin over the baseline as ρ increases. To locate TextTeacher’s impact in a trained model, we study how guidance changes representational geometry across random initializations using centered kernel alignment (CKA; Kornblith et al., 2019) at different depths. Visualizing the layerw…
Figure 7. Limited data results for ViT-S with a fixed number of update steps.
Figure 8. Distribution of the length of image captions for different image captioners compared to ImageNet
Figure 9. Across run CKA similarity for ViT-S. While different runs are similar in the early layers, …
Figure 10. Most aligned and disaligned images for 6 random classes using ViT-B trained with …
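
The Figure 6 and Figure 9 captions refer to centered kernel alignment (CKA). A minimal sketch of the linear variant from Kornblith et al. (2019) is given below, in plain Python for readability. Whether the paper uses the linear or a kernel variant, and which activations it compares, is not stated on this page, so the function linear_cka and its inputs should be read as assumptions.

```python
import math


def center_columns(mat):
    """Subtract the per-feature mean from a row-major (n_samples x n_features) matrix."""
    n = len(mat)
    means = [sum(col) / n for col in zip(*mat)]
    return [[v - m for v, m in zip(row, means)] for row in mat]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features.

    x and y hold activations for the same n samples (rows), possibly with
    different feature widths. Undefined if either input has zero variance.
    """
    xc, yc = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(xc, yc)
    den = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return num / den
```

Linear CKA is invariant to rotations and isotropic scaling of the features, which is what makes it usable for comparing the same layer across independently initialized runs.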
Original abstract

The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding more accuracy at a constant compute budget or similar accuracy, but 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model and preserves the simplicity and latency of pure vision models. Project page with code and captions: https://nauen-it.de/publications/text-teacher

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces TextTeacher, a simple auxiliary objective for training vision models on image classification. It uses readily available image captions, a pre-trained frozen text encoder, and a lightweight projection to generate semantic anchors that guide representations during training. The approach leaves the inference-time model unchanged. On ImageNet with standard ViT backbones, the method reports accuracy gains of up to +2.7 percentage points and average transfer gains of +1.0 percentage points, outperforming vision knowledge distillation at constant or reduced compute while adding negligible overhead. The work is motivated by the platonic representation hypothesis and positions the text embeddings as complementary semantic cues that precondition deeper layers.

Significance. If the gains are robust and attributable to cross-modal semantic transfer rather than implicit class supervision, the result would provide practical evidence supporting the platonic representation hypothesis and a low-overhead route to improve pure vision models using language resources without multimodal retraining. The public release of code and captions strengthens reproducibility and enables independent verification of the reported improvements.

major comments (1)
  1. [§4] The analysis claims that TextTeacher acts as a feature-space preconditioner by supplying complementary semantic cues. However, no ablation is presented that masks class names in captions, replaces them with synonyms, or otherwise severs the correlation between caption content and target class labels while keeping the projection and training recipe fixed. This test is load-bearing for attributing the +2.7 p.p. ImageNet and +1.0 p.p. transfer gains to the stated cross-modal mechanism rather than soft label supervision.
minor comments (3)
  1. [§3] The auxiliary loss combining the standard classification objective with the projected text term should be written explicitly as an equation to clarify the weighting and optimization details.
  2. [Abstract, §5] The maximum +2.7 p.p. gain is reported without specifying the exact ViT variant, training schedule, or run that achieves it; adding this detail would improve precision.
  3. [Experiments] While the project page supplies code and captions, the main text should briefly describe caption sourcing and any filtering steps to support full replication from the manuscript alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting a key point about mechanism attribution. We address the major comment below and will incorporate the requested ablation in the revised manuscript.

Point-by-point responses
  1. Referee: [§4] The analysis claims that TextTeacher acts as a feature-space preconditioner by supplying complementary semantic cues. However, no ablation is presented that masks class names in captions, replaces them with synonyms, or otherwise severs the correlation between caption content and target class labels while keeping the projection and training recipe fixed. This test is load-bearing for attributing the +2.7 p.p. ImageNet and +1.0 p.p. transfer gains to the stated cross-modal mechanism rather than soft label supervision.

    Authors: We agree that the manuscript does not contain an ablation that explicitly removes or replaces class-related information from the captions while holding the rest of the pipeline fixed. Section 4 presents evidence that TextTeacher shapes deeper-layer representations early in training and yields gains beyond what is observed with standard vision-only baselines or distillation. Nevertheless, the referee is correct that this leaves open the possibility that part of the benefit arises from implicit class supervision encoded in the captions rather than richer cross-modal semantics. To address this directly, we will add the suggested ablation in the revision: we will mask class names, replace them with synonyms, or substitute generic descriptors in the captions and re-run the ImageNet and transfer experiments under identical training settings. The results will be reported alongside the existing analysis.
    Revision: yes
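
The class-name masking described in the response could look roughly like the following. It is a sketch under assumptions: the helper mask_class_names, the placeholder word, and whole-word case-insensitive matching are hypothetical choices, and the example class names in the comments are for illustration only. A real ablation would also need to handle plurals and synonyms, which this version does not.

```python
import re


def mask_class_names(caption, class_names, placeholder="object"):
    """Replace whole-word, case-insensitive occurrences of any class name."""
    if not class_names:
        return caption
    # Longest names first so a multi-word name such as "tiger shark"
    # is matched before its substring "shark".
    ordered = sorted(class_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(placeholder, caption)
```

Keeping the projection and training recipe fixed while swapping only the caption text is what would isolate the effect the referee asks about.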

Circularity Check

0 steps flagged

No significant circularity; empirical method with external components

full rationale

The paper introduces TextTeacher as an auxiliary training objective that injects embeddings from a frozen external text encoder (pre-trained independently) and publicly available image captions into ViT training. The reported gains (+2.7 p.p. ImageNet accuracy, +1.0 p.p. transfer) are presented as measured experimental outcomes under fixed recipes, not as quantities derived by construction from fitted parameters or self-referential definitions. No equations reduce the performance claims to tautological fits, and the central mechanism relies on independent, verifiable components (projection layer, auxiliary loss) rather than load-bearing self-citations or ansatzes smuggled from prior author work. The derivation chain is self-contained against external benchmarks such as standard supervised ViT training and vision distillation baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the platonic representation hypothesis as motivation and on the availability of image captions plus a frozen external text encoder; the lightweight projection is a trainable component whose exact parameterization is not detailed in the abstract.

free parameters (1)
  • lightweight projection
    Small trainable layer that maps text embeddings into the vision feature space; its weights are fitted during the auxiliary training.
axioms (1)
  • [domain assumption] Platonic representation hypothesis: sufficiently large models converge to a shared representation geometry across modalities
    Explicitly stated as the motivating premise in the abstract.
invented entities (1)
  • semantic anchors (no independent evidence)
    purpose: Additional training-time signals derived from text embeddings to guide vision representations
    Introduced as the output of the text encoder plus projection; no independent falsifiable prediction is given in the abstract.

pith-pipeline@v0.9.0 · 5775 in / 1405 out tokens · 38388 ms · 2026-05-22T07:20:53.414809+00:00 · methodology



Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 10 internal anchors

  1. [1]

    A survey on lexical ambiguity detection and word sense disambiguation

    Miuru Abeysiriwardana and Deshan Sumanathilaka. A survey on lexical ambiguity detection and word sense disambiguation. 2024. doi:10.48550/ARXIV.2403.16129

  2. [2]

    Label-embedding for image classification

    Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7): 1425--1438, 2015. ISSN 2160-9292. doi:10.1109/tpami.2015.2487986

  3. [3]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bau...

  4. [4]

    Multimodal datasets: misogyny, pornography, and malignant stereotypes

    Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. October 2021. doi:10.48550/ARXIV.2110.01963

  5. [5]

    Addressing uncertainty in the safety assurance of machine-learning

    Simon Burton and Benjamin Herd. Addressing uncertainty in the safety assurance of machine-learning. Frontiers in Computer Science, 5, 2023. ISSN 2624-9898. doi:10.3389/fcomp.2023.1132580

  6. [6]

    Isotropy in the contextual embedding space: Clusters and manifolds

    Xingyu Cai, Jiaji Huang, Yuchen Bian, and Kenneth Church. Isotropy in the contextual embedding space: Clusters and manifolds. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=xYGNO86OWDH

  7. [7]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (eds.), Computer Vision -- ECCV 2020, pp. 213--229, Cham, 2020. Springer International Publishing. ISBN 978-3-030-58452-8

  8. [8]

    Emerging Properties in Self-Supervised Vision Transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV). arXiv, 2021. doi:10.48550/ARXIV.2104.14294

  9. [9]

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pp. 1597--1607. PMLR, 2020. URL http://proceedings.mlr....

  10. [10]

    When vision transformers outperform resnets without pre-training or strong data augmentations

    Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations. In International Conference on Learning Representations, 2022

  11. [11]

    Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation strategies from data. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 113--123, 2019

  12. [12]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009. doi:10.1109/cvpr.2009.5206848

  13. [13]

    Virtex: Learning visual representations from textual annotations

    Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11157--11168, 2021. doi:10.1109/CVPR46437.2021.01101

  14. [14]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volu...

  15. [15]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, ...

  16. [16]

    Xcit: Cross-covariance image transformers

    Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jegou. Xcit: Cross-covariance image transformers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. doi:1...

  17. [17]

    Compressing visual-linguistic model via knowledge distillation

    Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lijuan Wang, Yezhou Yang, and Zicheng Liu. Compressing visual-linguistic model via knowledge distillation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1408--1418, 2021. doi:10.1109/ICCV48922.2021.00146

  18. [18]

    Caption supervision enables robust learners

    Benjamin Feuer, Ameya Joshi, and Chinmay Hegde. Caption supervision enables robust learners. 2022. doi:10.48550/ARXIV.2210.07396

  19. [19]

    Devise: A deep visual-semantic embedding model

    Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neu...

  20. [20]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580--587, 2014

  21. [21]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, and A...

  22. [22]

    Crossmodal knowledge distillation with wordnet-relaxed text embeddings for robust image classification

    Chenqi Guo, Mengshuo Rong, Qianli Feng, Rongfan Feng, and Yinglong Ma. Crossmodal knowledge distillation with wordnet-relaxed text embeddings for robust image classification. 2025. doi:10.48550/ARXIV.2503.24017

  23. [23]

    Polysemy—evidence from linguistics, behavioral science, and contextualized language models

    Janosch Haber and Massimo Poesio. Polysemy—evidence from linguistics, behavioral science, and contextualized language models. Computational Linguistics, 50(1): 351--417, 2024. ISSN 1530-9312. doi:10.1162/coli_a_00500

  24. [24]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770--778, 2016

  25. [25]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pp. 2980--2988. arXiv, 2017. ISBN 978-1-5386-1032-9. doi:10.48550/ARXIV.1703.06870

  26. [26]

    Masked image pretraining on language assisted representation

    Zejiang Hou and Sun-Yuan Kung. Masked image pretraining on language assisted representation. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025. doi:10.1109/ICASSP49660.2025.10888259

  27. [27]

    Position: The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learn...

  28. [28]

    Vl2lite: Task-specific knowledge distillation from large vision-language models to lightweight networks

    Jinseong Jang, Chunfei Ma, and Byeongwon Lee. Vl2lite: Task-specific knowledge distillation from large vision-language models to lightweight networks, 2025

  29. [29]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine...

  30. [30]

    Combining Weakly and Webly Supervised Learning for Classifying Food Images

    Parneet Kaur, Karan Sikka, and Ajay Divakaran. Combining weakly and webly supervised learning for classifying food images. 2017. doi:10.48550/ARXIV.1712.08730

  31. [31]

    TIPS: Text-image pretraining with spatial awareness

    Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, and Andre Araujo. TIPS: Text-image pretraining with spatial awareness. In The Thirteenth International Conference on Learning Representations, 2025

  32. [32]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3519--3529. PMLR, 09--15 Jun 2019. URL http://p...

  33. [33]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013

  34. [34]

    Nv-embed: Improved techniques for training llms as generalist embedding models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), International Conference on Learning Representations, volume 2025, pp. 79310--79333. arXiv, 2025

  35. [35]

    Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios

    Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, and Xin Pan. Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios. 2022. doi:10.48550/ARXIV.2207.05501

  36. [36]

    Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23, 2023

  37. [37]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 34892--34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce328891...

  38. [38]

    Asymmetric visual semantic embedding framework for efficient vision-language alignment

    Yang Liu, Mengyuan Liu, Shudong Huang, and Jiancheng Lv. Asymmetric visual semantic embedding framework for efficient vision-language alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6): 5676--5684, April 2025. ISSN 2159-5399. doi:10.1609/aaai.v39i6.32605

  39. [39]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992--10002, Los Alamitos, CA, USA, 10 2021. IEEE Computer Society. doi:10.1109/ICCV48922.2021.00986

  40. [40]

    Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm

    Wenxuan Ma, Shuang Li, JinMing Zhang, Chi Harold Liu, Jingxuan Kang, Yulin Wang, and Gao Huang. Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 18786--18797, October 2023

  41. [41]

    S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013

  42. [42]

    Slip: Self-supervision meets language-image pre-training

    Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision -- ECCV 2022, pp. 529--544, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19809-0

  43. [43]

    Systematic literature review of validation methods for AI systems

    Lalli Myllyaho, Mikko Raatikainen, Tomi Männistö, Tommi Mikkonen, and Jukka K. Nurminen. Systematic literature review of validation methods for ai systems. Journal of Systems and Software, 181: 111050, 2021. ISSN 0164-1212. doi:10.1016/j.jss.2021.111050

  44. [44]

    Clip-embed-kd: Computationally efficient knowledge distillation using embeddings as teachers

    Lakshmi Nair. Clip-embed-kd: Computationally efficient knowledge distillation using embeddings as teachers. Extended abstract: 28th IEEE High Performance Extreme Computing Conference (HPEC) 2024 - Outstanding short paper award, 2024

  45. [45]

    Which transformer to favor: A comparative analysis of efficiency in vision transformers

    Tobias Christian Nauen, Sebastian Palacio, Federico Raue, and Andreas Dengel. Which transformer to favor: A comparative analysis of efficiency in vision transformers. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pp. 6955--6966, February 2025

  46. [46]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008

  47. [47]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Pat...

  48. [48]

    Cats and dogs

    Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012

  49. [49]

    Development methodologies for safety critical machine learning applications in the automotive domain: A survey

    Martin Rabe, Stefan Milz, and Patrick Mader. Development methodologies for safety critical machine learning applications in the automotive domain: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 129--141, June 2021

  50. [50]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

  51. [51]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://openai.com/blog/better-language-models/

  52. [52]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine L...

  53. [53]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  54. [54]

    The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models

    Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. In Yvette Graham and Matthew Purver (eds.), Findings of the Association for Computational Linguistics: EACL 2024, pp. 868–874, St. Julian's, Malta, Marc...

  55. [55]

    FCN-Transformer Feature Fusion for Polyp Segmentation

    Edward Sanderson and Bogdan J. Matuszewski. FCN-Transformer Feature Fusion for Polyp Segmentation, pp. 892–907. Springer International Publishing, 2022. ISBN 9783031120534. doi:10.1007/978-3-031-12053-4_65

  56. [56]

    A fistful of words: Learning transferable visual models from bag-of-words supervision

    Ajinkya Tejankar, Maziar Sanjabi, Bichen Wu, Saining Xie, Madian Khabsa, Hamed Pirsiavash, and Hamed Firooz. A fistful of words: Learning transferable visual models from bag-of-words supervision, 2022. URL https://arxiv.org/abs/2112.13884

  57. [57]

    Dragonfly: Multi-resolution zoom-in encoding enhances vision-language models

    Rahul Thapa, Kezhen Chen, Ian Covert, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, and James Zou. Dragonfly: Multi-resolution zoom-in encoding enhances vision-language models. 2024. doi:10.48550/ARXIV.2406.00977

  58. [58]

    What makes for good views for contrastive learning?

    Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6827–6839. Curran Associates, Inc., 2020

  59. [59]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10347–...

  60. [60]

    DeiT III: Revenge of the ViT

    Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: Revenge of the ViT. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision – ECCV 2022, pp. 516–533, Cham, 2022. Springer Nature Switzerland

  61. [61]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. 2023. doi:10.48550/ARXIV.2302.13971

  62. [62]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  63. [63]

    Effisegnet: Gastrointestinal polyp segmentation through a pre-trained efficientnet-based network with a simplified decoder

    Ioannis Vezakis, Konstantinos Georgas, Dimitrios Fotiadis, and George Matsopoulos. Effisegnet: Gastrointestinal polyp segmentation through a pre-trained efficientnet-based network with a simplified decoder. In 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1–4, July 2024. doi:10.1109/EMBC53108...

  64. [64]

    C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Caltech-UCSD Birds 200. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011

  65. [65]

    InternImage: Exploring large-scale vision foundation models with deformable convolutions

    Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, and Yu Qiao. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14408–14419, 2023

  66. [66]

    Advances in neural architecture search

    Xin Wang and Wenwu Zhu. Advances in neural architecture search. National Science Review, 11(8), 2024. ISSN 2053-714X. doi:10.1093/nsr/nwae282

  67. [67]

    Energy efficiency of training neural network architectures: An empirical study

    Yinlena Xu, Silverio Martínez-Fernández, Matias Martinez, and Xavier Franch. Energy efficiency of training neural network architectures: An empirical study. In Proceedings of the 56th Hawaii International Conference on System Sciences, HICSS. Hawaii International Conference on System Sciences, 2023. doi:10.24251/hicss.2023.098

  68. [68]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  69. [69]

    CLIP-KD: An empirical study of CLIP model distillation

    Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. CLIP-KD: An empirical study of CLIP model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  70. [70]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  71. [71]

    Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

    Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020. doi:10.48550/arxiv.1904.00962. URL https://openreview.net/forum?id=Syx4wnEtvH

  72. [72]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. doi:10.48550/arxiv.2205.01917. URL https://openreview.net/forum?id=Ee277P3AYC

  73. [73]

    CutMix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. CutMix: Regularization strategy to train strong classifiers with localizable features. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019. doi:10.1109/iccv.2019.00612

  74. [74]

    Dic-transformer: interpretation of plant disease classification results using image caption generation technology

    Qingtian Zeng, Jian Sun, and Shansong Wang. Dic-transformer: interpretation of plant disease classification results using image caption generation technology. Frontiers in Plant Science, 14, 2024. ISSN 1664-462X. doi:10.3389/fpls.2023.1273029

  75. [75]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb

  76. [76]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. 2025. doi:10.48550/ARXIV.2506.05176

  77. [77]

    Random erasing data augmentation

    Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020