TextTeacher: What Can Language Teach About Images?
Pith reviewed 2026-05-22 07:20 UTC · model grok-4.3
The pith
TextTeacher uses image captions and a frozen text encoder to improve vision model accuracy on ImageNet by up to 2.7 percentage points without altering the inference model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TextTeacher is a simple auxiliary objective that injects text embeddings as additional information into image classification training. It uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged.
What carries the argument
The TextTeacher auxiliary objective, which projects the outputs of a frozen text encoder into the vision feature space so that they act as semantic anchors and precondition deeper layers.
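As a rough illustration of the objective described above, the following NumPy sketch combines a standard cross-entropy term with a CLIP-style contrastive term between image features and projected caption embeddings. The symmetric InfoNCE form, the temperature, the `aux_weight` value, the linear projection, and all function names are assumptions made for illustration; this review does not state the paper's exact loss or weighting.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_entropy(logits, targets):
    """Mean cross-entropy for integer class targets."""
    return -log_softmax(logits)[np.arange(len(targets)), targets].mean()


def text_anchor_loss(image_feats, text_embs, proj, temperature=0.07):
    """Symmetric InfoNCE between image features and projected caption embeddings.

    image_feats: (B, d_img) features from the trainable vision backbone.
    text_embs:   (B, d_txt) outputs of the frozen text encoder (treated as constants).
    proj:        (d_txt, d_img) lightweight linear projection (assumed form).
    """
    img = l2_normalize(image_feats)
    anchors = l2_normalize(text_embs @ proj)  # semantic anchors in vision space
    sims = img @ anchors.T / temperature      # (B, B); matched pairs on the diagonal
    idx = np.arange(len(img))
    return 0.5 * (cross_entropy(sims, idx) + cross_entropy(sims.T, idx))


def training_loss(class_logits, labels, image_feats, text_embs, proj, aux_weight=0.5):
    """Classification loss plus the weighted auxiliary text term (training only)."""
    return cross_entropy(class_logits, labels) + aux_weight * text_anchor_loss(
        image_feats, text_embs, proj
    )


# Random stand-in tensors; a real run would use backbone and text-encoder outputs.
rng = np.random.default_rng(0)
B, d_img, d_txt, n_classes = 8, 16, 32, 10
loss = training_loss(
    class_logits=rng.normal(size=(B, n_classes)),
    labels=rng.integers(0, n_classes, size=B),
    image_feats=rng.normal(size=(B, d_img)),
    text_embs=rng.normal(size=(B, d_txt)),
    proj=rng.normal(size=(d_txt, d_img)) * 0.1,
)
print(float(loss))
```

Because the text encoder and the projection appear only inside `training_loss`, dropping them at inference leaves the vision model untouched, which matches the claim that the inference-time model is unchanged.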
If this is right
- Yields up to 2.7 percentage point accuracy gains on ImageNet with standard ViT backbones.
- Produces average 1.0 percentage point gains on transfer tasks under the same training recipe.
- Outperforms vision-only knowledge distillation at constant compute or achieves similar accuracy 33 percent faster.
- Shapes deeper layers in the first stages of training to aid generalization.
Where Pith is reading between the lines
- This suggests vision models can gain from abundant text data without requiring joint multimodal pretraining of the vision backbone.
- The preconditioning effect could extend to other vision tasks such as object detection if similar caption-based anchors are used.
- It raises the possibility that future scaling of vision models might benefit more from simple text guidance than from increased vision-only data alone.
Load-bearing premise
Readily available image captions supply complementary semantic cues that a frozen text encoder and lightweight projection can turn into useful guidance for vision features.
What would settle it
Training the same ViT backbone on ImageNet with and without the TextTeacher auxiliary loss and finding no accuracy gain or transfer improvement under matched compute and recipe.
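A matched comparison of this kind could be summarized as follows. This is a minimal sketch that assumes each configuration is trained with the same seeds; the helper name `paired_gain` is hypothetical, and the accuracy values are placeholders for illustration, not results from the paper.

```python
from statistics import mean, stdev


def paired_gain(baseline_acc, textteacher_acc):
    """Per-seed accuracy differences (in p.p.), their mean, and sample std."""
    if len(baseline_acc) != len(textteacher_acc):
        raise ValueError("runs must be paired seed by seed")
    diffs = [t - b for b, t in zip(baseline_acc, textteacher_acc)]
    spread = stdev(diffs) if len(diffs) > 1 else 0.0
    return diffs, mean(diffs), spread


# Placeholder numbers for illustration only; they are NOT results from the paper.
baseline = [70.0, 70.4, 69.8]
with_aux = [71.0, 71.2, 71.1]
diffs, gain, spread = paired_gain(baseline, with_aux)
print(f"mean gain = {gain:.2f} p.p., std = {spread:.2f} p.p.")
```

A mean gain that is indistinguishable from zero relative to the seed-to-seed spread, under identical recipe and compute, would count against the claim.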
Original abstract
The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding more accuracy at a constant compute budget or similar accuracy, but 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model and preserves the simplicity and latency of pure vision models. Project page with code and captions: https://nauen-it.de/publications/text-teacher
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TextTeacher, a simple auxiliary objective for training vision models on image classification. It uses readily available image captions, a pre-trained frozen text encoder, and a lightweight projection to generate semantic anchors that guide representations during training. The approach leaves the inference-time model unchanged. On ImageNet with standard ViT backbones, the method reports accuracy gains of up to +2.7 percentage points and average transfer gains of +1.0 percentage points, outperforming vision knowledge distillation at constant or reduced compute while adding negligible overhead. The work is motivated by the platonic representation hypothesis and positions the text embeddings as complementary semantic cues that precondition deeper layers.
Significance. If the gains are robust and attributable to cross-modal semantic transfer rather than implicit class supervision, the result would provide practical evidence supporting the platonic representation hypothesis and a low-overhead route to improve pure vision models using language resources without multimodal retraining. The public release of code and captions strengthens reproducibility and enables independent verification of the reported improvements.
major comments (1)
- [§4] The analysis claims that TextTeacher acts as a feature-space preconditioner by supplying complementary semantic cues. However, no ablation is presented that masks class names in captions, replaces them with synonyms, or otherwise severs the correlation between caption content and target class labels while keeping the projection and training recipe fixed. This test is load-bearing for attributing the +2.7 p.p. ImageNet and +1.0 p.p. transfer gains to the stated cross-modal mechanism rather than soft label supervision.
minor comments (3)
- [§3] The auxiliary loss combining the standard classification objective with the projected text term should be written explicitly as an equation to clarify the weighting and optimization details.
- [Abstract, §5] The maximum +2.7 p.p. gain is reported without specifying the exact ViT variant, training schedule, or run that achieves it; adding this detail would improve precision.
- [Experiments] While the project page supplies code and captions, the main text should briefly describe caption sourcing and any filtering steps to support full replication from the manuscript alone.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting a key point about mechanism attribution. We address the major comment below and will incorporate the requested ablation in the revised manuscript.
- Referee: [§4] The analysis claims that TextTeacher acts as a feature-space preconditioner by supplying complementary semantic cues. However, no ablation is presented that masks class names in captions, replaces them with synonyms, or otherwise severs the correlation between caption content and target class labels while keeping the projection and training recipe fixed. This test is load-bearing for attributing the +2.7 p.p. ImageNet and +1.0 p.p. transfer gains to the stated cross-modal mechanism rather than soft label supervision.
  Authors: We agree that the manuscript does not contain an ablation that explicitly removes or replaces class-related information from the captions while holding the rest of the pipeline fixed. Section 4 presents evidence that TextTeacher shapes deeper-layer representations early in training and yields gains beyond what is observed with standard vision-only baselines or distillation. Nevertheless, the referee is correct that this leaves open the possibility that part of the benefit arises from implicit class supervision encoded in the captions rather than richer cross-modal semantics. To address this directly, we will add the suggested ablation in the revision: we will mask class names, replace them with synonyms, or substitute generic descriptors in the captions and re-run the ImageNet and transfer experiments under identical training settings. The results will be reported alongside the existing analysis.
  Revision: yes
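The class-name masking variant of the ablation discussed above could be prototyped along these lines. This is only a sketch: the function name `mask_class_names`, the placeholder word, and the naive plural handling are assumptions, and it does not catch synonyms or paraphrases of a class name.

```python
import re


def mask_class_names(caption, class_names, placeholder="object"):
    """Replace whole-word, case-insensitive mentions of any class name."""
    if not class_names:
        return caption
    # Longest names first so "tiger shark" is masked before "shark".
    names = sorted(class_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")(?:s|es)?\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(placeholder, caption)


caption = "A tiger shark swims past two sharks near a reef."
print(mask_class_names(caption, ["shark", "tiger shark"]))
```

Training with masked captions while keeping the projection and recipe fixed would indicate how much of the gain survives once the direct label signal is removed from the text.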
Circularity Check
No significant circularity; empirical method with external components
The paper introduces TextTeacher as an auxiliary training objective that injects embeddings from a frozen external text encoder (pre-trained independently) and publicly available image captions into ViT training. The reported gains (+2.7 p.p. ImageNet accuracy, +1.0 p.p. transfer) are presented as measured experimental outcomes under fixed recipes, not as quantities derived by construction from fitted parameters or self-referential definitions. No equations reduce the performance claims to tautological fits, and the central mechanism relies on independent, verifiable components (projection layer, auxiliary loss) rather than load-bearing self-citations or ansatzes smuggled from prior author work. The derivation chain is self-contained against external benchmarks such as standard supervised ViT training and vision distillation baselines.
Axiom & Free-Parameter Ledger
free parameters (1)
- lightweight projection
axioms (1)
- Domain assumption: the platonic representation hypothesis, i.e. sufficiently large models converge to a shared representation geometry across modalities
invented entities (1)
- semantic anchors (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors... auxiliary CLIP-style contrastive loss... acts as a feature-space preconditioner, shaping deeper layers in the first stages of training"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A survey on lexical ambiguity detection and word sense disambiguation
Miuru Abeysiriwardana and Deshan Sumanathilaka. A survey on lexical ambiguity detection and word sense disambiguation. 2024. doi:10.48550/ARXIV.2403.16129
-
[2]
Label-embedding for image classification
Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38 0 (7): 0 1425--1438, 2015. ISSN 2160-9292. doi:10.1109/tpami.2015.2487986
-
[3]
PaliGemma: A versatile 3B VLM for transfer
Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bau...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.07726 2024
-
[4]
Multimodal datasets: misogyny, pornography, and malignant stereotypes
Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. October 2021. doi:10.48550/ARXIV.2110.01963
-
[5]
Addressing uncertainty in the safety assurance of machine-learning
Simon Burton and Benjamin Herd. Addressing uncertainty in the safety assurance of machine-learning. Frontiers in Computer Science, 5, 2023. ISSN 2624-9898. doi:10.3389/fcomp.2023.1132580
-
[6]
Isotropy in the contextual embedding space: Clusters and manifolds
Xingyu Cai, Jiaji Huang, Yuchen Bian, and Kenneth Church. Isotropy in the contextual embedding space: Clusters and manifolds. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=xYGNO86OWDH
work page 2021
-
[7]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (eds.), Computer Vision -- ECCV 2020, pp.\ 213--229, Cham, 2020. Springer International Publishing. ISBN 978-3-030-58452-8
work page 2020
-
[8]
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV). arXiv, 2021. doi:10.48550/ARXIV.2104.14294
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2104.14294 2021
-
[9]
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , Proceedings of Machine Learning Research, pp.\ 1597--1607. PMLR , 2020. URL http://proceedings.mlr....
work page 2020
-
[10]
When vision transformers outperform resnets without pre-training or strong data augmentations
Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations. In International Conference on Learning Representations, 2022
work page 2022
-
[11]
Ekin Dogus Cubuk, Barret Zoph, Dandelion Man \'e , Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation strategies from data. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 113--123, 2019
work page 2019
-
[12]
Imagenet: A large- scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet : A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition . IEEE , 2009. doi:10.1109/cvpr.2009.5206848
-
[13]
Virtex: Learning visual representations from textual annotations
Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 11157--11168, 2021. doi:10.1109/CVPR46437.2021.01101
-
[14]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volu...
-
[15]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, ...
work page 2021
-
[16]
Xcit: Cross-covariance image transformers
Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jegou. Xcit: Cross-covariance image transformers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. doi:1...
-
[17]
Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lijuan Wang, Yezhou Yang, and Zicheng Liu. Compressing visual-linguistic model via knowledge distillation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 1408--1418, 2021. doi:10.1109/ICCV48922.2021.00146
-
[18]
Caption supervision enables robust learners
Benjamin Feuer, Ameya Joshi, and Chinmay Hegde. Caption supervision enables robust learners. 2022. doi:10.48550/ARXIV.2210.07396
-
[19]
Devise: A deep visual-semantic embedding model
Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neu...
work page 2013
-
[20]
Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik
Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 580--587, 2014
work page 2014
-
[21]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, and A...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
-
[22]
Chenqi Guo, Mengshuo Rong, Qianli Feng, Rongfan Feng, and Yinglong Ma. Crossmodal knowledge distillation with wordnet-relaxed text embeddings for robust image classification. 2025. doi:10.48550/ARXIV.2503.24017
-
[23]
Polysemy—evidence from linguistics, behavioral science, and contextualized language models
Janosch Haber and Massimo Poesio. Polysemy—evidence from linguistics, behavioral science, and contextualized language models. Computational Linguistics, 50 0 (1): 0 351--417, 2024. ISSN 1530-9312. doi:10.1162/coli_a_00500
-
[24]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016
work page 2016
-
[25]
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pp.\ 2980--2988. arXiv, 2017. ISBN 978-1-5386-1032-9. doi:10.48550/ARXIV.1703.06870
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1703.06870 2017
-
[26]
Masked image pretraining on language assisted representation
Zejiang Hou and Sun-Yuan Kung. Masked image pretraining on language assisted representation. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025. doi:10.1109/ICASSP49660.2025.10888259
-
[27]
Position: The platonic representation hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learn...
work page 2024
-
[28]
Jinseong Jang, Chunfei Ma, and Byeongwon Lee. Vl2lite: Task-specific knowledge distillation from large vision-language models to lightweight networks, 2025
work page 2025
-
[29]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine...
work page 2021
-
[30]
Combining Weakly and Webly Supervised Learning for Classifying Food Images
Parneet Kaur, Karan Sikka, and Ajay Divakaran. Combining weakly and webly supervised learning for classifying food images. 2017. doi:10.48550/ARXIV.1712.08730
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1712.08730 2017
-
[31]
TIPS : Text-image pretraining with spatial awareness
Kevis kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, and Andre Araujo. TIPS : Text-image pretraining with spatial awareness. In The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[32]
Similarity of neural network representations revisited
Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.\ 3519--3529. PMLR, 09--15 Jun 2019. URL http://p...
work page 2019
-
[33]
3d object representations for fine-grained categorization
Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013
work page 2013
-
[34]
Nv-embed: Improved techniques for training llms as generalist embedding models
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), International Conference on Learning Representations, volume 2025, pp.\ 79310--79333. arXiv, 2025
work page 2025
-
[35]
Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, and Xin Pan. Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios. 2022. doi:10.48550/ARXIV.2207.05501
-
[36]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23, 2023
work page 2023
-
[37]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.\ 34892--34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce328891...
work page 2023
-
[38]
Asymmetric visual semantic embedding framework for efficient vision-language alignment
Yang Liu, Mengyuan Liu, Shudong Huang, and Jiancheng Lv. Asymmetric visual semantic embedding framework for efficient vision-language alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 39 0 (6): 0 5676--5684, April 2025. ISSN 2159-5399. doi:10.1609/aaai.v39i6.32605
-
[39]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 9992--10002, Los Alamitos, CA, USA, 10 2021. IEEE Computer Society. doi:10.1109/ICCV48922.2021.00986
-
[40]
Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm
Wenxuan Ma, Shuang Li, JinMing Zhang, Chi Harold Liu, Jingxuan Kang, Yulin Wang, and Gao Huang. Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 18786--18797, October 2023
work page 2023
-
[41]
S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013
work page 2013
-
[42]
Slip: Self-supervision meets language-image pre-training
Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In Shai Avidan, Gabriel Brostow, Moustapha Ciss \'e , Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision -- ECCV 2022, pp.\ 529--544, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19809-0
work page 2022
-
[43]
Lalli Myllyaho, Mikko Raatikainen, Tomi Männistö, Tommi Mikkonen, and Jukka K. Nurminen. Systematic literature review of validation methods for ai systems. Journal of Systems and Software, 181: 0 111050, 2021. ISSN 0164-1212. doi:10.1016/j.jss.2021.111050
-
[44]
Clip-embed-kd: Computationally efficient knowledge distillation using embeddings as teachers
Lakshmi Nair. Clip-embed-kd: Computationally efficient knowledge distillation using embeddings as teachers. Extended abstract: 28th IEEE High Performance Extreme Computing Conference (HPEC) 2024 - Outstanding short paper award, 2024
work page 2024
-
[45]
Which transformer to favor: A comparative analysis of efficiency in vision transformers
Tobias Christian Nauen, Sebastian Palacio, Federico Raue, and Andreas Dengel. Which transformer to favor: A comparative analysis of efficiency in vision transformers. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pp.\ 6955--6966, February 2025
work page 2025
-
[46]
Automated flower classification over a large number of classes
Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008
work page 2008
-
[47]
Maxime Oquab, Timoth \'e e Darcet, Th \'e o Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Pat...
work page 2024
-
[48]
Parkhi, Andrea Vedaldi, Andrew Zisserman, and C
Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012
work page 2012
-
[49]
Martin Rabe, Stefan Milz, and Patrick Mader. Development methodologies for safety critical machine learning applications in the automotive domain: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp.\ 129--141, June 2021
work page 2021
-
[50]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018
work page 2018
-
[51]
Language models are unsupervised multitask learners
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://openai.com/blog/better-language-models/
work page 2019
-
[52]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine L...
work page 2021
-
[53]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21 0 (140): 0 1--67, 2020
work page 2020
-
[54]
The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models
Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. In Yvette Graham and Matthew Purver (eds.), Findings of the Association for Computational Linguistics: EACL 2024, pp.\ 868--874, St. Julian ' s, Malta, Marc...
-
[55]
Edward Sanderson and Bogdan J. Matuszewski. FCN-Transformer Feature Fusion for Polyp Segmentation, pp.\ 892--907. Springer International Publishing, 2022. ISBN 9783031120534. doi:10.1007/978-3-031-12053-4_65
-
[56]
A fistful of words: Learning transferable visual models from bag-of-words supervision, 2022
Ajinkya Tejankar, Maziar Sanjabi, Bichen Wu, Saining Xie, Madian Khabsa, Hamed Pirsiavash, and Hamed Firooz. A fistful of words: Learning transferable visual models from bag-of-words supervision, 2022. URL https://arxiv.org/abs/2112.13884
-
[57]
Dragonfly: Multi-resolution zoom-in encoding enhances vision-language models
Rahul Thapa, Kezhen Chen, Ian Covert, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, and James Zou. Dragonfly: Multi-resolution zoom-in encoding enhances vision-language models. 2024. doi:10.48550/ARXIV.2406.00977
-
[58]
What makes for good views for contrastive learning? In H
Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.\ 6827--6839. Curran Associates, Inc., 2020
work page 2020
-
[59]
Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.\ 10347--...
work page 2021
-
[60]
Hugo Touvron, Matthieu Cord, and Herv \'e J \'e gou. Deit iii: Revenge of the vit. In Shai Avidan, Gabriel Brostow, Moustapha Ciss \'e , Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision -- ECCV 2022, pp.\ 516--533, Cham, 2022. Springer Nature Switzerland
work page 2022
-
[61]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. 2023. doi:10.48550/ARXIV.2302.13971
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.13971 2023
-
[62]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
work page 2017
-
[63]
Ioannis Vezakis, Konstantinos Georgas, Dimitrios Fotiadis, and George Matsopoulos. Effisegnet: Gastrointestinal polyp segmentation through a pre-trained efficientnet-based network with a simplified decoder. In 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp.\ 1--4, 07 2024. doi:10.1109/EMBC53108...
-
[64]
C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Caltech-UCSD birds 200. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011
-
[65]
Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, and Yu Qiao. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14408–14419, 2023
-
[66]
Xin Wang and Wenwu Zhu. Advances in neural architecture search. National Science Review, 11(8), 2024. ISSN 2053-714X. doi:10.1093/nsr/nwae282
-
[67]
Yinlena Xu, Silverio Martínez-Fernández, Matias Martinez, and Xavier Franch. Energy efficiency of training neural network architectures: An empirical study. In Proceedings of the 56th Hawaii International Conference on System Sciences, HICSS. Hawaii International Conference on System Sciences, 2023. doi:10.24251/hicss.2023.098
-
[68]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
-
[69]
Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. CLIP-KD: An empirical study of CLIP model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
-
[70]
Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
-
[71]
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020. doi:10.48550/arxiv.1904.00962. URL https://openreview.net/forum?id=Syx4wnEtvH
-
[72]
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. doi:10.48550/arxiv.2205.01917. URL https://openreview.net/forum?id=Ee277P3AYC
-
[73]
Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. CutMix: Regularization strategy to train strong classifiers with localizable features. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019. doi:10.1109/iccv.2019.00612
-
[74]
Qingtian Zeng, Jian Sun, and Shansong Wang. DIC-Transformer: Interpretation of plant disease classification results using image caption generation technology. Frontiers in Plant Science, 14, 2024. ISSN 1664-462X. doi:10.3389/fpls.2023.1273029
-
[75]
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb
-
[76]
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. 2025. doi:10.48550/arxiv.2506.05176
-
[77]
Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020