TextTeacher: What Can Language Teach About Images?

Ahmed Anwar; Andreas Dengel; Brian Bernhard Moser; Federico Raue; Stanislav Frolov; Tobias Christian Nauen

arXiv: 2605.22098 · v1 · pith:Y6F4J4KGnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI · cs.LG

Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel

Pith reviewed 2026-05-22 07:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords TextTeacher · vision-language · image classification · knowledge distillation · ImageNet · ViT · auxiliary objective · semantic anchors

The pith

TextTeacher uses image captions and a frozen text encoder to improve vision model accuracy on ImageNet by up to 2.7 percentage points without altering the inference model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether semantic knowledge from language models can efficiently guide vision models by leveraging the platonic representation hypothesis. It introduces TextTeacher as an auxiliary objective that injects projected text embeddings from readily available captions into image classification training. This matters to a sympathetic reader because it promises higher accuracy and better transfer with negligible added cost and no change to the final deployed vision model. The approach avoids costly multimodal training of the target model itself while supplying complementary semantic cues that precondition deeper layers early in training.

Core claim

TextTeacher is a simple auxiliary objective that injects text embeddings as additional information into image classification training. It uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged.
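The objective described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the contrastive form follows the "CLIP-style contrastive loss" wording quoted later on this page, the additive weighting with `lam` is assumed, and the names `project`, `contrastive_loss`, `total_loss`, and the temperature value are hypothetical placeholders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def project(text_emb, weight):
    """Lightweight linear projection; weight has one row per vision dimension."""
    return [sum(w * t for w, t in zip(row, text_emb)) for row in weight]

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def contrastive_loss(img_feats, anchors, temperature=0.07):
    """Symmetric InfoNCE: the i-th image feature should match the i-th anchor."""
    img = [l2_normalize(v) for v in img_feats]
    txt = [l2_normalize(v) for v in anchors]
    n = len(img)
    sims = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
             for j in range(n)] for i in range(n)]
    image_to_text = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([sims[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

def total_loss(class_logits, labels, img_feats, text_embs, weight, lam=0.5):
    """Classification loss plus lam times the text-anchor loss (assumed form)."""
    ce = sum(cross_entropy(l, y) for l, y in zip(class_logits, labels)) / len(labels)
    # Text embeddings come from a frozen encoder; only `weight` would be trained.
    anchors = [project(t, weight) for t in text_embs]
    return ce + lam * contrastive_loss(img_feats, anchors)
```

Because the text branch only enters through `total_loss`, dropping it at inference leaves the vision model and its classifier head untouched, which is the property the core claim emphasizes.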

What carries the argument

The TextTeacher auxiliary objective, which projects frozen text encoder outputs into the vision feature space so that they act as semantic anchors and precondition deeper layers.

If this is right

Yields up to 2.7 percentage point accuracy gains on ImageNet with standard ViT backbones.
Produces average 1.0 percentage point gains on transfer tasks under the same training recipe.
Outperforms vision-only knowledge distillation at constant compute or achieves similar accuracy 33 percent faster.
Shapes deeper layers in the first stages of training to aid generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This suggests vision models can gain from abundant text data without requiring joint multimodal pretraining of the vision backbone.
The preconditioning effect could extend to other vision tasks such as object detection if similar caption-based anchors are used.
It raises the possibility that future scaling of vision models might benefit more from simple text guidance than from increased vision-only data alone.

Load-bearing premise

Readily available image captions supply complementary semantic cues that a frozen text encoder and lightweight projection can turn into useful guidance for vision features.

What would settle it

Training the same ViT backbone on ImageNet with and without the TextTeacher auxiliary loss and finding no accuracy gain or transfer improvement under matched compute and recipe.

Figures

Figures reproduced from arXiv: 2605.22098 by Ahmed Anwar, Andreas Dengel, Brian Bernhard Moser, Federico Raue, Stanislav Frolov, Tobias Christian Nauen.

**Figure 2.** λ sweep for ViT-S on ImageNet (100 epochs) with λt ≡ λ ∈ [0.0, 0.8]. TextTeacher improves over the baseline especially at λ = 0.5 with αadapt. Without adaption, accuracy drops sharply for λ > 0.3; with adaption it is stable up to λ = 0.6.

**Figure 3.** Jump schedules for ViT-S on ImageNet. At epoch …

**Figure 4.** 2D-umap of the embedding space of ViT-S after training without and with …

**Figure 6.** Label-noise robustness for ViT-S. Accuracy declines with higher noise level ρ for both methods, but TextTeacher enlarges its margin over the baseline as ρ increases. To locate TextTeacher’s impact in a trained model, we study how guidance changes representational geometry across random initializations using centered kernel alignment (CKA; Kornblith et al., 2019) at different depths. Visualizing the layerw…

**Figure 7.** Limited data results for ViT-S with a fixed number of update steps.

**Figure 8.** Distribution of the length of image captions for different image captioners compared to ImageNet …

**Figure 9.** Across run CKA similarity for ViT-S. While different runs are similar in the early layers, …

**Figure 10.** Most aligned and disaligned images for 6 random classes using ViT-B trained with …
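The Figure 6 and Figure 9 captions refer to centered kernel alignment (CKA) between runs at different depths. As a rough aid to reading those figures, here is a minimal sketch of the linear variant of CKA, assuming each argument is a list of per-example activation vectors taken from one layer for the same inputs. Whether the paper uses the linear or a kernel variant is not stated on this page, and the function names are illustrative.

```python
import math

def center_columns(m):
    """Subtract the per-feature mean so every column sums to zero."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(means))] for row in m]

def cross_cov_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two activation matrices with matching rows."""
    x, y = center_columns(x), center_columns(y)
    den = math.sqrt(cross_cov_sq_norm(x, x) * cross_cov_sq_norm(y, y))
    if den == 0:
        raise ValueError("CKA is undefined for constant activations")
    return cross_cov_sq_norm(x, y) / den
```

Linear CKA is invariant to rotations and uniform rescaling of either representation, which is why it is a common choice for comparing layers across independently initialized runs.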

Original abstract

The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding more accuracy at a constant compute budget or similar accuracy, but 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model and preserves the simplicity and latency of pure vision models. Project page with code and captions: https://nauen-it.de/publications/text-teacher

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

TextTeacher gets some real-looking accuracy lifts on ImageNet by adding projected text embeddings as an auxiliary loss, but the gains may come from simpler supervision effects than the platonic story suggests.

The main takeaway is that TextTeacher adds a lightweight auxiliary objective during vision training that pulls in frozen text embeddings from image captions. On standard ViT backbones it reports up to 2.7 points higher ImageNet accuracy and roughly 1 point average transfer improvement, all while leaving the final model unchanged at inference and adding almost no extra cost. It also beats their vision distillation baseline under the same compute budget or runs faster for similar accuracy. They release code and the captions, which helps reproducibility.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces TextTeacher, a simple auxiliary objective for training vision models on image classification. It uses readily available image captions, a pre-trained frozen text encoder, and a lightweight projection to generate semantic anchors that guide representations during training. The approach leaves the inference-time model unchanged. On ImageNet with standard ViT backbones, the method reports accuracy gains of up to +2.7 percentage points and average transfer gains of +1.0 percentage points, outperforming vision knowledge distillation at constant or reduced compute while adding negligible overhead. The work is motivated by the platonic representation hypothesis and positions the text embeddings as complementary semantic cues that precondition deeper layers.

Significance. If the gains are robust and attributable to cross-modal semantic transfer rather than implicit class supervision, the result would provide practical evidence supporting the platonic representation hypothesis and a low-overhead route to improve pure vision models using language resources without multimodal retraining. The public release of code and captions strengthens reproducibility and enables independent verification of the reported improvements.

major comments (1)

[§4] The analysis claims that TextTeacher acts as a feature-space preconditioner by supplying complementary semantic cues. However, no ablation is presented that masks class names in captions, replaces them with synonyms, or otherwise severs the correlation between caption content and target class labels while keeping the projection and training recipe fixed. This test is load-bearing for attributing the +2.7 p.p. ImageNet and +1.0 p.p. transfer gains to the stated cross-modal mechanism rather than soft label supervision.
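To make the requested ablation concrete, the class-name masking variant could look roughly like the sketch below. It is an illustration only: the helper name `mask_class_terms`, the placeholder word, and the simple plural handling are assumptions, and a real ablation would also need a synonym list per class (for example from WordNet) supplied by the experimenter.

```python
import re

def mask_class_terms(caption, class_terms, placeholder="object"):
    """Replace whole-word mentions of any class term with a generic word."""
    # Longest terms first so multi-word names win over their sub-words.
    terms = sorted((t for t in class_terms if t), key=len, reverse=True)
    if not terms:
        return caption
    alternatives = "|".join(re.escape(t) for t in terms)
    # Optional "s"/"es" catches simple plurals; irregular plurals are not handled.
    pattern = r"\b(?:" + alternatives + r")(?:s|es)?\b"
    return re.sub(pattern, placeholder, caption, flags=re.IGNORECASE)

# Hypothetical caption and class terms, for illustration only.
masked = mask_class_terms(
    "A golden retriever plays with two Retrievers.",
    ["retriever", "golden retriever"],
)
```

Training with the masked captions while holding the projection and recipe fixed would then show how much of the gain survives once the caption no longer names the label.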

minor comments (3)

[§3] The total training loss, which combines the standard classification objective with the projected text term, should be written explicitly as an equation to clarify the weighting and optimization details.
[Abstract] Abstract and §5: The maximum +2.7 p.p. gain is reported without specifying the exact ViT variant, training schedule, or run that achieves it; adding this detail would improve precision.
[Experiments] While the project page supplies code and captions, the main text should briefly describe caption sourcing and any filtering steps to support full replication from the manuscript alone.
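For orientation on the first minor comment, one plausible way to write the combined objective is given below. It is an assumed form, not the manuscript's equation: an additive weighting is presumed, and the symbols $f_\theta$ (vision backbone), $h$ (classifier head), $g$ (frozen text encoder), $P_\phi$ (lightweight projection), and $c$ (caption) are introduced here for illustration. Only the weight $\lambda_t$ appears on this page, in the Figure 2 caption, and the $\alpha_{\text{adapt}}$ mechanism mentioned there is not modeled.

```latex
\mathcal{L}_{\text{total}}(x, y, c)
  = \mathcal{L}_{\text{CE}}\bigl(h(f_\theta(x)),\, y\bigr)
  + \lambda_t \, \mathcal{L}_{\text{text}}\bigl(f_\theta(x),\, P_\phi(g(c))\bigr)
```

Under this reading, gradients flow into $\theta$ and $\phi$ but not into $g$, and setting $\lambda_t = 0$ recovers the plain classification baseline.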

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting a key point about mechanism attribution. We address the major comment below and will incorporate the requested ablation in the revised manuscript.

Referee: [§4] The analysis claims that TextTeacher acts as a feature-space preconditioner by supplying complementary semantic cues. However, no ablation is presented that masks class names in captions, replaces them with synonyms, or otherwise severs the correlation between caption content and target class labels while keeping the projection and training recipe fixed. This test is load-bearing for attributing the +2.7 p.p. ImageNet and +1.0 p.p. transfer gains to the stated cross-modal mechanism rather than soft label supervision.

Authors: We agree that the manuscript does not contain an ablation that explicitly removes or replaces class-related information from the captions while holding the rest of the pipeline fixed. Section 4 presents evidence that TextTeacher shapes deeper-layer representations early in training and yields gains beyond what is observed with standard vision-only baselines or distillation. Nevertheless, the referee is correct that this leaves open the possibility that part of the benefit arises from implicit class supervision encoded in the captions rather than richer cross-modal semantics. To address this directly, we will add the suggested ablation in the revision: we will mask class names, replace them with synonyms, or substitute generic descriptors in the captions and re-run the ImageNet and transfer experiments under identical training settings. The results will be reported alongside the existing analysis.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with external components

The paper introduces TextTeacher as an auxiliary training objective that injects embeddings from a frozen external text encoder (pre-trained independently) and publicly available image captions into ViT training. The reported gains (+2.7 p.p. ImageNet accuracy, +1.0 p.p. transfer) are presented as measured experimental outcomes under fixed recipes, not as quantities derived by construction from fitted parameters or self-referential definitions. No equations reduce the performance claims to tautological fits, and the central mechanism relies on independent, verifiable components (projection layer, auxiliary loss) rather than load-bearing self-citations or ansatzes smuggled from prior author work. The derivation chain is self-contained against external benchmarks such as standard supervised ViT training and vision distillation baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the platonic representation hypothesis as motivation and on the availability of image captions plus a frozen external text encoder; the lightweight projection is a trainable component whose exact parameterization is not detailed in the abstract.

free parameters (1)

lightweight projection
Small trainable layer that maps text embeddings into the vision feature space; its weights are fitted during the auxiliary training.

axioms (1)

domain assumption: Platonic representation hypothesis. Sufficiently large models converge to a shared representation geometry across modalities.
Explicitly stated as the motivating premise in the abstract.

invented entities (1)

semantic anchors (no independent evidence)
purpose: Additional training-time signals derived from text embeddings to guide vision representations
Introduced as the output of the text encoder plus projection; no independent falsifiable prediction is given in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors... auxiliary CLIP-style contrastive loss... acts as a feature-space preconditioner, shaping deeper layers in the first stages of training

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 10 internal anchors

[1]

A survey on lexical ambiguity detection and word sense disambiguation

Miuru Abeysiriwardana and Deshan Sumanathilaka. A survey on lexical ambiguity detection and word sense disambiguation. 2024. doi:10.48550/ARXIV.2403.16129

[2]

Label-embedding for image classification

Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1425-1438, 2015. ISSN 2160-9292. doi:10.1109/tpami.2015.2487986

[3]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bau...

doi:10.48550/arxiv.2407.07726, 2024
[4]

Multimodal datasets: misogyny, pornography, and malignant stereotypes

Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. October 2021. doi:10.48550/ARXIV.2110.01963

[5]

Addressing uncertainty in the safety assurance of machine-learning

Simon Burton and Benjamin Herd. Addressing uncertainty in the safety assurance of machine-learning. Frontiers in Computer Science, 5, 2023. ISSN 2624-9898. doi:10.3389/fcomp.2023.1132580

[6]

Isotropy in the contextual embedding space: Clusters and manifolds

Xingyu Cai, Jiaji Huang, Yuchen Bian, and Kenneth Church. Isotropy in the contextual embedding space: Clusters and manifolds. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=xYGNO86OWDH

[7]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (eds.), Computer Vision - ECCV 2020, pp. 213-229, Cham, 2020. Springer International Publishing. ISBN 978-3-030-58452-8

[8]

Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV). arXiv, 2021. doi:10.48550/ARXIV.2104.14294

[9]

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pp. 1597-1607. PMLR, 2020. URL http://proceedings.mlr....

[10]

When vision transformers outperform resnets without pre-training or strong data augmentations

Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations. In International Conference on Learning Representations, 2022

[11]

Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation strategies from data. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 113-123, 2019

[12]

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009. doi:10.1109/cvpr.2009.5206848

[13]

Virtex: Learning visual representations from textual annotations

Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11157-11168, 2021. doi:10.1109/CVPR46437.2021.01101

[14]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volu...

doi:10.18653/v1/n19-1423, 2019
[15]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, ...

[16]

Xcit: Cross-covariance image transformers

Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jegou. Xcit: Cross-covariance image transformers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. doi:1...

doi:10.48550/arxiv.2106.09681, 2021
[17]

Compressing visual-linguistic model via knowledge distillation

Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lijuan Wang, Yezhou Yang, and Zicheng Liu. Compressing visual-linguistic model via knowledge distillation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1408-1418, 2021. doi:10.1109/ICCV48922.2021.00146

[18]

Caption supervision enables robust learners

Benjamin Feuer, Ameya Joshi, and Chinmay Hegde. Caption supervision enables robust learners. 2022. doi:10.48550/ARXIV.2210.07396

[19]

Devise: A deep visual-semantic embedding model

Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neu...

[20]

Rich feature hierarchies for accurate object detection and semantic segmentation

Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014

[21]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, and A...

doi:10.48550/arxiv.2407.21783, 2024
[22]

Crossmodal knowledge distillation with wordnet-relaxed text embeddings for robust image classification

Chenqi Guo, Mengshuo Rong, Qianli Feng, Rongfan Feng, and Yinglong Ma. Crossmodal knowledge distillation with wordnet-relaxed text embeddings for robust image classification. 2025. doi:10.48550/ARXIV.2503.24017

[23]

Polysemy—evidence from linguistics, behavioral science, and contextualized language models

Janosch Haber and Massimo Poesio. Polysemy—evidence from linguistics, behavioral science, and contextualized language models. Computational Linguistics, 50(1):351-417, 2024. ISSN 1530-9312. doi:10.1162/coli_a_00500

[24]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016

[25]

Mask R-CNN

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pp. 2980-2988. arXiv, 2017. ISBN 978-1-5386-1032-9. doi:10.48550/ARXIV.1703.06870

[26]

Masked image pretraining on language assisted representation

Zejiang Hou and Sun-Yuan Kung. Masked image pretraining on language assisted representation. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025. doi:10.1109/ICASSP49660.2025.10888259

[27]

Position: The platonic representation hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learn...

[28]

Vl2lite: Task-specific knowledge distillation from large vision-language models to lightweight networks

Jinseong Jang, Chunfei Ma, and Byeongwon Lee. Vl2lite: Task-specific knowledge distillation from large vision-language models to lightweight networks, 2025

[29]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine...

[30]

Combining Weakly and Webly Supervised Learning for Classifying Food Images

Parneet Kaur, Karan Sikka, and Ajay Divakaran. Combining weakly and webly supervised learning for classifying food images. 2017. doi:10.48550/ARXIV.1712.08730

[31]

TIPS: Text-image pretraining with spatial awareness

Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, and Andre Araujo. TIPS: Text-image pretraining with spatial awareness. In The Thirteenth International Conference on Learning Representations, 2025

[32]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3519-3529. PMLR, 09-15 Jun 2019. URL http://p...

[33]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013

[34]

Nv-embed: Improved techniques for training llms as generalist embedding models

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), International Conference on Learning Representations, volume 2025, pp. 79310-79333. arXiv, 2025

[35]

Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios

Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, and Xin Pan. Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios. 2022. doi:10.48550/ARXIV.2207.05501

[36]

Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23, 2023

[37]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 34892-34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce328891...

[38]

Asymmetric visual semantic embedding framework for efficient vision-language alignment

Yang Liu, Mengyuan Liu, Shudong Huang, and Jiancheng Lv. Asymmetric visual semantic embedding framework for efficient vision-language alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6):5676-5684, April 2025. ISSN 2159-5399. doi:10.1609/aaai.v39i6.32605

[39]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992-10002, Los Alamitos, CA, USA, 10 2021. IEEE Computer Society. doi:10.1109/ICCV48922.2021.00986

[40]

Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm

Wenxuan Ma, Shuang Li, JinMing Zhang, Chi Harold Liu, Jingxuan Kang, Yulin Wang, and Gao Huang. Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 18786-18797, October 2023

[41]

S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013

[42]

Slip: Self-supervision meets language-image pre-training

Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision - ECCV 2022, pp. 529-544, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19809-0

[43]

Systematic literature review of validation methods for AI systems

Lalli Myllyaho, Mikko Raatikainen, Tomi Männistö, Tommi Mikkonen, and Jukka K. Nurminen. Systematic literature review of validation methods for AI systems. Journal of Systems and Software, 181:111050, 2021. ISSN 0164-1212. doi:10.1016/j.jss.2021.111050

[44]

Lakshmi Nair. Clip-embed-kd: Computationally efficient knowledge distillation using embeddings as teachers. Extended abstract: 28th IEEE High Performance Extreme Computing Conference (HPEC) 2024 - Outstanding short paper award, 2024
[45]

Tobias Christian Nauen, Sebastian Palacio, Federico Raue, and Andreas Dengel. Which transformer to favor: A comparative analysis of efficiency in vision transformers. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pp. 6955–6966, February 2025
[46]

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008
[47]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Pat... 2024
[48]

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012
[49]

Martin Rabe, Stefan Milz, and Patrick Mader. Development methodologies for safety critical machine learning applications in the automotive domain: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 129–141, June 2021
[50]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018
[51]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://openai.com/blog/better-language-models/
[52]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine L... 2021
[53]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020
[54]

Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. In Yvette Graham and Matthew Purver (eds.), Findings of the Association for Computational Linguistics: EACL 2024, pp. 868–874, St. Julian's, Malta, Marc... 2024. doi:10.18653/v1/2024.findings-eacl.58
[55]

Edward Sanderson and Bogdan J. Matuszewski. FCN-Transformer Feature Fusion for Polyp Segmentation, pp. 892–907. Springer International Publishing, 2022. ISBN 9783031120534. doi:10.1007/978-3-031-12053-4_65
[56]

Ajinkya Tejankar, Maziar Sanjabi, Bichen Wu, Saining Xie, Madian Khabsa, Hamed Pirsiavash, and Hamed Firooz. A fistful of words: Learning transferable visual models from bag-of-words supervision, 2022. URL https://arxiv.org/abs/2112.13884
[57]

Rahul Thapa, Kezhen Chen, Ian Covert, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, and James Zou. Dragonfly: Multi-resolution zoom-in encoding enhances vision-language models. 2024. doi:10.48550/ARXIV.2406.00977
[58]

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6827–6839. Curran Associates, Inc., 2020
[59]

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10347–... 2021
[60]

Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision – ECCV 2022, pp. 516–533, Cham, 2022. Springer Nature Switzerland
[61]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. 2023. doi:10.48550/ARXIV.2302.13971
[62]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
[63]

Ioannis Vezakis, Konstantinos Georgas, Dimitrios Fotiadis, and George Matsopoulos. Effisegnet: Gastrointestinal polyp segmentation through a pre-trained efficientnet-based network with a simplified decoder. In 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1–4, July 2024. doi:10.1109/embc53108.2024.10782015
[64]

C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Caltech-ucsd birds 200. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011

[65]

Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, and Yu Qiao. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14408–14419, 2023
[66]

Xin Wang and Wenwu Zhu. Advances in neural architecture search. National Science Review, 11(8), 2024. ISSN 2053-714X. doi:10.1093/nsr/nwae282
[67]

Yinlena Xu, Silverio Martínez-Fernández, Matias Martinez, and Xavier Franch. Energy efficiency of training neural network architectures: An empirical study. In Proceedings of the 56th Hawaii International Conference on System Sciences, HICSS. Hawaii International Conference on System Sciences, 2023. doi:10.24251/hicss.2023.098
[68]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 technical report. 2025. doi:10.48550/arxiv.2505.09388
[69]

Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. Clip-kd: An empirical study of clip model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
[70]

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
[71]

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. In International Conference on Learning Representations, 2020. doi:10.48550/arxiv.1904.00962. URL https://openreview.net/forum?id=Syx4wnEtvH
[72]

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. doi:10.48550/arxiv.2205.01917. URL https://openreview.net/forum?id=Ee277P3AYC
[73]

Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. CutMix: Regularization strategy to train strong classifiers with localizable features. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019. doi:10.1109/iccv.2019.00612
[74]

Qingtian Zeng, Jian Sun, and Shansong Wang. Dic-transformer: interpretation of plant disease classification results using image caption generation technology. Frontiers in Plant Science, 14, 2024. ISSN 1664-462X. doi:10.3389/fpls.2023.1273029
[75]

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb
[76]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. 2025. doi:10.48550/ARXIV.2506.05176
[77]

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020

[38] [38]

Asymmetric visual semantic embedding framework for efficient vision-language alignment

Yang Liu, Mengyuan Liu, Shudong Huang, and Jiancheng Lv. Asymmetric visual semantic embedding framework for efficient vision-language alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 39 0 (6): 0 5676--5684, April 2025. ISSN 2159-5399. doi:10.1609/aaai.v39i6.32605

work page doi:10.1609/aaai.v39i6.32605 2025

[39] [39]

Caron, H

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 9992--10002, Los Alamitos, CA, USA, 10 2021. IEEE Computer Society. doi:10.1109/ICCV48922.2021.00986

work page doi:10.1109/iccv48922.2021.00986 2021

[40] [40]

Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm

Wenxuan Ma, Shuang Li, JinMing Zhang, Chi Harold Liu, Jingxuan Kang, Yulin Wang, and Gao Huang. Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 18786--18797, October 2023

work page 2023

[41] [41]

S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013

work page 2013

[42] [42]

Slip: Self-supervision meets language-image pre-training

Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In Shai Avidan, Gabriel Brostow, Moustapha Ciss \'e , Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision -- ECCV 2022, pp.\ 529--544, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19809-0

work page 2022

[43] [43]

Nurminen

Lalli Myllyaho, Mikko Raatikainen, Tomi Männistö, Tommi Mikkonen, and Jukka K. Nurminen. Systematic literature review of validation methods for ai systems. Journal of Systems and Software, 181: 0 111050, 2021. ISSN 0164-1212. doi:10.1016/j.jss.2021.111050

work page doi:10.1016/j.jss.2021.111050 2021

[44] [44]

Clip-embed-kd: Computationally efficient knowledge distillation using embeddings as teachers

Lakshmi Nair. Clip-embed-kd: Computationally efficient knowledge distillation using embeddings as teachers. Extended abstract: 28th IEEE High Performance Extreme Computing Conference (HPEC) 2024 - Outstanding short paper award, 2024

work page 2024

[45] [45]

Which transformer to favor: A comparative analysis of efficiency in vision transformers

Tobias Christian Nauen, Sebastian Palacio, Federico Raue, and Andreas Dengel. Which transformer to favor: A comparative analysis of efficiency in vision transformers. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pp.\ 6955--6966, February 2025

work page 2025

[46] [46]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008

work page 2008

[47] [47]

Maxime Oquab, Timoth \'e e Darcet, Th \'e o Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Pat...

work page 2024

[48] [48]

Parkhi, Andrea Vedaldi, Andrew Zisserman, and C

Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012

work page 2012

[49] [49]

Development methodologies for safety critical machine learning applications in the automotive domain: A survey

Martin Rabe, Stefan Milz, and Patrick Mader. Development methodologies for safety critical machine learning applications in the automotive domain: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp.\ 129--141, June 2021

work page 2021

[50] [50]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

work page 2018

[51] [51]

Language models are unsupervised multitask learners

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://openai.com/blog/better-language-models/

work page 2019

[52] [52]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine L...

work page 2021

[53] [53]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21 0 (140): 0 1--67, 2020

work page 2020

[54] [54]

The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models

Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. In Yvette Graham and Matthew Purver (eds.), Findings of the Association for Computational Linguistics: EACL 2024, pp.\ 868--874, St. Julian ' s, Malta, Marc...

doi:10.18653/v1/2024.findings-eacl.58

[55]

Edward Sanderson and Bogdan J. Matuszewski. FCN-Transformer Feature Fusion for Polyp Segmentation, pp. 892--907. Springer International Publishing, 2022. ISBN 9783031120534. doi:10.1007/978-3-031-12053-4_65

[56]

Ajinkya Tejankar, Maziar Sanjabi, Bichen Wu, Saining Xie, Madian Khabsa, Hamed Pirsiavash, and Hamed Firooz. A fistful of words: Learning transferable visual models from bag-of-words supervision, 2022. URL https://arxiv.org/abs/2112.13884

[57]

Rahul Thapa, Kezhen Chen, Ian Covert, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, and James Zou. Dragonfly: Multi-resolution zoom-in encoding enhances vision-language models. 2024. doi:10.48550/ARXIV.2406.00977

[58]

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6827--6839. Curran Associates, Inc., 2020

[59]

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10347--...

[60]

Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: Revenge of the ViT. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision -- ECCV 2022, pp. 516--533, Cham, 2022. Springer Nature Switzerland

[61]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. 2023. doi:10.48550/ARXIV.2302.13971

[62]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

[63]

Ioannis Vezakis, Konstantinos Georgas, Dimitrios Fotiadis, and George Matsopoulos. Effisegnet: Gastrointestinal polyp segmentation through a pre-trained efficientnet-based network with a simplified decoder. In 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1--4, 07 2024. doi:10.1109/EMBC53108...

doi:10.1109/embc53108.2024.10782015

[64]

C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Caltech-ucsd birds 200. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011

[65]

Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, and Yu Qiao. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14408--14419, 2023

[66]

Xin Wang and Wenwu Zhu. Advances in neural architecture search. National Science Review, 11(8), 2024. ISSN 2053-714X. doi:10.1093/nsr/nwae282

[67]

Yinlena Xu, Silverio Martínez-Fernández, Matias Martinez, and Xavier Franch. Energy efficiency of training neural network architectures: An empirical study. In Proceedings of the 56th Hawaii International Conference on System Sciences, HICSS. Hawaii International Conference on System Sciences, 2023. doi:10.24251/hicss.2023.098

[68]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

doi:10.48550/arxiv.2505.09388

[69]

Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. Clip-kd: An empirical study of clip model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

[70]

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

[71]

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. In International Conference on Learning Representations, 2020. doi:10.48550/arxiv.1904.00962. URL https://openreview.net/forum?id=Syx4wnEtvH

[72]

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. doi:10.48550/arxiv.2205.01917. URL https://openreview.net/forum?id=Ee277P3AYC

[73]

Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. CutMix: Regularization strategy to train strong classifiers with localizable features. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019. doi:10.1109/iccv.2019.00612

[74]

Qingtian Zeng, Jian Sun, and Shansong Wang. Dic-transformer: interpretation of plant disease classification results using image caption generation technology. Frontiers in Plant Science, 14, 2024. ISSN 1664-462X. doi:10.3389/fpls.2023.1273029

[75]

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb

[76]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. 2025. doi:10.48550/ARXIV.2506.05176

[77]

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020
