Recognition: no theorem link
Sigmoid Loss for Language Image Pre-Training
Pith reviewed 2026-05-16 13:00 UTC · model grok-4.3
The pith
Combined with locked-image tuning, a pairwise sigmoid loss for image-text pre-training achieves 84.5% zero-shot ImageNet accuracy using only four TPUv4 chips in two days.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization, simultaneously allowing further scaling up of the batch size while also performing better at smaller batch sizes.
What carries the argument
The pairwise sigmoid loss, which applies a sigmoid activation to the dot product of image and text embeddings for each pair independently.
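For concreteness, a minimal sketch of such a pairwise loss in JAX. Function and variable names are illustrative, not the authors' big_vision implementation; the structure follows the description above, with each image-text pair contributing an independent log-sigmoid term, positives on the diagonal and negatives off it.

```python
# Minimal sketch of the pairwise sigmoid loss described above (illustrative,
# not the authors' big_vision code).
import jax
import jax.numpy as jnp

def pairwise_sigmoid_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: [n, d] L2-normalized embeddings of n matched pairs."""
    n = img_emb.shape[0]
    logits = t * (img_emb @ txt_emb.T) + b          # [n, n] pairwise similarities
    labels = 2.0 * jnp.eye(n) - 1.0                 # +1 on the diagonal, -1 off it
    # Each pair contributes an independent binary term: no softmax over the
    # batch, hence no global view of the similarity matrix is required.
    return -jax.nn.log_sigmoid(labels * logits).sum() / n

# Toy usage with random, L2-normalized embeddings (the scale and bias values
# here are only for the example call).
key_i, key_t = jax.random.split(jax.random.PRNGKey(0))
img = jax.random.normal(key_i, (8, 16))
txt = jax.random.normal(key_t, (8, 16))
img = img / jnp.linalg.norm(img, axis=-1, keepdims=True)
txt = txt / jnp.linalg.norm(txt, axis=-1, keepdims=True)
print(pairwise_sigmoid_loss(img, txt, t=10.0, b=-10.0))
```

Because the sum decomposes over pairs, the same function applies unchanged whether the batch holds a few hundred pairs or a million.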
If this is right
- Training becomes possible with extremely large batch sizes up to one million without issues from global normalization.
- A moderate batch size of 32k provides most of the benefits, making training more practical.
- The loss allows independent control over the number of examples and the negative-to-positive ratio (see the sketch after this list).
- High zero-shot performance is achievable with minimal hardware resources when combined with locked-image tuning.
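On the negative-to-positive ratio: because the loss is a sum of independent per-pair terms, the ratio can be varied simply by masking a subset of the off-diagonal pairs. A hedged sketch of one such masking scheme (illustrative only, not the paper's ablation code), with the same loss shape as the sketch above:

```python
import jax
import jax.numpy as jnp

def masked_pairwise_sigmoid_loss(img_emb, txt_emb, t, b, key, neg_keep_prob):
    """Keep every positive pair but only a random fraction of the negatives,
    so the negative:positive ratio is roughly neg_keep_prob * (n - 1)."""
    n = img_emb.shape[0]
    logits = t * (img_emb @ txt_emb.T) + b
    labels = 2.0 * jnp.eye(n) - 1.0
    keep_negatives = jax.random.bernoulli(key, neg_keep_prob, (n, n))
    mask = jnp.where(jnp.eye(n, dtype=bool), True, keep_negatives)
    return -(mask * jax.nn.log_sigmoid(labels * logits)).sum() / n
```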
Where Pith is reading between the lines
- Pre-training could become accessible on smaller compute budgets or even single machines with further optimizations.
- The method might apply to other contrastive learning setups beyond vision-language models.
- Future work could explore even larger scales or different modalities using the same loss structure.
Load-bearing premise
The sigmoid loss will keep producing high-quality representations at new scales or on new data without needing hyper-parameter adjustments.
What would settle it
Training a larger SigLIP model on a new dataset with fixed hyperparameters and observing substantially worse zero-shot accuracy than a comparable softmax contrastive model would falsify the claim.
Original abstract
We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We release our models at https://github.com/google-research/big_vision and hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
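One consequence the abstract spells out: because no global normalization is needed, the loss can be accumulated over chunks of the similarity matrix, which is what makes batch sizes toward one million tractable. A single-host sketch of that accumulation (illustrative only; the paper describes a device-sharded implementation rather than the loop below):

```python
import jax
import jax.numpy as jnp

def chunked_pairwise_sigmoid_loss(img_emb, txt_emb, t, b, chunk=256):
    """Accumulate per-pair terms one block of text embeddings at a time, so
    only an [n, chunk] slice of the similarity matrix exists at any moment."""
    n = img_emb.shape[0]
    total = 0.0
    for start in range(0, n, chunk):
        txt_blk = txt_emb[start:start + chunk]
        logits = t * (img_emb @ txt_blk.T) + b
        cols = start + jnp.arange(txt_blk.shape[0])
        labels = 2.0 * (jnp.arange(n)[:, None] == cols[None, :]) - 1.0
        total = total - jax.nn.log_sigmoid(labels * logits).sum()
    return total / n
```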
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a pairwise sigmoid loss (SigLIP) for language-image pre-training that operates directly on image-text pairs without requiring global softmax normalization over the batch. This design enables scaling batch sizes while also improving results at smaller batches. Combined with Locked-image Tuning, the authors report training a model to 84.5% ImageNet zero-shot accuracy using only four TPUv4 chips in two days. They further ablate the effects of batch size (up to 1M), examples versus pairs, and negative-to-positive ratios, concluding that benefits diminish beyond a 32k batch size.
Significance. If the reported accuracies and efficiency gains hold, the work is significant because it removes the dependence on large-batch normalization that has constrained contrastive vision-language training since CLIP. The ability to train competitive models with modest hardware (four TPUv4 chips) and the public release of models at https://github.com/google-research/big_vision both lower the barrier to entry and support reproducibility. The batch-size scaling study also provides concrete guidance on practical operating points.
major comments (2)
- [Abstract] The headline result of 84.5% ImageNet zero-shot accuracy with SigLiT on four TPUv4 chips in two days is load-bearing for the efficiency claim, yet the manuscript provides no accompanying table or section detailing the exact model size, training dataset, number of steps, or direct LiT baseline comparison under the identical four-chip budget; without these, the contribution attributable to the sigmoid loss versus other factors cannot be isolated.
- [Abstract] The paper states that the sigmoid loss 'performs better at smaller batch sizes' and 'allows further scaling up the batch size,' but the provided ablations stop at the authors' chosen regimes; there is no cross-model-size or cross-dataset experiment demonstrating that the sigmoid scale hyper-parameter transfers without retuning, which directly tests the weakest assumption that the loss remains effective when batch-wide normalization is removed.
minor comments (1)
- The GitHub release is welcome, but the manuscript should explicitly state whether the training scripts, exact hyper-parameters, and data-preprocessing pipelines used for the 84.5% result are included so that the two-day four-chip claim can be reproduced.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive comments on our work. We address each major comment below and will make revisions to enhance the clarity and completeness of the manuscript.
Point-by-point responses
- Referee: [Abstract] The headline result of 84.5% ImageNet zero-shot accuracy with SigLiT on four TPUv4 chips in two days is load-bearing for the efficiency claim, yet the manuscript provides no accompanying table or section detailing the exact model size, training dataset, number of steps, or direct LiT baseline comparison under the identical four-chip budget; without these, the contribution attributable to the sigmoid loss versus other factors cannot be isolated.
  Authors: We agree that the abstract's efficiency claim requires supporting details to allow isolation of the sigmoid loss contribution. In the revised manuscript we will add a dedicated table (or subsection) that specifies the exact model size, training dataset, number of steps, and a direct LiT baseline comparison trained under the identical four TPUv4-chip, two-day budget. revision: yes
- Referee: [Abstract] The paper states that the sigmoid loss 'performs better at smaller batch sizes' and 'allows further scaling up the batch size,' but the provided ablations stop at the authors' chosen regimes; there is no cross-model-size or cross-dataset experiment demonstrating that the sigmoid scale hyper-parameter transfers without retuning, which directly tests the weakest assumption that the loss remains effective when batch-wide normalization is removed.
  Authors: The sigmoid scale hyper-parameter was held fixed at the same value across the entire set of batch-size ablations (from small batches through 1M). Because the same fixed value was used without retuning while still showing gains at smaller batches and continued (though diminishing) benefits at larger batches, the experiments already provide evidence that the loss remains effective once batch-wide normalization is removed. We will revise the text to explicitly note that the scale was not retuned and to discuss this as supporting robustness. Additional cross-model or cross-dataset sweeps of the scale parameter lie outside the current scope. revision: partial
Circularity Check
No circularity: empirical proposal and validation of pairwise sigmoid loss
full rationale
The paper defines a new sigmoid loss directly on image-text pairs without softmax normalization over the batch, then reports results from training SigLiT models on standard datasets and measuring zero-shot ImageNet accuracy. No equations reduce the reported accuracies or scaling claims back to fitted parameters by construction, and the work contains no load-bearing self-citations, uniqueness theorems, or ansatzes smuggled from prior author work. The derivation chain is self-contained because performance is obtained through explicit training runs rather than any algebraic or statistical reduction to the input assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- sigmoid scale parameter (see the sketch after this ledger)
axioms (1)
- domain assumption: Image-text pairs provide sufficient supervision without requiring global batch statistics for normalization
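For concreteness on the single free parameter above: the paper treats the sigmoid scale as a learnable log-temperature (reported initialization log 10) alongside a learnable bias (reported initialization -10) rather than a per-run tuned constant. A minimal sketch of that parameterization, assuming those reported values; the names below are illustrative.

```python
import jax.numpy as jnp

# Learnable scale and bias, initialized as reported in the paper: the scale
# starts at 10 (stored as its log) and the bias starts strongly negative,
# which keeps the many negative pairs from dominating the loss early on.
log_t_init = jnp.log(jnp.asarray(10.0))
b_init = jnp.asarray(-10.0)

def scaled_logits(img_emb, txt_emb, log_t, b):
    return jnp.exp(log_t) * (img_emb @ txt_emb.T) + b
```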
Forward citations
Cited by 20 Pith papers
- ASH: Agents that Self-Hone via Embodied Learning
  ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
  OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- Aligned Multi-View Scripts for Universal Chart-to-Code Generation
  Introduces an aligned multi-language dataset and a language-conditioned low-rank adapter for generating executable plotting code in Python, R, and LaTeX from chart images.
- RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking
  RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.
- Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
  Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
- Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
  Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
- MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL
  MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.
- Affordance Agent Harness: Verification-Gated Skill Orchestration
  Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
- Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
  MG-MTTA improves VLM accuracy under modality-specific shifts by replacing pure entropy minimization with majorization-guided adaptation that incorporates a reliability-aware gate prior.
- MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis
  MaMe is a differentiable matrix-only token merging method that doubles ViT-B throughput with a 2% accuracy drop on pre-trained models and enables faster, higher-quality image synthesis when paired with MaRe.
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
  IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.
- Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation
  A simulation-to-real navigation policy enables a quadrotor to locate an odor source using only basic olfaction sensors and optional vision, validated in indoor real-world flights.
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
  F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.
- CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis
  CropVLM is a domain-adapted vision-language model that achieves 72.51% zero-shot crop classification accuracy and superior open-set detection performance on novel species without retraining.
- Motif-Video 2B: Technical Report
  Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.
- FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views
  FF3R unifies geometric and semantic 3D reconstruction in a single annotation-free feed-forward network trained solely via RGB and feature rendering supervision.
- BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment
  BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...
- Kimi K2.5: Visual Agentic Intelligence
  Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
- Affordance Agent Harness: Verification-Gated Skill Orchestration
  Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...
- Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
  Zero-shot VLMs reach at most 62% accuracy on agricultural classification tasks while supervised models like YOLO11 perform markedly higher, indicating they are not ready to replace task-specific systems.