pith. machine review for the scientific record.

arxiv: 2303.15343 · v4 · submitted 2023-03-27 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Sigmoid Loss for Language Image Pre-Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 13:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords sigmoid loss · SigLIP · language-image pre-training · contrastive learning · zero-shot accuracy · batch size scaling

The pith

A pairwise sigmoid loss for image-text pre-training achieves 84.5% zero-shot ImageNet accuracy using only four TPU chips in two days.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a sigmoid loss that computes a loss for each image-text pair independently, with no need to normalize across the entire batch as standard softmax contrastive methods do. This design allows training with much larger batch sizes and also works well with smaller ones, decoupling the loss function from the batch size. When paired with locked-image tuning, the resulting SigLiT model reaches 84.5% zero-shot accuracy on ImageNet after training for two days on four TPUv4 chips. The decoupling also enables experiments that independently vary the number of examples, the number of pairs, and the negative-to-positive ratio. Tests up to a batch size of one million show that gains level off quickly, with 32k being sufficient.

Core claim

The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization, simultaneously allowing further scaling up of the batch size while also performing better at smaller batch sizes.

What carries the argument

The pairwise sigmoid loss, which applies a sigmoid activation to the dot product of image and text embeddings for each pair independently.
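The loss is compact enough to state in a few lines. The sketch below follows the description above (a per-pair sigmoid on scaled image-text similarities); the scale t and bias b, and the particular values used here, are assumptions of this illustration rather than details given in the review text.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss for a batch of N aligned image-text pairs.

    img_emb, txt_emb: (N, D) L2-normalized embeddings.
    t, b: scale and bias; treated as fixed here, though in SigLIP-style
    training they would be learnable (values assumed for illustration).
    """
    n = img_emb.shape[0]
    logits = t * img_emb @ txt_emb.T + b   # (N, N) pairwise similarities
    labels = 2.0 * np.eye(n) - 1.0         # +1 on the diagonal, -1 elsewhere
    # Each (i, j) entry is an independent binary problem: -log sigmoid(z)
    # for positives, -log sigmoid(-z) for negatives; no batch-wide softmax.
    pair_nll = np.log1p(np.exp(-labels * logits))
    return pair_nll.sum() / n              # normalized by examples, not pairs
```

Because every entry of the N×N grid is scored on its own, the loss decomposes into independent terms, which is what removes the need for a global normalizer.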

If this is right

  • Training becomes possible with extremely large batch sizes up to one million without issues from global normalization.
  • A moderate batch size of 32k provides most of the benefits, making training more practical.
  • The loss allows independent control over the number of examples and the negative-to-positive ratio.
  • High zero-shot performance is achievable with minimal hardware resources when combined with locked-image tuning.
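The decoupling claimed in the bullets above can be checked numerically: a positive pair's sigmoid term reads a single similarity entry, while a softmax (InfoNCE-style) term normalizes over every candidate in the batch, so only the latter changes when the batch grows. A toy comparison, using hypothetical random embeddings and an assumed scale and bias:

```python
import numpy as np

def softmax_term(sims, i, t=10.0):
    # InfoNCE image->text loss for pair i: the log-normalizer sums over
    # the whole row, so the term depends on every text in the batch.
    z = t * sims[i]
    return -(z[i] - np.log(np.exp(z).sum()))

def sigmoid_term(sims, i, t=10.0, b=-10.0):
    # Sigmoid loss for positive pair i: reads only the single entry (i, i).
    return np.log1p(np.exp(-(t * sims[i, i] + b)))

rng = np.random.default_rng(1)
emb = rng.normal(size=(8, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
small = emb[:4] @ emb[:4].T   # the same 4 pairs inside a batch of 4...
large = emb @ emb.T           # ...and inside a batch of 8

# Pair 0's sigmoid term is unchanged by the batch; the softmax term is not.
assert np.isclose(sigmoid_term(small, 0), sigmoid_term(large, 0))
assert not np.isclose(softmax_term(small, 0), softmax_term(large, 0))
```

Since each sigmoid term touches only one similarity entry, very large batches become easier to handle in practice, whereas the softmax term needs the full row of similarities before it can be evaluated.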

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-training could become accessible on smaller compute budgets or even single machines with further optimizations.
  • The method might apply to other contrastive learning setups beyond vision-language models.
  • Future work could explore even larger scales or different modalities using the same loss structure.

Load-bearing premise

The sigmoid loss will keep producing high-quality representations at new scales or on new data without needing hyper-parameter adjustments.

What would settle it

Training a larger SigLIP model on a new dataset with fixed hyperparameters and observing substantially worse zero-shot accuracy than a comparable softmax contrastive model would falsify the claim.

read the original abstract

We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We release our models at https://github.com/google-research/big_vision and hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a pairwise sigmoid loss (SigLIP) for language-image pre-training that operates directly on image-text pairs without requiring global softmax normalization over the batch. This design enables scaling batch sizes while also improving results at smaller batches. Combined with Locked-image Tuning, the authors report training a model to 84.5% ImageNet zero-shot accuracy using only four TPUv4 chips in two days. They further ablate the effects of batch size (up to 1M), examples versus pairs, and negative-to-positive ratios, concluding that benefits diminish beyond a 32k batch size.

Significance. If the reported accuracies and efficiency gains hold, the work is significant because it removes the dependence on large-batch normalization that has constrained contrastive vision-language training since CLIP. The ability to train competitive models with modest hardware (four TPUv4 chips) and the public release of models at https://github.com/google-research/big_vision both lower the barrier to entry and support reproducibility. The batch-size scaling study also provides concrete guidance on practical operating points.

major comments (2)
  1. [Abstract] The headline result of 84.5% ImageNet zero-shot accuracy with SigLiT on four TPUv4 chips in two days is load-bearing for the efficiency claim, yet the manuscript provides no accompanying table or section detailing the exact model size, training dataset, number of steps, or direct LiT baseline comparison under the identical four-chip budget; without these, the contribution attributable to the sigmoid loss versus other factors cannot be isolated.
  2. [Abstract] The paper states that the sigmoid loss 'performs better at smaller batch sizes' and 'allows further scaling up the batch size,' but the provided ablations stop at the authors' chosen regimes; there is no cross-model-size or cross-dataset experiment demonstrating that the sigmoid scale hyper-parameter transfers without retuning, which directly tests the weakest assumption that the loss remains effective when batch-wide normalization is removed.
minor comments (1)
  1. The GitHub release is welcome, but the manuscript should explicitly state whether the training scripts, exact hyper-parameters, and data-preprocessing pipelines used for the 84.5% result are included so that the two-day four-chip claim can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our work. We address each major comment below and will make revisions to enhance the clarity and completeness of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The headline result of 84.5% ImageNet zero-shot accuracy with SigLiT on four TPUv4 chips in two days is load-bearing for the efficiency claim, yet the manuscript provides no accompanying table or section detailing the exact model size, training dataset, number of steps, or direct LiT baseline comparison under the identical four-chip budget; without these, the contribution attributable to the sigmoid loss versus other factors cannot be isolated.

    Authors: We agree that the abstract's efficiency claim requires supporting details to allow isolation of the sigmoid loss contribution. In the revised manuscript we will add a dedicated table (or subsection) that specifies the exact model size, training dataset, number of steps, and a direct LiT baseline comparison trained under the identical four TPUv4-chip, two-day budget. revision: yes

  2. Referee: [Abstract] The paper states that the sigmoid loss 'performs better at smaller batch sizes' and 'allows further scaling up the batch size,' but the provided ablations stop at the authors' chosen regimes; there is no cross-model-size or cross-dataset experiment demonstrating that the sigmoid scale hyper-parameter transfers without retuning, which directly tests the weakest assumption that the loss remains effective when batch-wide normalization is removed.

    Authors: The sigmoid scale hyper-parameter was held fixed at the same value across the entire set of batch-size ablations (from small batches through 1M). Because the same fixed value was used without retuning while still showing gains at smaller batches and continued (though diminishing) benefits at larger batches, the experiments already provide evidence that the loss remains effective once batch-wide normalization is removed. We will revise the text to explicitly note that the scale was not retuned and to discuss this as evidence of robustness. Additional cross-model or cross-dataset sweeps of the scale parameter are beyond the scope of this work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical proposal and validation of pairwise sigmoid loss

full rationale

The paper defines a new sigmoid loss directly on image-text pairs without softmax normalization over the batch, then reports results from training SigLiT models on standard datasets and measuring zero-shot ImageNet accuracy. No equations reduce the reported accuracies or scaling claims back to fitted parameters by construction, and the work contains no load-bearing self-citations, uniqueness theorems, or ansatzes smuggled from prior author work. The derivation chain is self-contained because performance is obtained through explicit training runs rather than any algebraic or statistical reduction to the input assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the empirical effectiveness of the sigmoid loss under standard contrastive pre-training assumptions; no new physical entities or unstated mathematical axioms are introduced beyond the loss definition itself.

free parameters (1)
  • sigmoid scale parameter
    The loss formulation typically includes a learnable or fixed scaling factor analogous to temperature in contrastive losses; its value is not specified in the abstract.
axioms (1)
  • domain assumption: Image-text pairs provide sufficient supervision without requiring global batch statistics for normalization
    The loss is stated to operate solely on individual pairs.
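The imbalance implied by this axiom is what makes the scale (and bias) load-bearing: a batch of N pairs yields N positives against N(N-1) negatives. A small numerical check, assuming for illustration a log-parameterized scale initialized at 10 and a bias at -10 (these particular values are an assumption of this sketch, not stated in the abstract):

```python
import numpy as np

# With roughly uncorrelated embeddings at initialization, pairwise
# similarities sit near 0, so every logit starts near t * 0 + b = -10.
t_prime = np.log(10.0)          # log-parameterized scale (assumed init)
t = np.exp(t_prime)
b = -10.0                       # bias (assumed init)

n = 32_768                      # the 32k operating point discussed above
negatives_per_positive = n - 1  # each image faces n-1 mismatched texts

logit_at_init = t * 0.0 + b
neg_term = np.log1p(np.exp(logit_at_init))   # one negative pair's loss
pos_term = np.log1p(np.exp(-logit_at_init))  # one positive pair's loss

# Starting logits deep on the negative side keeps the overwhelming mass of
# negative terms near zero, so the imbalance does not swamp early training.
assert neg_term < 1e-4 and pos_term > 9.0
```

If this parameter had to be retuned at every new scale or dataset, the load-bearing premise above would weaken, which is why the referee's second comment targets exactly this point.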

pith-pipeline@v0.9.0 · 5486 in / 1184 out tokens · 31664 ms · 2026-05-16T13:00:19.151560+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ASH: Agents that Self-Hone via Embodied Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

  2. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  3. Aligned Multi-View Scripts for Universal Chart-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces an aligned multi-language dataset and a language-conditioned low-rank adapter for generating executable plotting code in Python, R, and LaTeX from chart images.

  4. RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

    cs.CV 2026-04 unverdicted novelty 7.0

    RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.

  5. Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

    cs.AI 2026-04 conditional novelty 7.0

    Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.

  6. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  7. MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

    cs.IR 2026-04 unverdicted novelty 7.0

    MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

  8. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  9. Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    cs.CV 2026-04 unverdicted novelty 6.0

    MG-MTTA improves VLM accuracy under modality-specific shifts by replacing pure entropy minimization with majorization-guided adaptation that incorporates a reliability-aware gate prior.

  10. MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    MaMe is a differentiable matrix-only token merging method that doubles ViT-B throughput with a 2% accuracy drop on pre-trained models and enables faster, higher-quality image synthesis when paired with MaRe.

  11. IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.

  12. Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation

    cs.RO 2026-02 unverdicted novelty 6.0

    A simulation-to-real navigation policy enables a quadrotor to locate an odor source using only basic olfaction sensors and optional vision, validated in indoor real-world flights.

  13. F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    cs.RO 2025-09 unverdicted novelty 6.0

    F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

  14. CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis

    cs.CV 2026-05 unverdicted novelty 5.0

    CropVLM is a domain-adapted vision-language model that achieves 72.51% zero-shot crop classification accuracy and superior open-set detection performance on novel species without retraining.

  15. Motif-Video 2B: Technical Report

    cs.CV 2026-04 unverdicted novelty 5.0

    Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

  16. FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views

    cs.CV 2026-04 unverdicted novelty 5.0

    FF3R unifies geometric and semantic 3D reconstruction in a single annotation-free feed-forward network trained solely via RGB and feature rendering supervision.

  17. BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

    cs.IR 2026-04 unverdicted novelty 5.0

    BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...

  18. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  19. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 4.0

    Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...

  20. Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

    cs.CV 2025-12 unverdicted novelty 4.0

    Zero-shot VLMs reach at most 62% accuracy on agricultural classification tasks while supervised models like YOLO11 perform markedly higher, indicating they are not ready to replace task-specific systems.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 19 Pith papers · 9 internal anchors

  1. [1]

    Getting vit in shape: Scaling laws for compute-optimal model design

    Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, and Lucas Beyer. Getting vit in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023. 7, 8, 17

  2. [2]

    ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019. 7, 17

  3. [3]

    Are we done with ImageNet?

    Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with ImageNet? CoRR, abs/2006.07159, 2020. 2, 7, 9, 17

  4. [4]

    Better plain ViT baselines for ImageNet-1k

    Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Better plain ViT baselines for ImageNet-1k, 2022. 10, 17

  5. [5]

    Big vision

    Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Big vision. https://github.com/google-research/big_vision, 2022. 10, 17

  6. [6]

    Coyo-700M: Image-text pair dataset

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022. 1

  7. [7]

    Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR,

  8. [8]

    VLP: A survey on vision-language pre-training

    Feilong Chen, Duzhen Zhang, Minglun Han, Xiu-Yi Chen, Jing Shi, Shuang Xu, and Bo Xu. VLP: A survey on vision-language pre-training. Int. J. Autom. Comput., 20(1):38–56,

  9. [9]

    Generative pre-training from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pre-training from pixels. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR, 2020. 8

  10. [10]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 2, 4

  11. [11]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015. 7, 17

  12. [12]

    Symbolic discovery of optimization algorithms

    Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms, 2023. 2, 6

  13. [13]

    Xi Chen, Xiao Wang, Soravit Changpinyo, A. J. Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Has- san Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Aya...

  14. [14]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 4, 7, 9, 17

  15. [15]

    Redcaps: Web-curated image-text data created by the people, for the people

    Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. Redcaps: Web-curated image-text data created by the people, for the people. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. 1

  16. [16]

    Clip itself is a strong fine-tuner: Achieving 85.7% and 88.0% top-1 accuracy with vit-b and vit-l on imagenet

    Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Shuyang Gu, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, and Nenghai Yu. Clip itself is a strong fine-tuner: Achieving 85.7% and 88.0% top-1 accuracy with vit-b and vit-l on imagenet. CoRR, abs/2212.06138, 2022. 2

  17. [17]

    An image is worth 16×16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021. 3, 7, 17

  18. [18]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. 1

  19. [19]

    Dimensionality reduction by learning an invariant mapping

    Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, 2006. 2

  20. [20]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15979–15988. IEEE, 2022. 8

  21. [21]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. CoRR, abs/2111.12233, 2021. 1, 2

  22. [22]

    OpenCLIP

    Gabriel Ilharco, Mitchell Wortsman, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP. Zenodo, 2021. 8

  23. [23]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 1, 2

  24. [24]

    Supervised contrastive learning

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural I...

  25. [25]

    Big transfer (BiT): General visual representation learning

    Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (BiT): General visual representation learning. In ECCV, 2020. 7

  26. [26]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Univ. of Toronto, 2009. 9

  27. [27]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, 2018. 5, 14

  28. [28]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore,...

  29. [29]

    CLIPA-v2: Scaling CLIP training with 81.1% zero-shot ImageNet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy

    Xianhang Li, Zeyu Wang, and Cihang Xie. CLIPA-v2: Scaling CLIP training with 81.1% zero-shot ImageNet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy. CoRR, abs/2306.15658, 2023. 8

  30. [30]

    Scaling language-image pre-training via masking

    Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. CoRR, abs/2212.00794, 2022. 2, 7

  31. [31]

    Simple open-vocabulary object detection

    Matthias Minderer, Alexey A. Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple open-vocabulary object detection. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and ...

  32. [32]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In danah boyd and Jamie H. Morgenstern, editors, Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 20...

  33. [33]

    Open vocabulary semantic segmentation with patch aligned contrastive learning

    Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah, Philip H. S. Torr, and Ser-Nam Lim. Open vocabulary semantic segmentation with patch aligned contrastive learning, 2022. 2

  34. [34]

    Cats and dogs

    Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012. 9

  35. [35]

    Combined scaling for zero-shot transfer learning

    Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu, Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and Quoc V. Le. Combined scaling for zero-shot transfer learning. CoRR, abs/2111.10050, 2021. 2, 4

  36. [36]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021. 1, 2, 3, 8

  37. [37]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019. 5, 14

  38. [38]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. 17

  39. [39]

    Do ImageNet classifiers generalize to ImageNet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In ICML, 2019. 7, 17

  40. [40]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: an open large-scale dataset for training next generation image-text model...

  41. [41]

    WIT: wikipedia-based image text dataset for multimodal multilingual machine learning

    Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. WIT: wikipedia-based image text dataset for multimodal multilingual machine learning. CoRR, abs/2103.01913, 2021. 1

  42. [42]

    How to train your ViT? Data, augmentation, and regularization in vision transformers

    Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. CoRR, abs/2106.10270, 2021. 1, 6, 7

  43. [43]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: improved training techniques for CLIP at scale. CoRR, abs/2303.15389, 2023. 8

  44. [44]

    Crossmodal-3600: A massively multilingual multimodal evaluation dataset

    Ashish V. Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022...

  45. [45]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023. 7

  46. [46]

    Representation Learning with Contrastive Predictive Coding

    Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018. 2

  47. [47]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 3, 17

  48. [48]

    NLLB-CLIP – train performant multilingual image retrieval model on a budget

    Alexander Visheratin. NLLB-CLIP – train performant multilingual image retrieval model on a budget, 2023. 7

  49. [49]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. CoRR, abs/2205.14100, 2022. 1, 2

  50. [50]

    Simvlm: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,

  51. [51]

    Resnet strikes back: An improved training procedure in timm

    Ross Wightman, Hugo Touvron, and Hervé Jégou. Resnet strikes back: An improved training procedure in timm. CoRR, abs/2110.00476, 2021. 2

  52. [52]

    Reaching 80% zero-shot accuracy with OpenCLIP: VIT-G/14 trained on LAION-2B

    Mitchell Wortsman. Reaching 80% zero-shot accuracy with OpenCLIP: VIT-G/14 trained on LAION-2B. https://web.archive.org/web/20230127012732/https://laion.ai/blog/giant-openclip/. 2

  53. [53]

    Robust fine-tuning of zero-shot models

    Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 79...

  54. [54]

    mT5: A massively multilingual pre-trained text-to- text transformer

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to- text transformer. In NAACL-HLT, 2021. 5, 17

  55. [55]

    Unified contrastive learning in image-text-label space

    Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, and Jianfeng Gao. Unified contrastive learning in image-text-label space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 19141–19151. IEEE, 2022. 2

  56. [56]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. CoRR, abs/2205.01917, 2022. 2

  57. [57]

    Florence: A New Foundation Model for Computer Vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: A new foundation model for computer vision. CoRR, abs/2...

  58. [58]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. CVPR, 2022. 1, 4, 6, 7, 14

  59. [59]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 18102–18112. IEEE, 2022. 1, 2, 3, 4, 6, 7, 14

  60. [60]

    Contrastive learning of medical visual representations from paired images and text

    Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. Contrastive learning of medical visual representations from paired images and text. In Zachary C. Lipton, Rajesh Ranganath, Mark P. Sendak, Michael W. Sjoding, and Serena Yeung, editors, Proceedings of the Machine Learning for Healthcare Conference, MLHC 2022, 5-6 A...