pith. machine review for the scientific record.

arxiv: 2303.15343 · v4 · submitted 2023-03-27 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Sigmoid Loss for Language Image Pre-Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 13:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords sigmoid loss · SigLIP · language-image pre-training · contrastive learning · zero-shot accuracy · batch size scaling

The pith

A pairwise sigmoid loss for image-text pre-training achieves 84.5% zero-shot ImageNet accuracy using only four TPU chips in two days.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a sigmoid loss that computes a loss for each image-text pair independently, with no need to normalize across the entire batch as standard softmax contrastive methods do. This design allows training with much larger batch sizes and also works well with smaller ones, decoupling the loss function from the batch size. When paired with locked-image tuning, the resulting SigLiT model reaches 84.5% zero-shot accuracy on ImageNet after training for two days on four TPUv4 chips. The decoupling also enables experiments that independently vary the number of examples, the number of pairs, and the negative-to-positive ratio. Tests up to a batch size of one million show that gains level off quickly, with 32k being sufficient.

Core claim

The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization, simultaneously allowing further scaling up of the batch size while also performing better at smaller batch sizes.

What carries the argument

The pairwise sigmoid loss, which applies a sigmoid activation to the dot product of image and text embeddings for each pair independently.
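The loss is compact enough to state in a few lines. The sketch below follows the description above (a per-pair sigmoid on scaled image-text similarities); the scale t and bias b, and the particular values used here, are assumptions of this illustration rather than details given in the review text.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss for a batch of N aligned image-text pairs.

    img_emb, txt_emb: (N, D) L2-normalized embeddings.
    t, b: scale and bias; treated as fixed here, though in SigLIP-style
    training they would be learnable (values assumed for illustration).
    """
    n = img_emb.shape[0]
    logits = t * img_emb @ txt_emb.T + b   # (N, N) pairwise similarities
    labels = 2.0 * np.eye(n) - 1.0         # +1 on the diagonal, -1 elsewhere
    # Each (i, j) entry is an independent binary problem: -log sigmoid(z)
    # for positives, -log sigmoid(-z) for negatives; no batch-wide softmax.
    pair_nll = np.log1p(np.exp(-labels * logits))
    return pair_nll.sum() / n              # normalized by examples, not pairs
```

Because every entry of the N×N grid is scored on its own, the loss decomposes into independent terms, which is what removes the need for a global normalizer.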

If this is right

  • Training becomes possible with extremely large batch sizes up to one million without issues from global normalization.
  • A moderate batch size of 32k provides most of the benefits, making training more practical.
  • The loss allows independent control over the number of examples and the negative-to-positive ratio.
  • High zero-shot performance is achievable with minimal hardware resources when combined with locked-image tuning.
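The decoupling claimed in the bullets above can be checked numerically: a positive pair's sigmoid term reads a single similarity entry, while a softmax (InfoNCE-style) term normalizes over every candidate in the batch, so only the latter changes when the batch grows. A toy comparison, using hypothetical random embeddings and an assumed scale and bias:

```python
import numpy as np

def softmax_term(sims, i, t=10.0):
    # InfoNCE image->text loss for pair i: the log-normalizer sums over
    # the whole row, so the term depends on every text in the batch.
    z = t * sims[i]
    return -(z[i] - np.log(np.exp(z).sum()))

def sigmoid_term(sims, i, t=10.0, b=-10.0):
    # Sigmoid loss for positive pair i: reads only the single entry (i, i).
    return np.log1p(np.exp(-(t * sims[i, i] + b)))

rng = np.random.default_rng(1)
emb = rng.normal(size=(8, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
small = emb[:4] @ emb[:4].T   # the same 4 pairs inside a batch of 4...
large = emb @ emb.T           # ...and inside a batch of 8

# Pair 0's sigmoid term is unchanged by the batch; the softmax term is not.
assert np.isclose(sigmoid_term(small, 0), sigmoid_term(large, 0))
assert not np.isclose(softmax_term(small, 0), softmax_term(large, 0))
```

Since each sigmoid term touches only one similarity entry, very large batches become easier to handle in practice, whereas the softmax term needs the full row of similarities before it can be evaluated.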

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-training could become accessible on smaller compute budgets or even single machines with further optimizations.
  • The method might apply to other contrastive learning setups beyond vision-language models.
  • Future work could explore even larger scales or different modalities using the same loss structure.

Load-bearing premise

The sigmoid loss will keep producing high-quality representations at new scales or on new data without needing hyper-parameter adjustments.

What would settle it

Training a larger SigLIP model on a new dataset with fixed hyperparameters and observing substantially worse zero-shot accuracy than a comparable softmax contrastive model would falsify the claim.

read the original abstract

We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We release our models at https://github.com/google-research/big_vision and hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a pairwise sigmoid loss (SigLIP) for language-image pre-training that operates directly on image-text pairs without requiring global softmax normalization over the batch. This design enables scaling batch sizes while also improving results at smaller batches. Combined with Locked-image Tuning, the authors report training a model to 84.5% ImageNet zero-shot accuracy using only four TPUv4 chips in two days. They further ablate the effects of batch size (up to 1M), examples versus pairs, and negative-to-positive ratios, concluding that benefits diminish beyond a 32k batch size.

Significance. If the reported accuracies and efficiency gains hold, the work is significant because it removes the dependence on large-batch normalization that has constrained contrastive vision-language training since CLIP. The ability to train competitive models with modest hardware (four TPUv4 chips) and the public release of models at https://github.com/google-research/big_vision both lower the barrier to entry and support reproducibility. The batch-size scaling study also provides concrete guidance on practical operating points.

major comments (2)
  1. [Abstract] The headline result of 84.5% ImageNet zero-shot accuracy with SigLiT on four TPUv4 chips in two days is load-bearing for the efficiency claim, yet the manuscript provides no accompanying table or section detailing the exact model size, training dataset, number of steps, or direct LiT baseline comparison under the identical four-chip budget; without these, the contribution attributable to the sigmoid loss versus other factors cannot be isolated.
  2. [Abstract] The paper states that the sigmoid loss 'performs better at smaller batch sizes' and 'allows further scaling up the batch size,' but the provided ablations stop at the authors' chosen regimes; there is no cross-model-size or cross-dataset experiment demonstrating that the sigmoid scale hyper-parameter transfers without retuning, which directly tests the weakest assumption that the loss remains effective when batch-wide normalization is removed.
minor comments (1)
  1. The GitHub release is welcome, but the manuscript should explicitly state whether the training scripts, exact hyper-parameters, and data-preprocessing pipelines used for the 84.5% result are included so that the two-day four-chip claim can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our work. We address each major comment below and will make revisions to enhance the clarity and completeness of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The headline result of 84.5% ImageNet zero-shot accuracy with SigLiT on four TPUv4 chips in two days is load-bearing for the efficiency claim, yet the manuscript provides no accompanying table or section detailing the exact model size, training dataset, number of steps, or direct LiT baseline comparison under the identical four-chip budget; without these, the contribution attributable to the sigmoid loss versus other factors cannot be isolated.

    Authors: We agree that the abstract's efficiency claim requires supporting details to allow isolation of the sigmoid loss contribution. In the revised manuscript we will add a dedicated table (or subsection) that specifies the exact model size, training dataset, number of steps, and a direct LiT baseline comparison trained under the identical four TPUv4-chip, two-day budget. revision: yes

  2. Referee: [Abstract] The paper states that the sigmoid loss 'performs better at smaller batch sizes' and 'allows further scaling up the batch size,' but the provided ablations stop at the authors' chosen regimes; there is no cross-model-size or cross-dataset experiment demonstrating that the sigmoid scale hyper-parameter transfers without retuning, which directly tests the weakest assumption that the loss remains effective when batch-wide normalization is removed.

    Authors: The sigmoid scale hyper-parameter was held fixed at the same value across the entire set of batch-size ablations (from small batches through 1M). Because the same fixed value was used without retuning while still showing gains at smaller batches and continued (though diminishing) benefits at larger batches, the experiments already provide evidence that the loss remains effective once batch-wide normalization is removed. We will revise the text to explicitly note that the scale was not retuned and to discuss this as evidence of robustness. Additional cross-model or cross-dataset sweeps of the scale parameter are beyond the scope of this work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical proposal and validation of pairwise sigmoid loss

full rationale

The paper defines a new sigmoid loss directly on image-text pairs without softmax normalization over the batch, then reports results from training SigLiT models on standard datasets and measuring zero-shot ImageNet accuracy. No equations reduce the reported accuracies or scaling claims back to fitted parameters by construction, and the work contains no load-bearing self-citations, uniqueness theorems, or ansatzes smuggled from prior author work. The derivation chain is self-contained because performance is obtained through explicit training runs rather than any algebraic or statistical reduction to the input assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the empirical effectiveness of the sigmoid loss under standard contrastive pre-training assumptions; no new physical entities or unstated mathematical axioms are introduced beyond the loss definition itself.

free parameters (1)
  • sigmoid scale parameter
    The loss formulation typically includes a learnable or fixed scaling factor analogous to temperature in contrastive losses; its value is not specified in the abstract.
axioms (1)
  • domain assumption: Image-text pairs provide sufficient supervision without requiring global batch statistics for normalization
    The loss is stated to operate solely on individual pairs.
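The imbalance implied by this axiom is what makes the scale (and bias) load-bearing: a batch of N pairs yields N positives against N(N-1) negatives. A small numerical check, assuming for illustration a log-parameterized scale initialized at 10 and a bias at -10 (these particular values are an assumption of this sketch, not stated in the abstract):

```python
import numpy as np

# With roughly uncorrelated embeddings at initialization, pairwise
# similarities sit near 0, so every logit starts near t * 0 + b = -10.
t_prime = np.log(10.0)          # log-parameterized scale (assumed init)
t = np.exp(t_prime)
b = -10.0                       # bias (assumed init)

n = 32_768                      # the 32k operating point discussed above
negatives_per_positive = n - 1  # each image faces n-1 mismatched texts

logit_at_init = t * 0.0 + b
neg_term = np.log1p(np.exp(logit_at_init))   # one negative pair's loss
pos_term = np.log1p(np.exp(-logit_at_init))  # one positive pair's loss

# Starting logits deep on the negative side keeps the overwhelming mass of
# negative terms near zero, so the imbalance does not swamp early training.
assert neg_term < 1e-4 and pos_term > 9.0
```

If this parameter had to be retuned at every new scale or dataset, the load-bearing premise above would weaken, which is why the referee's second comment targets exactly this point.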

pith-pipeline@v0.9.0 · 5486 in / 1184 out tokens · 31664 ms · 2026-05-16T13:00:19.151560+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ASH: Agents that Self-Hone via Embodied Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

  2. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  3. Aligned Multi-View Scripts for Universal Chart-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces an aligned multi-language dataset and a language-conditioned low-rank adapter for generating executable plotting code in Python, R, and LaTeX from chart images.

  4. RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

    cs.CV 2026-04 unverdicted novelty 7.0

    RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.

  5. Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

    cs.AI 2026-04 conditional novelty 7.0

    Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.

  6. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  7. MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

    cs.IR 2026-04 unverdicted novelty 7.0

    MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

  8. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  9. Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    cs.CV 2026-04 unverdicted novelty 6.0

    MG-MTTA improves VLM accuracy under modality-specific shifts by replacing pure entropy minimization with majorization-guided adaptation that incorporates a reliability-aware gate prior.

  10. MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    MaMe is a differentiable matrix-only token merging method that doubles ViT-B throughput with a 2% accuracy drop on pre-trained models and enables faster, higher-quality image synthesis when paired with MaRe.

  11. IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.

  12. Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation

    cs.RO 2026-02 unverdicted novelty 6.0

    A simulation-to-real navigation policy enables a quadrotor to locate an odor source using only basic olfaction sensors and optional vision, validated in indoor real-world flights.

  13. F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    cs.RO 2025-09 unverdicted novelty 6.0

    F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

  14. CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis

    cs.CV 2026-05 unverdicted novelty 5.0

    CropVLM is a domain-adapted vision-language model that achieves 72.51% zero-shot crop classification accuracy and superior open-set detection performance on novel species without retraining.

  15. Motif-Video 2B: Technical Report

    cs.CV 2026-04 unverdicted novelty 5.0

    Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

  16. FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views

    cs.CV 2026-04 unverdicted novelty 5.0

    FF3R unifies geometric and semantic 3D reconstruction in a single annotation-free feed-forward network trained solely via RGB and feature rendering supervision.

  17. BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

    cs.IR 2026-04 unverdicted novelty 5.0

    BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...

  18. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  19. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 4.0

    Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...

  20. Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

    cs.CV 2025-12 unverdicted novelty 4.0

    Zero-shot VLMs reach at most 62% accuracy on agricultural classification tasks while supervised models like YOLO11 perform markedly higher, indicating they are not ready to replace task-specific systems.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 19 Pith papers · 9 internal anchors

  1. [1]

    Getting vit in shape: Scaling laws for compute-optimal model design

    Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, and Lucas Beyer. Getting vit in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023. 7, 8, 17

  2. [2]

    ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019. 7, 17

  3. [3]

    Are we done with ImageNet?

    Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with ImageNet? CoRR, abs/2006.07159, 2020. 2, 7, 9, 17

  4. [4]

    Better plain ViT baselines for ImageNet-1k

    Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Better plain ViT baselines for ImageNet-1k, 2022. 10, 17

  5. [5]

    Big vision

    Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Big vision. https://github.com/google-research/big_vision, 2022. 10, 17

  6. [6]

    Coyo-700M: Image-text pair dataset

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022. 1

  7. [7]

    Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR,

  8. [8]

    VLP: A survey on vision-language pre-training

    Feilong Chen, Duzhen Zhang, Minglun Han, Xiu-Yi Chen, Jing Shi, Shuang Xu, and Bo Xu. VLP: A survey on vision-language pre-training. Int. J. Autom. Comput., 20(1):38–56,

  9. [9]

    Generative pre-training from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pre-training from pixels. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR, 2020. 8

  10. [10]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 2, 4

  11. [11]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015. 7, 17

  12. [12]

    Symbolic discovery of optimization algorithms

    Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms, 2023. 2, 6

  13. [13]

    Xi Chen, Xiao Wang, Soravit Changpinyo, A. J. Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Has- san Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Aya...

  14. [14]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 4, 7, 9, 17

  15. [15]

    Redcaps: Web-curated image-text data created by the people, for the people

    Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. Redcaps: Web-curated image-text data created by the people, for the people. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. 1

  16. [16]

    Clip itself is a strong fine-tuner: Achieving 85.7% and 88.0% top-1 accuracy with vit-b and vit-l on imagenet

    Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Shuyang Gu, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, and Nenghai Yu. Clip itself is a strong fine-tuner: Achieving 85.7% and 88.0% top-1 accuracy with vit-b and vit-l on imagenet. CoRR, abs/2212.06138, 2022. 2

  17. [17]

    An image is worth 16×16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021. 3, 7, 17

  18. [18]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. 1

  19. [19]

    Dimensionality reduction by learning an invariant mapping

    Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, 2006. 2

  20. [20]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15979–15988. IEEE, 2022. 8

  21. [21]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. CoRR, abs/2111.12233, 2021. 1, 2

  22. [22]

    OpenCLIP

    Gabriel Ilharco, Mitchell Wortsman, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP. Zenodo, 2021. 8

  23. [23]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 1, 2

  24. [24]

    Supervised contrastive learning

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural I...

  25. [25]

    Big transfer (BiT): General visual representation learning

    Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (BiT): General visual representation learning. In ECCV, 2020. 7

  26. [26]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Univ. of Toronto, 2009. 9

  27. [27]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, 2018. 5, 14

  28. [28]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore,...

  29. [29]

    CLIPA-v2: Scaling CLIP training with 81.1% zero-shot ImageNet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy

    Xianhang Li, Zeyu Wang, and Cihang Xie. CLIPA-v2: Scaling CLIP training with 81.1% zero-shot ImageNet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy. CoRR, abs/2306.15658, 2023. 8

  30. [30]

    Scaling language-image pre-training via masking

    Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. CoRR, abs/2212.00794, 2022. 2, 7

  31. [31]

    Simple open-vocabulary object detection

    Matthias Minderer, Alexey A. Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple open-vocabulary object detection. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and ...

  32. [32]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In danah boyd and Jamie H. Morgenstern, editors, Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 20...

  33. [33]

    Open vocabulary semantic segmentation with patch aligned contrastive learning

    Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah, Philip H. S. Torr, and Ser-Nam Lim. Open vocabulary semantic segmentation with patch aligned contrastive learning, 2022. 2

  34. [34]

    Cats and dogs

    Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012. 9

  35. [35]

    Combined scaling for zero-shot transfer learning

    Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu, Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and Quoc V. Le. Combined scaling for zero-shot transfer learning. CoRR, abs/2111.10050, 2021. 2, 4

  36. [36]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021. 1, 2, 3, 8

  37. [37]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019. 5, 14

  38. [38]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. 17

  39. [39]

    Do ImageNet classifiers generalize to ImageNet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In ICML, 2019. 7, 17

  40. [40]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: an open large-scale dataset for training next generation image-text model...

  41. [41]

    WIT: wikipedia-based image text dataset for multimodal multilingual machine learning

    Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. WIT: wikipedia-based image text dataset for multimodal multilingual machine learning. CoRR, abs/2103.01913, 2021. 1

  42. [42]

    How to train your ViT? Data, augmentation, and regularization in vision transformers

    Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. CoRR, abs/2106.10270, 2021. 1, 6, 7

  43. [43]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: improved training techniques for CLIP at scale. CoRR, abs/2303.15389, 2023. 8

  44. [44]

    Crossmodal-3600: A massively multilingual multimodal evaluation dataset

    Ashish V. Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022...

  45. [45]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023. 7

  46. [46]

    Representation Learning with Contrastive Predictive Coding

    Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018. 2

  47. [47]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 3, 17

  48. [48]

    NLLB-CLIP – train performant multilingual image retrieval model on a budget

    Alexander Visheratin. NLLB-CLIP – train performant multilingual image retrieval model on a budget, 2023. 7

  49. [49]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. CoRR, abs/2205.14100, 2022. 1, 2

  50. [50]

    Simvlm: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,

  51. [51]

    Resnet strikes back: An improved training procedure in timm

    Ross Wightman, Hugo Touvron, and Hervé Jégou. Resnet strikes back: An improved training procedure in timm. CoRR, abs/2110.00476, 2021. 2

  52. [52]

    Reaching 80% zero-shot accuracy with OpenCLIP: VIT-G/14 trained on LAION-2B

    Mitchell Wortsman. Reaching 80% zero-shot accuracy with OpenCLIP: VIT-G/14 trained on LAION-2B. https://web.archive.org/web/20230127012732/https://laion.ai/blog/giant-openclip/. 2

  53. [53]

    Robust fine-tuning of zero-shot models

    Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 79...

  54. [54]

    mT5: A massively multilingual pre-trained text-to- text transformer

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to- text transformer. In NAACL-HLT, 2021. 5, 17

  55. [55]

    Unified contrastive learning in image-text-label space

    Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, and Jianfeng Gao. Unified contrastive learning in image-text-label space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 19141–19151. IEEE, 2022. 2

  56. [56]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. CoRR, abs/2205.01917, 2022. 2

  57. [57]

    Florence: A New Foundation Model for Computer Vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: A new foundation model for computer vision. CoRR, abs/2...

  58. [58]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. CVPR, 2022. 1, 4, 6, 7, 14

  59. [59]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 18102–18112. IEEE, 2022. 1, 2, 3, 4, 6, 7, 14

  60. [60]

    Contrastive learning of medical visual representations from paired images and text

    Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. Contrastive learning of medical visual representations from paired images and text. In Zachary C. Lipton, Rajesh Ranganath, Mark P. Sendak, Michael W. Sjoding, and Serena Yeung, editors, Proceedings of the Machine Learning for Healthcare Conference, MLHC 2022, 5-6 A...