pith. machine review for the scientific record.

arxiv: 2205.01917 · v2 · submitted 2022-05-04 · 💻 cs.CV · cs.LG · cs.MM

Recognition: 2 theorem links · Lean Theorem

CoCa: Contrastive Captioners are Image-Text Foundation Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 10:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.MM
keywords contrastive learning · image captioning · foundation models · zero-shot transfer · multimodal models · image-text · encoder-decoder

The pith

CoCa trains a single encoder-decoder jointly with contrastive and captioning losses to create image-text foundation models that set a new state of the art on ImageNet and multimodal tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoCa as a single model pretrained from scratch on large-scale image-text data using two losses together. A contrastive loss aligns image embeddings with text embeddings, while a captioning loss trains the model to generate text tokens from images autoregressively. The decoder is split so its first half processes only text without seeing the image, and its second half adds cross-attention to the image encoder for joint representations. This design unifies natural language supervision across unlabeled web data and labeled datasets by treating all annotations as text. The resulting model transfers with zero-shot inference or minimal adaptation to achieve leading accuracy on ImageNet classification, video action recognition, image captioning, retrieval, and visual question answering.
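
To make the two-pathway design concrete, here is a minimal sketch of the forward pass it describes, written as hedged PyTorch-style code (not the authors' implementation). The class name, layer counts, dimensions, the mean-pooled stand-in for attentional pooling, and the last-token text pooling are illustrative assumptions; only the overall structure (unimodal text layers without cross-attention, then cross-attending multimodal layers, plus a contrastive and a captioning loss) follows the description above.

```python
# Minimal sketch of CoCa's dual-objective forward pass, under the assumptions
# stated in the lead-in. Patch embeddings are assumed to be precomputed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoCaSketch(nn.Module):
    def __init__(self, dim=512, heads=8, n_uni=6, n_multi=6, vocab=32000):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)
        self.img_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=6)
        self.tok_emb = nn.Embedding(vocab, dim)
        # First half of the decoder: causal self-attention only, no view of the image.
        self.uni_decoder = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(n_uni)])
        # Second half: cross-attends to the image encoder outputs.
        self.multi_decoder = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, heads, batch_first=True) for _ in range(n_multi)])
        self.to_logits = nn.Linear(dim, vocab)
        self.temperature = nn.Parameter(torch.tensor(0.07))

    def forward(self, img_patches, text_ids):
        B, T = text_ids.shape
        causal = torch.full((T, T), float("-inf")).triu(1)  # mask future tokens

        img_feats = self.img_encoder(self.img_proj(img_patches))   # (B, N, dim)
        img_embed = F.normalize(img_feats.mean(dim=1), dim=-1)     # stand-in for attentional pooling

        x = self.tok_emb(text_ids)
        for layer in self.uni_decoder:                              # unimodal text representations
            x = layer(x, src_mask=causal)
        txt_embed = F.normalize(x[:, -1], dim=-1)                   # CLS-style pooling of last token

        y = x
        for layer in self.multi_decoder:                            # multimodal image-text representations
            y = layer(y, img_feats, tgt_mask=causal)
        logits = self.to_logits(y)

        # Contrastive loss between the unimodal image and text embeddings.
        sims = img_embed @ txt_embed.t() / self.temperature
        targets = torch.arange(B, device=sims.device)
        con_loss = 0.5 * (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets))
        # Captioning loss: autoregressive next-token prediction from multimodal outputs.
        cap_loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                                   text_ids[:, 1:].reshape(-1))
        return con_loss, cap_loss
```

A call like CoCaSketch()(torch.randn(2, 196, 512), torch.randint(0, 32000, (2, 16))) returns the two loss terms, which the training objective then combines with the weighting discussed in the free-parameter ledger below; a variant of this sketch is reused further down when describing the decoder ablation.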

Core claim

CoCa is an image-text encoder-decoder model pretrained end-to-end by computing a contrastive loss between unimodal image and text embeddings together with a captioning loss on multimodal decoder outputs; the decoder omits cross-attention in its first half to produce separate unimodal text representations before cascading the remaining layers that cross-attend to the image encoder.
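
Spelled out, the joint objective the core claim refers to takes a familiar two-term form. The notation below (normalized unimodal image and text embeddings x_i and y_i, a learned temperature sigma, and loss weights lambda_Con and lambda_Cap) is a reconstruction from the description above, not a quotation of the paper's equations.

```latex
% Reconstructed form of the joint objective; symbols are assumptions.
\begin{align}
\mathcal{L}_{\mathrm{Con}} &= -\frac{1}{N}\sum_{i=1}^{N}\left(
  \log\frac{\exp(x_i^{\top} y_i/\sigma)}{\sum_{j=1}^{N}\exp(x_i^{\top} y_j/\sigma)}
  + \log\frac{\exp(y_i^{\top} x_i/\sigma)}{\sum_{j=1}^{N}\exp(y_i^{\top} x_j/\sigma)}\right) \\
\mathcal{L}_{\mathrm{Cap}} &= -\sum_{t=1}^{T}\log P_{\theta}\!\left(u_t \mid u_{<t},\, x\right) \\
\mathcal{L}_{\mathrm{CoCa}} &= \lambda_{\mathrm{Con}}\,\mathcal{L}_{\mathrm{Con}}
  + \lambda_{\mathrm{Cap}}\,\mathcal{L}_{\mathrm{Cap}}
\end{align}
```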

What carries the argument

The cascaded decoder that produces unimodal text representations without cross-attention in its first half and multimodal image-text representations with cross-attention in its second half.

Load-bearing premise

Omitting cross-attention in the first half of the decoder layers cleanly separates unimodal text representations from multimodal ones without harming overall capacity or optimization stability.

What would settle it

An experiment that trains an otherwise identical model but keeps cross-attention in every decoder layer and measures whether zero-shot ImageNet top-1 accuracy or captioning metrics improve or degrade.
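
As a concrete reading of that experiment, the ablation can be phrased as two configurations of the sketch introduced earlier; the constructor arguments below are hypothetical and reuse the CoCaSketch class from that sketch rather than the authors' setup.

```python
# Hypothetical instantiation of the settling experiment, reusing CoCaSketch
# from the sketch above (arguments are assumptions, not the paper's config).
baseline  = CoCaSketch(n_uni=6, n_multi=6)   # cascaded design: unimodal then multimodal layers
all_cross = CoCaSketch(n_uni=0, n_multi=12)  # cross-attention in every decoder layer
# Note: with n_uni=0 the sketch would take its contrastive text embedding from raw
# token embeddings; a faithful variant would instead pool it from the multimodal
# outputs, then compare zero-shot ImageNet top-1 and captioning metrics.
```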

read the original abstract

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoCa, a minimalist image-text encoder-decoder foundation model pretrained jointly from scratch with a contrastive loss on unimodal image and text embeddings and a captioning loss on multimodal decoder outputs. The key design omits cross-attention in the first half of decoder layers to produce unimodal text representations for the contrastive objective while cascading the remaining layers for multimodal image-text representations; all labels are treated as text for unified supervision. It reports state-of-the-art results including 86.3% zero-shot top-1 on ImageNet, 90.6% with a frozen encoder and learned head, and 91.0% finetuned top-1 on ImageNet, along with leading numbers on Kinetics, MSCOCO, VQA, and captioning tasks.

Significance. If the results hold, this work is significant for showing an efficient unification of contrastive and generative pretraining in a single computational graph with minimal overhead, yielding strong transfer across visual recognition, retrieval, and multimodal understanding benchmarks. The seamless treatment of all supervision as text and the reported consistent gains on multiple independent held-out test sets are notable strengths.

major comments (2)
  1. [Model Architecture / Decoder Design] The central architectural claim—that omitting cross-attention in the first half of decoder layers cleanly separates unimodal text representations (for contrastive loss) from multimodal ones (for captioning loss) with minimal overhead—lacks supporting ablation. No comparison is shown against a standard encoder-decoder transformer with cross-attention in every decoder layer or against applying the contrastive loss to multimodal embeddings. This choice is load-bearing for the unification claim and the reported ImageNet numbers (86.3% zero-shot, 91.0% finetuned).
  2. [Experiments and Results] The experimental results report strong absolute numbers (e.g., 91.0% ImageNet top-1 finetuned) without error bars, sensitivity analysis on the contrastive-vs-captioning loss weight (explicitly listed as a free parameter), or full hyperparameter tables. This weakens assessment of robustness and reproducibility of the gains.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief statement of model scale or training data volume to contextualize the results.
  2. [Methods] Notation for unimodal vs. multimodal embeddings should be defined once and used consistently to avoid minor ambiguity in the methods description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and detailed review. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Model Architecture / Decoder Design] The central architectural claim—that omitting cross-attention in the first half of decoder layers cleanly separates unimodal text representations (for contrastive loss) from multimodal ones (for captioning loss) with minimal overhead—lacks supporting ablation. No comparison is shown against a standard encoder-decoder transformer with cross-attention in every decoder layer or against applying the contrastive loss to multimodal embeddings. This choice is load-bearing for the unification claim and the reported ImageNet numbers (86.3% zero-shot, 91.0% finetuned).

    Authors: We thank the referee for highlighting this point. The motivation for the cascaded decoder design (unimodal text representations in early layers for contrastive loss, followed by multimodal layers for captioning) is described in Section 3.2, and the overall unification is supported by the end-to-end results across tasks. However, we agree that dedicated ablations would provide stronger evidence. In the revised manuscript we will add comparisons to a standard full cross-attention encoder-decoder baseline and to variants applying contrastive loss on multimodal embeddings; these will appear in Section 4.2 with accompanying efficiency and accuracy metrics. revision: yes

  2. Referee: [Experiments and Results] The experimental results report strong absolute numbers (e.g., 91.0% ImageNet top-1 finetuned) without error bars, sensitivity analysis on the contrastive-vs-captioning loss weight (explicitly listed as a free parameter), or full hyperparameter tables. This weakens assessment of robustness and reproducibility of the gains.

    Authors: We acknowledge the referee's concern about robustness and reproducibility. In the revised version we will report error bars for the primary ImageNet and downstream results (computed from multiple independent runs where compute permits), include a sensitivity sweep over the contrastive-to-captioning loss weight in the supplementary material, and provide complete hyperparameter tables in a new appendix section. revision: yes
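
The promised sensitivity sweep over the single flagged free parameter could look like the following minimal sketch; the helper name, grid values, and the retrain/evaluate step are placeholder assumptions rather than the authors' protocol, and only the weighted sum reflects the stated free parameter.

```python
# Minimal sketch of a loss-weight sensitivity sweep (placeholder values).
def coca_objective(con_loss, cap_loss, lambda_con=1.0, lambda_cap=2.0):
    # The single free parameter flagged in the ledger: relative weighting of objectives.
    return lambda_con * con_loss + lambda_cap * cap_loss

for lambda_cap in (0.5, 1.0, 2.0, 4.0):
    # Retrain with coca_objective(con, cap, lambda_cap=lambda_cap), then re-evaluate
    # zero-shot ImageNet top-1 and MSCOCO captioning metrics for each setting.
    pass
```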

Circularity Check

0 steps flagged

No circularity: empirical results on held-out benchmarks with no self-referential derivations

full rationale

The paper proposes an architectural variant (omitting cross-attention in early decoder layers) and reports direct empirical measurements such as 86.3% zero-shot ImageNet top-1 accuracy. These quantities are obtained by standard training and evaluation on external test sets; no equation or result is defined in terms of itself, no fitted parameter is renamed as a prediction, and no load-bearing premise reduces to a self-citation chain. The unification of contrastive and captioning objectives is achieved through explicit joint optimization rather than by construction, so the derivation chain is checked against independent benchmarks rather than against itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard transformer layer assumptions and the empirical effectiveness of the chosen loss combination; no new physical entities or unstated mathematical axioms are introduced.

free parameters (1)
  • contrastive vs captioning loss weight
    The relative weighting between the two objectives is chosen during training to achieve the reported numbers.
axioms (1)
  • standard math: Standard multi-head attention and feed-forward transformer blocks behave as in the original Vaswani et al. formulation
    The paper builds directly on the transformer architecture without re-deriving its properties.

pith-pipeline@v0.9.0 · 5673 in / 1462 out tokens · 57446 ms · 2026-05-15T10:48:13.746831+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OZ-TAL: Online Zero-Shot Temporal Action Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    Defines the OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

  2. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  3. InstrAct: Towards Action-Centric Understanding in Instructional Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on...

  4. InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

  5. LRM: Large Reconstruction Model for Single Image to 3D

    cs.CV 2023-11 conditional novelty 7.0

    LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

  6. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    cs.CV 2023-01 unverdicted novelty 7.0

    BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...

  7. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  8. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    cs.CV 2022-05 accept novelty 7.0

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  9. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  10. Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.

  11. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  12. Revisiting Feature Prediction for Learning Visual Representations from Video

    cs.CV 2024-02 conditional novelty 6.0

    V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

  13. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  14. Aligning Text-to-Image Models using Human Feedback

    cs.LG 2023-02 unverdicted novelty 6.0

    A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.

  15. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    cs.CV 2022-06 unverdicted novelty 6.0

    Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.

  16. Let ViT Speak: Generative Language-Image Pre-training

    cs.CV 2026-05 unverdicted novelty 5.0

    GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

  17. From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

    cs.CV 2026-04 unverdicted novelty 5.0

    VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

  18. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  19. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  20. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · cited by 20 Pith papers · 12 internal anchors

  1. [1]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  2. [2]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  3. [3]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  5. [5]

    Pathways: Asynchronous distributed dataflow for ml

    Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, et al. Pathways: Asynchronous distributed dataflow for ml. arXiv preprint arXiv:2203.12533, 2022

  6. [6]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014

  7. [7]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015

  8. [8]

    Two-stream convolutional networks for action recognition in videos

    Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  10. [10]

    Coatnet: Marrying convolution and attention for all data sizes

    Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021

  11. [11]

    Co-training transformer with videos and images improves action recognition

    Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M Dai, Ruoming Pang, and Fei Sha. Co-training transformer with videos and images improves action recognition. arXiv preprint arXiv:2112.07175, 2021

  12. [12]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  13. [13]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021

  14. [14]

    Florence: A new foundation model for computer vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021

  15. [15]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015

  16. [16]

    Simvlm: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021

  17. [17]

    Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022

  18. [18]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012

  19. [19]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  20. [20]

    Exploring the limits of weakly supervised pretraining

    Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pages 181–196, 2018

  21. [21]

    Scaling vision transformers, 2021

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers, 2021

  22. [22]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021

  23. [23]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021

  24. [24]

    Simmim: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021

  25. [25]

    Lxmert: Learning cross-modality encoder representations from transformers

    Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019

  26. [26]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In ECCV, 2020

  27. [27]

    Vinvl: Revisiting visual representations in vision-language models

    Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5579–5588, June 2021

  28. [28]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015

  29. [29]

    Vilt: Vision-and-language transformer without convolution or region supervision, 2021

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision, 2021

  30. [30]

    Vlmo: Unified vision-language pre- training with mixture-of-modality-experts

    Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. Vlmo: Unified vision-language pre- training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021

  31. [31]

    Unified contrastive learning in image-text-label space, 2022

    Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, and Jianfeng Gao. Unified contrastive learning in image-text-label space, 2022

  32. [32]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. arXiv preprint arXiv:2111.07991, 2021

  33. [33]

    Combined scaling for open-vocabulary image classification

    Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, and Quoc V. Le. Combined scaling for open-vocabulary image classification, 2021

  34. [34]

    Answer-me: Multi-task open-vocabulary visual question answering

    AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, and Anelia Angelova. Answer-me: Multi-task open-vocabulary visual question answering. arXiv preprint arXiv:2205.00949, 2022

  35. [35]

    Flava: A foundational language and vision alignment model

    Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. arXiv preprint arXiv:2112.04482, 2021

  36. [36]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 2021

  37. [37]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022

  38. [38]

    The evolution of out-of-distribution robustness throughout fine-tuning

    Anders Andreassen, Yasaman Bahri, Behnam Neyshabur, and Rebecca Roelofs. The evolution of out-of-distribution robustness throughout fine-tuning. arXiv preprint arXiv:2106.15831, 2021

  39. [39]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  40. [40]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  41. [41]

    A learning algorithm for continually running fully recurrent neural networks

    Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989

  42. [42]

    Set transformer: A framework for attention-based permutation-invariant neural networks

    Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019

  43. [43]

    Neural Machine Translation of Rare Words with Subword Units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015

  44. [44]

    Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

    Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959, 2018

  45. [45]

    Lingvo: a modular and scalable framework for sequence-to-sequence modeling, 2019

    Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling, 2019

  46. [46]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  47. [47]

    Automatic cross-replica sharding of weight update in data-parallel training

    Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Hongjun Choi, Blake Hechtman, and Shibo Wang. Automatic cross-replica sharding of weight update in data-parallel training. arXiv preprint arXiv:2004.13336, 2020

  48. [48]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  49. [49]

    Gspmd: general and scalable parallelization for ml computation graphs

    Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. Gspmd: general and scalable parallelization for ml computation graphs. arXiv preprint arXiv:2105.04663, 2021

  50. [50]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018

  51. [51]

    Meta pseudo labels

    Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V Le. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11557–11568, 2021

  52. [52]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo- Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. arXiv preprint arXiv:2203.05482, 2022

  53. [53]

    Vivit: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021

  54. [54]

    Movinets: Mobile video networks for efficient video recognition

    Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong. Movinets: Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16020–16030, 2021

  55. [55]

    Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text

    Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34, 2021

  56. [56]

    Masked feature prediction for self-supervised visual pre-training

    Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133, 2021

  57. [57]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

  58. [58]

    A Short Note about Kinetics-600

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018

  59. [59]

    A short note on the kinetics-700 human action dataset

    Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019

  60. [60]

    Moments in time dataset: one million videos for event understanding

    Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE transactions on pattern analysis and machine intelligence, 42(2):502–508, 2019

  61. [61]

    Filip: Fine-grained interactive language-image pre-training

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021

  62. [62]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015

  63. [63]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  64. [64]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021

  65. [65]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021

  66. [66]

    Do imagenet classifiers generalize to imagenet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019

  67. [67]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019

  68. [68]

    Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in neural information processing systems, 32, 2019

  69. [69]

    A straightforward framework for video retrieval using clip

    Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, and Hugo Terashima-Marín. A straightforward framework for video retrieval using clip. arXiv preprint arXiv:2102.12443, 2021

  70. [70]

    Socratic models: Composing zero-shot multimodal reasoning with language

    Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022

  71. [71]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016

  72. [72]

    A short note on the kinetics-700-2020 human action dataset

    Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. A short note on the kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864, 2020

  73. [73]

    How much can clip benefit vision-and-language tasks?

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021

  74. [74]

    An empirical study of training end-to-end vision-and-language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Zicheng Liu, Michael Zeng, et al. An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387, 2021

  75. [75]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017

  76. [76]

    Visual Entailment: A Novel Task for Fine-Grained Image Understanding

    Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019

  77. [77]

    A Corpus for Reasoning About Natural Language Grounded in Photographs

    Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491, 2018

  78. [78]

    nocaps: novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8948–8957, 2019

  79. [79]

    Self-critical sequence training for image captioning

    Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024, 2017

  80. [80]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. arXiv preprint arXiv:2111.12233, 2021