pith. machine review for the scientific record.

arxiv: 2205.01917 · v2 · submitted 2022-05-04 · 💻 cs.CV · cs.LG · cs.MM

Recognition: 2 theorem links · Lean Theorem

CoCa: Contrastive Captioners are Image-Text Foundation Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 10:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.MM
keywords contrastive learning · image captioning · foundation models · zero-shot transfer · multimodal models · image-text · encoder-decoder

The pith

CoCa trains a single encoder-decoder jointly with contrastive and captioning losses to create image-text foundation models that set a new state of the art on ImageNet and multimodal tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoCa as a single model pretrained from scratch on large-scale image-text data using two losses together. A contrastive loss aligns image embeddings with text embeddings, while a captioning loss trains the model to generate text tokens from images autoregressively. The decoder is split so its first half processes only text without seeing the image, and its second half adds cross-attention to the image encoder for joint representations. This design unifies natural language supervision across unlabeled web data and labeled datasets by treating all annotations as text. The resulting model transfers with zero-shot inference or minimal adaptation to achieve leading accuracy on ImageNet classification, video action recognition, image captioning, retrieval, and visual question answering.
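
To make the two-pathway design concrete, here is a minimal sketch of the forward pass it describes, written as hedged PyTorch-style code (not the authors' implementation). The class name, layer counts, dimensions, the mean-pooled stand-in for attentional pooling, and the last-token text pooling are illustrative assumptions; only the overall structure (unimodal text layers without cross-attention, then cross-attending multimodal layers, plus a contrastive and a captioning loss) follows the description above.

```python
# Minimal sketch of CoCa's dual-objective forward pass, under the assumptions
# stated in the lead-in. Patch embeddings are assumed to be precomputed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoCaSketch(nn.Module):
    def __init__(self, dim=512, heads=8, n_uni=6, n_multi=6, vocab=32000):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)
        self.img_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=6)
        self.tok_emb = nn.Embedding(vocab, dim)
        # First half of the decoder: causal self-attention only, no view of the image.
        self.uni_decoder = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(n_uni)])
        # Second half: cross-attends to the image encoder outputs.
        self.multi_decoder = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, heads, batch_first=True) for _ in range(n_multi)])
        self.to_logits = nn.Linear(dim, vocab)
        self.temperature = nn.Parameter(torch.tensor(0.07))

    def forward(self, img_patches, text_ids):
        B, T = text_ids.shape
        causal = torch.full((T, T), float("-inf")).triu(1)  # mask future tokens

        img_feats = self.img_encoder(self.img_proj(img_patches))   # (B, N, dim)
        img_embed = F.normalize(img_feats.mean(dim=1), dim=-1)     # stand-in for attentional pooling

        x = self.tok_emb(text_ids)
        for layer in self.uni_decoder:                              # unimodal text representations
            x = layer(x, src_mask=causal)
        txt_embed = F.normalize(x[:, -1], dim=-1)                   # CLS-style pooling of last token

        y = x
        for layer in self.multi_decoder:                            # multimodal image-text representations
            y = layer(y, img_feats, tgt_mask=causal)
        logits = self.to_logits(y)

        # Contrastive loss between the unimodal image and text embeddings.
        sims = img_embed @ txt_embed.t() / self.temperature
        targets = torch.arange(B, device=sims.device)
        con_loss = 0.5 * (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets))
        # Captioning loss: autoregressive next-token prediction from multimodal outputs.
        cap_loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                                   text_ids[:, 1:].reshape(-1))
        return con_loss, cap_loss
```

A call like CoCaSketch()(torch.randn(2, 196, 512), torch.randint(0, 32000, (2, 16))) returns the two loss terms, which the training objective then combines with the weighting discussed in the free-parameter ledger below; a variant of this sketch is reused further down when describing the decoder ablation.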

Core claim

CoCa is an image-text encoder-decoder model pretrained end-to-end by computing a contrastive loss between unimodal image and text embeddings together with a captioning loss on multimodal decoder outputs; the decoder omits cross-attention in its first half to produce separate unimodal text representations before cascading the remaining layers that cross-attend to the image encoder.
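
Spelled out, the joint objective the core claim refers to takes a familiar two-term form. The notation below (normalized unimodal image and text embeddings x_i and y_i, a learned temperature sigma, and loss weights lambda_Con and lambda_Cap) is a reconstruction from the description above, not a quotation of the paper's equations.

```latex
% Reconstructed form of the joint objective; symbols are assumptions.
\begin{align}
\mathcal{L}_{\mathrm{Con}} &= -\frac{1}{N}\sum_{i=1}^{N}\left(
  \log\frac{\exp(x_i^{\top} y_i/\sigma)}{\sum_{j=1}^{N}\exp(x_i^{\top} y_j/\sigma)}
  + \log\frac{\exp(y_i^{\top} x_i/\sigma)}{\sum_{j=1}^{N}\exp(y_i^{\top} x_j/\sigma)}\right) \\
\mathcal{L}_{\mathrm{Cap}} &= -\sum_{t=1}^{T}\log P_{\theta}\!\left(u_t \mid u_{<t},\, x\right) \\
\mathcal{L}_{\mathrm{CoCa}} &= \lambda_{\mathrm{Con}}\,\mathcal{L}_{\mathrm{Con}}
  + \lambda_{\mathrm{Cap}}\,\mathcal{L}_{\mathrm{Cap}}
\end{align}
```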

What carries the argument

The cascaded decoder that produces unimodal text representations without cross-attention in its first half and multimodal image-text representations with cross-attention in its second half.

Load-bearing premise

Omitting cross-attention in the first half of the decoder layers cleanly separates unimodal text representations from multimodal ones without harming overall capacity or optimization stability.

What would settle it

An experiment that trains an otherwise identical model but keeps cross-attention in every decoder layer and measures whether zero-shot ImageNet top-1 accuracy or captioning metrics improve or degrade.
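
As a concrete reading of that experiment, the ablation can be phrased as two configurations of the sketch introduced earlier; the constructor arguments below are hypothetical and reuse the CoCaSketch class from that sketch rather than the authors' setup.

```python
# Hypothetical instantiation of the settling experiment, reusing CoCaSketch
# from the sketch above (arguments are assumptions, not the paper's config).
baseline  = CoCaSketch(n_uni=6, n_multi=6)   # cascaded design: unimodal then multimodal layers
all_cross = CoCaSketch(n_uni=0, n_multi=12)  # cross-attention in every decoder layer
# Note: with n_uni=0 the sketch would take its contrastive text embedding from raw
# token embeddings; a faithful variant would instead pool it from the multimodal
# outputs, then compare zero-shot ImageNet top-1 and captioning metrics.
```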

read the original abstract

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoCa, a minimalist image-text encoder-decoder foundation model pretrained jointly from scratch with a contrastive loss on unimodal image and text embeddings and a captioning loss on multimodal decoder outputs. The key design omits cross-attention in the first half of decoder layers to produce unimodal text representations for the contrastive objective while cascading the remaining layers for multimodal image-text representations; all labels are treated as text for unified supervision. It reports state-of-the-art results including 86.3% zero-shot top-1 on ImageNet, 90.6% with a frozen encoder and learned head, and 91.0% finetuned top-1 on ImageNet, along with leading numbers on Kinetics, MSCOCO, VQA, and captioning tasks.

Significance. If the results hold, this work is significant for showing an efficient unification of contrastive and generative pretraining in a single computational graph with minimal overhead, yielding strong transfer across visual recognition, retrieval, and multimodal understanding benchmarks. The seamless treatment of all supervision as text and the reported consistent gains on multiple independent held-out test sets are notable strengths.

major comments (2)
  1. [Model Architecture / Decoder Design] The central architectural claim—that omitting cross-attention in the first half of decoder layers cleanly separates unimodal text representations (for contrastive loss) from multimodal ones (for captioning loss) with minimal overhead—lacks supporting ablation. No comparison is shown against a standard encoder-decoder transformer with cross-attention in every decoder layer or against applying the contrastive loss to multimodal embeddings. This choice is load-bearing for the unification claim and the reported ImageNet numbers (86.3% zero-shot, 91.0% finetuned).
  2. [Experiments and Results] The experimental results report strong absolute numbers (e.g., 91.0% ImageNet top-1 finetuned) without error bars, sensitivity analysis on the contrastive-vs-captioning loss weight (explicitly listed as a free parameter), or full hyperparameter tables. This weakens assessment of robustness and reproducibility of the gains.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief statement of model scale or training data volume to contextualize the results.
  2. [Methods] Notation for unimodal vs. multimodal embeddings should be defined once and used consistently to avoid minor ambiguity in the methods description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and detailed review. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Model Architecture / Decoder Design] The central architectural claim—that omitting cross-attention in the first half of decoder layers cleanly separates unimodal text representations (for contrastive loss) from multimodal ones (for captioning loss) with minimal overhead—lacks supporting ablation. No comparison is shown against a standard encoder-decoder transformer with cross-attention in every decoder layer or against applying the contrastive loss to multimodal embeddings. This choice is load-bearing for the unification claim and the reported ImageNet numbers (86.3% zero-shot, 91.0% finetuned).

    Authors: We thank the referee for highlighting this point. The motivation for the cascaded decoder design (unimodal text representations in early layers for contrastive loss, followed by multimodal layers for captioning) is described in Section 3.2, and the overall unification is supported by the end-to-end results across tasks. However, we agree that dedicated ablations would provide stronger evidence. In the revised manuscript we will add comparisons to a standard full cross-attention encoder-decoder baseline and to variants applying contrastive loss on multimodal embeddings; these will appear in Section 4.2 with accompanying efficiency and accuracy metrics. revision: yes

  2. Referee: [Experiments and Results] The experimental results report strong absolute numbers (e.g., 91.0% ImageNet top-1 finetuned) without error bars, sensitivity analysis on the contrastive-vs-captioning loss weight (explicitly listed as a free parameter), or full hyperparameter tables. This weakens assessment of robustness and reproducibility of the gains.

    Authors: We acknowledge the referee's concern about robustness and reproducibility. In the revised version we will report error bars for the primary ImageNet and downstream results (computed from multiple independent runs where compute permits), include a sensitivity sweep over the contrastive-to-captioning loss weight in the supplementary material, and provide complete hyperparameter tables in a new appendix section. revision: yes
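
The promised sensitivity sweep over the single flagged free parameter could look like the following minimal sketch; the helper name, grid values, and the retrain/evaluate step are placeholder assumptions rather than the authors' protocol, and only the weighted sum reflects the stated free parameter.

```python
# Minimal sketch of a loss-weight sensitivity sweep (placeholder values).
def coca_objective(con_loss, cap_loss, lambda_con=1.0, lambda_cap=2.0):
    # The single free parameter flagged in the ledger: relative weighting of objectives.
    return lambda_con * con_loss + lambda_cap * cap_loss

for lambda_cap in (0.5, 1.0, 2.0, 4.0):
    # Retrain with coca_objective(con, cap, lambda_cap=lambda_cap), then re-evaluate
    # zero-shot ImageNet top-1 and MSCOCO captioning metrics for each setting.
    pass
```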

Circularity Check

0 steps flagged

No circularity: empirical results on held-out benchmarks with no self-referential derivations

full rationale

The paper proposes an architectural variant (omitting cross-attention in early decoder layers) and reports direct empirical measurements such as 86.3% zero-shot ImageNet top-1 accuracy. These quantities are obtained by standard training and evaluation on external test sets; no equation or result is defined in terms of itself, no fitted parameter is renamed as a prediction, and no load-bearing premise reduces to a self-citation chain. The unification of contrastive and captioning objectives is achieved through explicit joint optimization rather than by construction, so the derivation chain is checked against independent benchmarks rather than against itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard transformer layer assumptions and the empirical effectiveness of the chosen loss combination; no new physical entities or unstated mathematical axioms are introduced.

free parameters (1)
  • contrastive vs captioning loss weight
    The relative weighting between the two objectives is chosen during training to achieve the reported numbers.
axioms (1)
  • standard math: Standard multi-head attention and feed-forward transformer blocks behave as in the original Vaswani et al. formulation
    The paper builds directly on the transformer architecture without re-deriving its properties.

pith-pipeline@v0.9.0 · 5673 in / 1462 out tokens · 57446 ms · 2026-05-15T10:48:13.746831+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OZ-TAL: Online Zero-Shot Temporal Action Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    Defines the OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

  2. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  3. InstrAct: Towards Action-Centric Understanding in Instructional Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on...

  4. InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

  5. LRM: Large Reconstruction Model for Single Image to 3D

    cs.CV 2023-11 conditional novelty 7.0

    LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

  6. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    cs.CV 2023-01 unverdicted novelty 7.0

    BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...

  7. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  8. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    cs.CV 2022-05 accept novelty 7.0

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  9. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  10. Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.

  11. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  12. Revisiting Feature Prediction for Learning Visual Representations from Video

    cs.CV 2024-02 conditional novelty 6.0

    V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

  13. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  14. Aligning Text-to-Image Models using Human Feedback

    cs.LG 2023-02 unverdicted novelty 6.0

    A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.

  15. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    cs.CV 2022-06 unverdicted novelty 6.0

    Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.

  16. Let ViT Speak: Generative Language-Image Pre-training

    cs.CV 2026-05 unverdicted novelty 5.0

    GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

  17. From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

    cs.CV 2026-04 unverdicted novelty 5.0

    VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

  18. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  19. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  20. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · cited by 20 Pith papers · 12 internal anchors

  1. [1]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  2. [2]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  3. [3]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  5. [5]

    Pathways: Asynchronous distributed dataflow for ml

    Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, et al. Pathways: Asynchronous distributed dataflow for ml. arXiv preprint arXiv:2203.12533, 2022

  6. [6]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014

  7. [7]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015

  8. [8]

    Two-stream convolutional networks for action recognition in videos

    Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  10. [10]

    Coatnet: Marrying convolution and attention for all data sizes

    Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021

  11. [11]

    Co-training transformer with videos and images improves action recognition

    Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M Dai, Ruoming Pang, and Fei Sha. Co-training transformer with videos and images improves action recognition. arXiv preprint arXiv:2112.07175, 2021

  12. [12]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  13. [13]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021

  14. [14]

    Florence: A new foundation model for computer vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021

  15. [15]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015

  16. [16]

    Simvlm: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021

  17. [17]

    Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022

  18. [18]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012

  19. [19]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  20. [20]

    Exploring the limits of weakly supervised pretraining

    Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pages 181–196, 2018

  21. [21]

    Scaling vision transformers, 2021

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers, 2021

  22. [22]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021

  23. [23]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021

  24. [24]

    Simmim: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021

  25. [25]

    Lxmert: Learning cross-modality encoder representations from transformers

    Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019

  26. [26]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In ECCV, 2020

  27. [27]

    Vinvl: Revisiting visual representations in vision-language models

    Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5579–5588, June 2021

  28. [28]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015

  29. [29]

    Vilt: Vision-and-language transformer without convolution or region supervision, 2021

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision, 2021

  30. [30]

    Vlmo: Unified vision-language pre- training with mixture-of-modality-experts

    Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. Vlmo: Unified vision-language pre- training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021

  31. [31]

    Unified contrastive learning in image-text-label space, 2022

    Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, and Jianfeng Gao. Unified contrastive learning in image-text-label space, 2022

  32. [32]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. arXiv preprint arXiv:2111.07991, 2021

  33. [33]

    Combined scaling for open-vocabulary image classification

    Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, and Quoc V. Le. Combined scaling for open-vocabulary image classification, 2021

  34. [34]

    Answer-me: Multi-task open-vocabulary visual question answering

    AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, and Anelia Angelova. Answer-me: Multi-task open-vocabulary visual question answering. arXiv preprint arXiv:2205.00949, 2022

  35. [35]

    Flava: A foundational language and vision alignment model

    Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. arXiv preprint arXiv:2112.04482, 2021

  36. [36]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 2021

  37. [37]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022

  38. [38]

    The evolution of out-of-distribution robustness throughout fine-tuning

    Anders Andreassen, Yasaman Bahri, Behnam Neyshabur, and Rebecca Roelofs. The evolution of out-of-distribution robustness throughout fine-tuning. arXiv preprint arXiv:2106.15831, 2021

  39. [39]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  40. [40]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  41. [41]

    A learning algorithm for continually running fully recurrent neural networks

    Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989

  42. [42]

    Set transformer: A framework for attention-based permutation-invariant neural networks

    Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019

  43. [43]

    Neural Machine Translation of Rare Words with Subword Units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015

  44. [44]

    Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

    Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959, 2018

  45. [45]

    Lingvo: a modular and scalable framework for sequence-to-sequence modeling, 2019

    Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling, 2019

  46. [46]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  47. [47]

    Automatic cross-replica sharding of weight update in data-parallel training

    Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Hongjun Choi, Blake Hechtman, and Shibo Wang. Automatic cross-replica sharding of weight update in data-parallel training. arXiv preprint arXiv:2004.13336, 2020

  48. [48]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  49. [49]

    Gspmd: general and scalable parallelization for ml computation graphs

    Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. Gspmd: general and scalable parallelization for ml computation graphs. arXiv preprint arXiv:2105.04663, 2021

  50. [50]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018

  51. [51]

    Meta pseudo labels

    Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V Le. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11557–11568, 2021

  52. [52]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo- Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. arXiv preprint arXiv:2203.05482, 2022

  53. [53]

    Vivit: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021

  54. [54]

    Movinets: Mobile video networks for efficient video recognition

    Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong. Movinets: Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16020–16030, 2021

  55. [55]

    Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text

    Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34, 2021

  56. [56]

    Masked feature prediction for self-supervised visual pre-training

    Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133, 2021

  57. [57]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

  58. [58]

    A Short Note about Kinetics-600

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018

  59. [59]

    A short note on the kinetics-700 human action dataset

    Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019

  60. [60]

    Moments in time dataset: one million videos for event understanding

    Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE transactions on pattern analysis and machine intelligence, 42(2):502–508, 2019

  61. [61]

    Filip: Fine-grained interactive language-image pre-training

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021

  62. [62]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015

  63. [63]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  64. [64]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021

  65. [65]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021

  66. [66]

    Do imagenet classifiers generalize to imagenet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019

  67. [67]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019

  68. [68]

    Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in neural information processing systems, 32, 2019

  69. [69]

    A straightforward framework for video retrieval using clip

    Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, and Hugo Terashima-Marín. A straightforward framework for video retrieval using clip. arXiv preprint arXiv:2102.12443, 2021

  70. [70]

    Socratic models: Composing zero-shot multimodal reasoning with language

    Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022

  71. [71]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016

  72. [72]

    A short note on the kinetics-700-2020 human action dataset

    Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. A short note on the kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864, 2020

  73. [73]

    How much can clip benefit vision-and-language tasks?

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021

  74. [74]

    An empirical study of training end-to-end vision-and-language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Zicheng Liu, Michael Zeng, et al. An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387, 2021

  75. [75]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017

  76. [76]

    Visual Entailment: A Novel Task for Fine-Grained Image Understanding

    Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019

  77. [77]

    A Corpus for Reasoning About Natural Language Grounded in Photographs

    Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491, 2018

  78. [78]

    nocaps: novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8948–8957, 2019

  79. [79]

    Self-critical sequence training for image captioning

    Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024, 2017

  80. [80]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. arXiv preprint arXiv:2111.12233, 2021