pith. machine review for the scientific record.

arxiv: 2209.06794 · v4 · submitted 2022-09-14 · 💻 cs.CV · cs.CL

Recognition: 1 theorem link


PaLI: A Jointly-Scaled Multilingual Language-Image Model

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 09:25 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords PaLI · vision-language model · multimodal scaling · multilingual image-text data · image captioning · visual question answering · scene text understanding

The pith

PaLI jointly scales a 4-billion-parameter vision transformer with a language model on a 10B multilingual image-text set to reach state-of-the-art on captioning, VQA and scene-text tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PaLI extends large language models to joint vision-language modeling by taking image and text inputs and generating text outputs for many tasks. The approach pairs existing pre-trained language models with a new 4-billion-parameter Vision Transformer and trains them together rather than scaling language alone. A fresh dataset supplies 10 billion image-text pairs spanning more than 100 languages to support this joint scaling. The resulting model delivers leading scores on image captioning, visual question answering, and scene-text understanding while using a simple modular design.
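To make that modular interface concrete, here is a minimal sketch, not the authors' implementation (PaLI is built on pre-trained mT5-style and ViT components with JAX/Flax-based tooling): a ViT-style encoder turns the image into patch tokens, a linear projection maps them into the language model's embedding space, and the projected visual tokens are prepended to the embedded text prompt before an encoder-decoder model produces the answer as text. The framework choice (PyTorch), module sizes, and layer counts below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ToyPaLI(nn.Module):
    """Toy stand-in for the PaLI-style interface: image + text in, text out."""

    def __init__(self, vit_dim=1024, lm_dim=768, vocab_size=32000, n_heads=8):
        super().__init__()
        # Stand-in for a (pre-trained) ViT: patchify + a couple of encoder layers.
        self.patch_embed = nn.Conv2d(3, vit_dim, kernel_size=16, stride=16)
        vit_layer = nn.TransformerEncoderLayer(vit_dim, n_heads, batch_first=True)
        self.vit = nn.TransformerEncoder(vit_layer, num_layers=2)
        # Linear projection from the vision width into the LM embedding width.
        self.visual_proj = nn.Linear(vit_dim, lm_dim)
        # Stand-in for a (pre-trained) encoder-decoder language model.
        self.token_embed = nn.Embedding(vocab_size, lm_dim)
        self.lm = nn.Transformer(
            d_model=lm_dim, nhead=n_heads,
            num_encoder_layers=2, num_decoder_layers=2, batch_first=True,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image, prompt_ids, target_ids):
        # Image -> sequence of patch embeddings -> vision tokens.
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        visual_tokens = self.visual_proj(self.vit(patches))
        # Prepend projected visual tokens to the embedded text prompt, so one
        # text-generation interface serves captioning, VQA, scene-text, etc.
        encoder_input = torch.cat(
            [visual_tokens, self.token_embed(prompt_ids)], dim=1)
        decoder_input = self.token_embed(target_ids)
        hidden = self.lm(encoder_input, decoder_input)
        return self.lm_head(hidden)  # per-position vocabulary logits


model = ToyPaLI()
logits = model(
    torch.randn(1, 3, 224, 224),        # image
    torch.randint(0, 32000, (1, 16)),   # tokenized task prompt, e.g. a question
    torch.randint(0, 32000, (1, 8)),    # tokenized target text (teacher forcing)
)
print(logits.shape)  # torch.Size([1, 8, 32000])
```

The real model swaps these toy modules for the 4-billion-parameter ViT-e and a large pre-trained encoder-decoder LM; the sketch only shows the shape of the interface, image and text in, text out.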

Core claim

PaLI is formed by pairing a 4-billion-parameter ViT vision encoder with a large pre-trained language encoder-decoder, then training both jointly on a 10-billion-example multilingual image-text corpus; this produces state-of-the-art results on captioning, visual question answering, and scene-text understanding through a unified text-generation interface.

What carries the argument

Joint scaling of the vision encoder (ViT-e, 4B parameters) and language model on the new 10B multilingual image-text dataset, using a flexible text-generation interface.

If this is right

  • Joint scaling of vision and language yields gains larger than scaling the language component by itself.
  • The same modular text-generation interface supports many different vision-language tasks without task-specific heads.
  • Training on text in over 100 languages extends strong performance across multilingual benchmarks.
  • Leveraging pre-trained language and vision components lowers the total cost of building large multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Further enlargement of the vision encoder may continue to lift scores if language capacity is increased in step.
  • The joint-scaling pattern could transfer to other paired data such as video and text.
  • Dataset composition and language balance likely matter as much as raw model size for the observed results.

Load-bearing premise

The claimed performance gains arise mainly from jointly increasing the capacity of the vision and language components rather than from the particular quality or balance of the 10 billion training pairs.

What would settle it

Train an otherwise identical model that keeps the language component fixed but uses a much smaller vision encoder, then compare its scores on the same captioning, VQA and scene-text benchmarks.
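A minimal sketch of that settling experiment, assuming a generic training-and-evaluation harness: the language component is held fixed while vision encoders of different sizes are swapped in, and every variant is scored on the same benchmark suite. The encoder names, parameter counts, benchmark labels, and the train_variant / evaluate helpers below are placeholders, not the paper's pipeline.

```python
# Illustrative ablation protocol only; the helpers are stubs and all names
# and sizes are assumptions rather than the paper's actual configuration.

VISION_ENCODERS = {      # assumed approximate parameter counts
    "small ViT": 90e6,
    "large ViT": 300e6,
    "ViT-e-scale": 4e9,
}
BENCHMARKS = ["COCO captioning (CIDEr)", "VQAv2 (accuracy)", "TextVQA (accuracy)"]


def train_variant(vision_params, language_model="fixed pre-trained encoder-decoder"):
    """Stub: would train one PaLI-style model with the given vision-encoder size."""
    return {"vision_params": vision_params, "language_model": language_model}


def evaluate(model, benchmark):
    """Stub: would return the trained model's score on one benchmark."""
    return float("nan")


def vision_scaling_ablation():
    # If scores track vision-encoder size while the language model is held
    # fixed, the gains are attributable to joint scaling; flat scores would
    # point back at the quality and balance of the training pairs instead.
    results = {}
    for name, n_params in VISION_ENCODERS.items():
        model = train_variant(vision_params=n_params)
        results[name] = {b: evaluate(model, b) for b in BENCHMARKS}
    return results


print(vision_scaling_ablation())
```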

read the original abstract

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces PaLI, a multimodal model that extends large language models to joint language-vision tasks by combining pre-trained encoder-decoder LMs with Vision Transformers (including a new 4B-parameter ViT-e) and training on a new 10B-image multilingual image-text dataset spanning over 100 languages. It claims that joint scaling of vision and language components yields state-of-the-art results on captioning, visual question answering, scene-text understanding, and related tasks while preserving a simple, modular, and scalable architecture.

Significance. If the empirical results hold, the work provides concrete evidence that joint scaling of vision and language components on large multilingual data produces measurable gains over separately scaled models, while leveraging existing pre-trained components to control training cost. The modular design and explicit quantification of benefits from a 4B vision encoder are strengths that could influence subsequent multimodal scaling efforts.

major comments (1)
  1. Abstract: the claim of state-of-the-art performance across multiple tasks is presented without reference to specific baselines, exact metric values, or ablations that isolate the contribution of joint scaling versus the new 10B dataset or the ViT-e size; this makes the central attribution of gains to joint scaling difficult to verify from the summary alone.
minor comments (1)
  1. The abstract mentions 'over 100 languages' but does not specify the language distribution or any balancing strategy; a brief note on this in the dataset section would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. The feedback on the abstract is helpful, and we address it directly below.

read point-by-point responses
  1. Referee: Abstract: the claim of state-of-the-art performance across multiple tasks is presented without reference to specific baselines, exact metric values, or ablations that isolate the contribution of joint scaling versus the new 10B dataset or the ViT-e size; this makes the central attribution of gains to joint scaling difficult to verify from the summary alone.

    Authors: We agree that the abstract is high-level by design and does not include the requested quantitative details. The full manuscript addresses these points in detail: Section 4 and Tables 1–5 report exact metric values (e.g., CIDEr on COCO, VQA accuracy on VQAv2), comparisons against specific baselines (SimVLM, GIT, mT5-based models), and ablations that isolate joint scaling, the 10B multilingual dataset, and the 4B ViT-e contribution. To make the central claims easier to verify from the abstract alone, we will revise it to incorporate a small number of key metric values and baseline references while preserving conciseness. This revision will be included in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical scaling study that trains PaLI by combining existing pre-trained ViT and language-model components with a new 10B multilingual image-text dataset. All performance claims (SOTA on captioning, VQA, scene-text tasks) are presented as measured outcomes on held-out benchmarks rather than as outputs of any closed-form derivation or self-referential equation. No equations appear that define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation chain is therefore the standard training-and-evaluation pipeline; results remain externally falsifiable and do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of joint scaling and the new dataset; no new entities are postulated, but the approach assumes pre-trained components transfer well when scaled together.

free parameters (1)
  • ViT-e parameter count = 4B
    4 billion parameters chosen to balance with language model scale
axioms (1)
  • domain assumption: Pre-trained language models and ViTs transfer effectively when jointly scaled on large multilingual image-text data
    Invoked to justify capitalizing on existing models and the new training mix

pith-pipeline@v0.9.0 · 5646 in / 1335 out tokens · 42961 ms · 2026-05-16T09:25:21.335193+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  2. WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

    cs.CV 2026-03 unverdicted novelty 7.0

    WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

  3. Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

    cs.CV 2026-02 unverdicted novelty 7.0

    Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent g...

  4. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  5. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    cs.CV 2023-01 unverdicted novelty 7.0

    BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...

  6. Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

    cs.CV 2026-04 conditional novelty 6.0

    CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted...

  7. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  8. Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

    cs.CV 2026-02 conditional novelty 6.0

    Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.

  9. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  10. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  11. Capabilities of Gemini Models in Medicine

    cs.AI 2024-04 unverdicted novelty 6.0

    Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.

  12. Sigmoid Loss for Language Image Pre-Training

    cs.CV 2023-03 conditional novelty 6.0

    SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.

  13. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  14. BiCLIP: Domain Canonicalization via Structured Geometric Transformation

    cs.CV 2026-03 unverdicted novelty 5.0

    BiCLIP recovers a structured geometric transformation from few-shot anchors to canonicalize domain features in VLMs and reports state-of-the-art results on 11 benchmarks.

  15. On The Application of Linear Attention in Multimodal Transformers

    cs.CV 2026-04 unverdicted novelty 4.0

    Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.

  16. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

  17. PaliGemma 2: A Family of Versatile VLMs for Transfer

    cs.CV 2024-12 unverdicted novelty 4.0

    PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...

  18. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  19. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    cs.CV 2023-08 unverdicted novelty 4.0

    OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

Reference graph

Works this paper leans on

185 extracted references · 185 canonical work pages · cited by 19 Pith papers · 10 internal anchors

  1. [1]

    Tallyqa: Answering complex counting questions

    Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp.\ 8076--8084, 2019

  2. [2]

    nocaps : Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps : Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 8948--8957, 2019

  3. [3]

    Crossvqa: Scalably generating benchmarks for systematically testing vqa generalization

    Arjun Akula, Soravit Changpinyo, Boqing Gong, Piyush Sharma, Song-Chun Zhu, and Radu Soricut. Crossvqa: Scalably generating benchmarks for systematically testing vqa generalization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 2148--2166, 2021

  4. [5]

    On the cross-lingual transferability of monolingual representations

    Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 4623--4637, 2020

  5. [6]

    ObjectNet : a large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Joshua Tenenbaum, and Boris Katz. ObjectNet : a large-scale bias-controlled dataset for pushing the limits of object recognition models. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.\ 9453--9463, 2019

  6. [8]

    Big Vision

    Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Big Vision. https://github.com/google-research/big_vision, 2022

  7. [9]

    Scene text visual question answering

    Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, C.V. Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Scene text visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 4290--4300, 2019. doi:10.1109/ICCV.2019.00439

  8. [10]

    JAX: composable transformations of Python+NumPy programs, 2018

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax

  9. [11]

    Language models are few-shot learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp.\ 1877--1901, 2020

  10. [12]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3558--3568, 2021

  11. [13]

    All you may need for VQA are image captions

    Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, and Radu Soricut. All you may need for VQA are image captions. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 1947--1963, Jul 2022 a

  12. [14]

    MaXM: Towards multilingual visual question answering

    Soravit Changpinyo, Linting Xue, Idan Szpektor, Ashish V. Thapliyal, Julien Amelot, Michal Yarom, Xi Chen, and Radu Soricut. MaXM: Towards multilingual visual question answering. arXiv preprint arXiv:2209.05401, 2022b

  13. [18]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pp.\ 104--120, 2020

  14. [19]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, pp.\ 1931--1942, 2021

  15. [21]

    TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages

    Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454--470, 2020

  16. [22]

    XNLI : Evaluating cross-lingual sentence representations

    Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI : Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.\ 2475--2485, 2018

  17. [23]

    ImageNet : A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet : A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255, 2009

  18. [24]

    Long-term recurrent convolutional networks for visual recognition and description

    Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 2625--2634, 2015

  19. [25]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, 2021

  20. [26]

    GLaM : Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM : Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp.\ 5547--5569, 2022

  21. [27]

    Datasheets for datasets

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86--92, 2021

  22. [28]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org

  23. [29]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6904--6913, 2017

  24. [30]

    KAT : A knowledge augmented transformer for vision-and-language

    Liangke Gui, Borui Wang, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, and Jianfeng Gao. KAT : A knowledge augmented transformer for vision-and-language. arXiv preprint arXiv:2112.08614, 2021

  25. [31]

    VizWiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3608--3617, 2018

  26. [32]

    Captioning images taken by people who are blind

    Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In European Conference on Computer Vision, pp.\ 417--434. Springer, 2020

  27. [33]

    Flax: A neural network library and ecosystem for JAX, 2020

    Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2020. URL http://github.com/google/flax

  28. [34]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 8340--8349, 2021 a

  29. [35]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15262--15271, 2021 b

  30. [37]

    XTREME : A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation

    Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. XTREME : A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pp.\ 4411--4421, 2020

  31. [38]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 17980--17989, 2022

  32. [39]

    GPipe : efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe : efficient training of giant neural networks using pipeline parallelism. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.\ 103--112, 2019

  33. [40]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp.\ 4904--4916, 2021

  34. [41]

    A domain-specific supercomputer for training deep neural networks

    Norman P Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7):67--78, 2020

  35. [44]

    Deep visual-semantic alignments for generating image descriptions

    Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3128--3137, 2015

  36. [45]

    PreSTU : Pre-training for scene-text understanding

    Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. PreSTU : Pre-training for scene-text understanding. arXiv preprint arXiv:2209.05534, 2022

  37. [46]

    Big Transfer (BiT) : General visual representation learning

    Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT) : General visual representation learning. Lecture Notes in Computer Science, pp.\ 491--507, 2020. ISSN 1611-3349

  38. [47]

    Visual Genome : Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32--73, 2017

  39. [48]

    The Open Images dataset v4

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images dataset v4. International Journal of Computer Vision, 128(7):1956--1981, 2020

  40. [50]

    Exploring the limits of weakly supervised pretraining

    Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pp.\ 181--196, 2018

  41. [51]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pp.\ 220--229, 2019

  42. [52]

    xGQA: Cross-lingual visual question answering

    Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin Steitz, Stefan Roth, Ivan Vulić, and Iryna Gurevych. xGQA: Cross-lingual visual question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pp.\ 2497--2511, 2022

  43. [54]

    Pre-training image-language transformers for open-vocabulary tasks

    AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Pre-training image-language transformers for open-vocabulary tasks. In T4V: Transformers for Vision Workshop, Conference on Computer Vision and Pattern Recognition, 2022 a

  44. [55]

    Answer-Me : Multi-task learning for generalization to many question-answering tasks

    AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, and Anelia Angelova. Answer-Me : Multi-task learning for generalization to many question-answering tasks. arXiv preprint arXiv:2205.00949, 2022 b

  45. [56]

    Data Cards : Purposeful and transparent dataset documentation for responsible AI

    Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards : Purposeful and transparent dataset documentation for responsible AI . In FAccT '22: 2022 ACM Conference on Fairness, Accountability, and Transparency , pp.\ 1776--1826, 2022

  46. [57]

    Winner team Mia at TextVQA challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model

    Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li, Xianbiao Qi, Peng Gao, and Guotong Xie. Winner team Mia at TextVQA challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model. arXiv preprint arXiv:2106.15332, 2021

  47. [58]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp.\ 8748--8763, 2021

  48. [59]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020

  49. [60]

    Do ImageNet classifiers generalize to ImageNet ? In International Conference on Machine Learning, pp.\ 5389--5400, 2019

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet ? In International Conference on Machine Learning, pp.\ 5389--5400, 2019

  50. [61]

    Faster R-CNN : Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN : Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015

  51. [62]

    Scaling vision with sparse mixture of experts

    Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583--8595, 2021

  52. [63]

    Scaling up models and data with t5x and seqio

    Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189, 2022

  53. [64]

    A step toward more inclusive people annotations for fairness

    Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru. A step toward more inclusive people annotations for fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES '21, pp.\ 916--925, 2021. ISBN 9781450384735

  54. [65]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 8430--8439, 2019

  55. [66]

    Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp.\ 2556--2565, 2018

  56. [67]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp.\ 4596--4604, 2018

  57. [68]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  58. [69]

    TextCaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: a dataset for image captioning with reading comprehension. In European conference on computer vision, pp.\ 742--758, 2020

  59. [70]

    Towards VQA models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 8317--8326, 2019

  60. [71]

    LXMERT : Learning cross-modality encoder representations from transformers

    Hao Tan and Mohit Bansal. LXMERT : Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019

  61. [73]

    MLP-Mixer: An all-MLP architecture for vision

    Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34:24261--24272, 2021

  62. [74]

    Fixing the train-test resolution discrepancy

    Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution discrepancy. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.\ 8252--8262, 2019

  63. [75]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.\ 6000--6010, 2017

  64. [76]

    CIDEr : Consensus-based image description evaluation

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr : Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 4566--4575, 2015

  65. [77]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3156--3164, 2015

  66. [78]

    SuperGLUE : a stickier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. SuperGLUE : a stickier benchmark for general-purpose language understanding systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.\ 3266--3280, 2019 a

  67. [79]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pp.\ 10506--10518, 2019 b

  68. [80]

    GIT: A generative image-to-text transformer for vision and language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT : A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022 a

  69. [82]

    Image as a foreign language: BEiT pretraining for all vision and vision-language tasks

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022 c

  70. [83]

    SimVLM: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021

  71. [84]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp.\ 23965--23998, 2022

  72. [85]

    mT5: A massively multilingual pre-trained text-to-text transformer

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 483--498, Jun 2021

  73. [86]

    ByT5: Towards a token-free future with pre-trained byte-to-byte models

    Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291--306, 2022

  74. [87]

    TAP : Text-aware pre-training for Text-VQA and Text-Caption

    Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. TAP : Text-aware pre-training for Text-VQA and Text-Caption . In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 8751--8761, 2021

  75. [88]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa : Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022

  76. [90]

    Merlot: Multimodal neural script knowledge models

    Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634--23651, 2021

  77. [92]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 12104--12113, 2022 a

  78. [93]

    LiT : Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT : Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18123--18133, 2022 b

  79. [94]

    VinVL : Revisiting visual representations in vision-language models

    Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. VinVL : Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 5579--5588, 2021

  80. [95]

    Deep Learning

Showing first 80 references.