pith. machine review for the scientific record.

arxiv: 2209.06794 · v4 · submitted 2022-09-14 · 💻 cs.CV · cs.CL

Recognition: 1 theorem link


PaLI: A Jointly-Scaled Multilingual Language-Image Model

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 09:25 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords PaLI · vision-language model · multimodal scaling · multilingual image-text data · image captioning · visual question answering · scene text understanding

The pith

PaLI jointly scales a 4-billion-parameter vision transformer with a language model on a 10B multilingual image-text set to reach state-of-the-art on captioning, VQA and scene-text tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PaLI extends large language models to joint vision-language modeling by taking image and text inputs and generating text outputs for many tasks. The approach pairs existing pre-trained language models with a new 4-billion-parameter Vision Transformer and trains them together rather than scaling language alone. A fresh dataset supplies 10 billion image-text pairs spanning more than 100 languages to support this joint scaling. The resulting model delivers leading scores on image captioning, visual question answering, and scene-text understanding while using a simple modular design.
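To make that modular interface concrete, here is a minimal sketch, not the authors' implementation (PaLI is built on pre-trained mT5-style and ViT components with JAX/Flax-based tooling): a ViT-style encoder turns the image into patch tokens, a linear projection maps them into the language model's embedding space, and the projected visual tokens are prepended to the embedded text prompt before an encoder-decoder model produces the answer as text. The framework choice (PyTorch), module sizes, and layer counts below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ToyPaLI(nn.Module):
    """Toy stand-in for the PaLI-style interface: image + text in, text out."""

    def __init__(self, vit_dim=1024, lm_dim=768, vocab_size=32000, n_heads=8):
        super().__init__()
        # Stand-in for a (pre-trained) ViT: patchify + a couple of encoder layers.
        self.patch_embed = nn.Conv2d(3, vit_dim, kernel_size=16, stride=16)
        vit_layer = nn.TransformerEncoderLayer(vit_dim, n_heads, batch_first=True)
        self.vit = nn.TransformerEncoder(vit_layer, num_layers=2)
        # Linear projection from the vision width into the LM embedding width.
        self.visual_proj = nn.Linear(vit_dim, lm_dim)
        # Stand-in for a (pre-trained) encoder-decoder language model.
        self.token_embed = nn.Embedding(vocab_size, lm_dim)
        self.lm = nn.Transformer(
            d_model=lm_dim, nhead=n_heads,
            num_encoder_layers=2, num_decoder_layers=2, batch_first=True,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image, prompt_ids, target_ids):
        # Image -> sequence of patch embeddings -> vision tokens.
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        visual_tokens = self.visual_proj(self.vit(patches))
        # Prepend projected visual tokens to the embedded text prompt, so one
        # text-generation interface serves captioning, VQA, scene-text, etc.
        encoder_input = torch.cat(
            [visual_tokens, self.token_embed(prompt_ids)], dim=1)
        decoder_input = self.token_embed(target_ids)
        hidden = self.lm(encoder_input, decoder_input)
        return self.lm_head(hidden)  # per-position vocabulary logits


model = ToyPaLI()
logits = model(
    torch.randn(1, 3, 224, 224),        # image
    torch.randint(0, 32000, (1, 16)),   # tokenized task prompt, e.g. a question
    torch.randint(0, 32000, (1, 8)),    # tokenized target text (teacher forcing)
)
print(logits.shape)  # torch.Size([1, 8, 32000])
```

The real model swaps these toy modules for the 4-billion-parameter ViT-e and a large pre-trained encoder-decoder LM; the sketch only shows the shape of the interface, image and text in, text out.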

Core claim

PaLI is formed by pairing a 4-billion-parameter ViT vision encoder with a large pre-trained language encoder-decoder, then training both jointly on a 10-billion-example multilingual image-text corpus; this produces state-of-the-art results on captioning, visual question answering, and scene-text understanding through a unified text-generation interface.

What carries the argument

Joint scaling of the vision encoder (ViT-e, 4B parameters) and language model on the new 10B multilingual image-text dataset, using a flexible text-generation interface.

If this is right

  • Joint scaling of vision and language yields gains larger than scaling the language component by itself.
  • The same modular text-generation interface supports many different vision-language tasks without task-specific heads.
  • Training on text in over 100 languages extends strong performance across multilingual benchmarks.
  • Leveraging pre-trained language and vision components lowers the total cost of building large multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Further enlargement of the vision encoder may continue to lift scores if language capacity is increased in step.
  • The joint-scaling pattern could transfer to other paired data such as video and text.
  • Dataset composition and language balance likely matter as much as raw model size for the observed results.

Load-bearing premise

The claimed performance gains arise mainly from jointly increasing the capacity of the vision and language components rather than from the particular quality or balance of the 10 billion training pairs.

What would settle it

Train an otherwise identical model that keeps the language component fixed but uses a much smaller vision encoder, then compare its scores on the same captioning, VQA and scene-text benchmarks.
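A minimal sketch of that settling experiment, assuming a generic training-and-evaluation harness: the language component is held fixed while vision encoders of different sizes are swapped in, and every variant is scored on the same benchmark suite. The encoder names, parameter counts, benchmark labels, and the train_variant / evaluate helpers below are placeholders, not the paper's pipeline.

```python
# Illustrative ablation protocol only; the helpers are stubs and all names
# and sizes are assumptions rather than the paper's actual configuration.

VISION_ENCODERS = {      # assumed approximate parameter counts
    "small ViT": 90e6,
    "large ViT": 300e6,
    "ViT-e-scale": 4e9,
}
BENCHMARKS = ["COCO captioning (CIDEr)", "VQAv2 (accuracy)", "TextVQA (accuracy)"]


def train_variant(vision_params, language_model="fixed pre-trained encoder-decoder"):
    """Stub: would train one PaLI-style model with the given vision-encoder size."""
    return {"vision_params": vision_params, "language_model": language_model}


def evaluate(model, benchmark):
    """Stub: would return the trained model's score on one benchmark."""
    return float("nan")


def vision_scaling_ablation():
    # If scores track vision-encoder size while the language model is held
    # fixed, the gains are attributable to joint scaling; flat scores would
    # point back at the quality and balance of the training pairs instead.
    results = {}
    for name, n_params in VISION_ENCODERS.items():
        model = train_variant(vision_params=n_params)
        results[name] = {b: evaluate(model, b) for b in BENCHMARKS}
    return results


print(vision_scaling_ablation())
```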

read the original abstract

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces PaLI, a multimodal model that extends large language models to joint language-vision tasks by combining pre-trained encoder-decoder LMs with Vision Transformers (including a new 4B-parameter ViT-e) and training on a new 10B-image multilingual image-text dataset spanning over 100 languages. It claims that joint scaling of vision and language components yields state-of-the-art results on captioning, visual question answering, scene-text understanding, and related tasks while preserving a simple, modular, and scalable architecture.

Significance. If the empirical results hold, the work provides concrete evidence that joint scaling of vision and language components on large multilingual data produces measurable gains over separately scaled models, while leveraging existing pre-trained components to control training cost. The modular design and explicit quantification of benefits from a 4B vision encoder are strengths that could influence subsequent multimodal scaling efforts.

major comments (1)
  1. Abstract: the claim of state-of-the-art performance across multiple tasks is presented without reference to specific baselines, exact metric values, or ablations that isolate the contribution of joint scaling versus the new 10B dataset or the ViT-e size; this makes the central attribution of gains to joint scaling difficult to verify from the summary alone.
minor comments (1)
  1. The abstract mentions 'over 100 languages' but does not specify the language distribution or any balancing strategy; a brief note on this in the dataset section would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. The feedback on the abstract is helpful, and we address it directly below.

read point-by-point responses
  1. Referee: Abstract: the claim of state-of-the-art performance across multiple tasks is presented without reference to specific baselines, exact metric values, or ablations that isolate the contribution of joint scaling versus the new 10B dataset or the ViT-e size; this makes the central attribution of gains to joint scaling difficult to verify from the summary alone.

    Authors: We agree that the abstract is high-level by design and does not include the requested quantitative details. The full manuscript addresses these points in detail: Section 4 and Tables 1–5 report exact metric values (e.g., CIDEr on COCO, VQA accuracy on VQAv2), comparisons against specific baselines (SimVLM, GIT, mT5-based models), and ablations that isolate joint scaling, the 10B multilingual dataset, and the 4B ViT-e contribution. To make the central claims easier to verify from the abstract alone, we will revise it to incorporate a small number of key metric values and baseline references while preserving conciseness. This revision will be included in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical scaling study that trains PaLI by combining existing pre-trained ViT and language-model components with a new 10B multilingual image-text dataset. All performance claims (SOTA on captioning, VQA, scene-text tasks) are presented as measured outcomes on held-out benchmarks rather than as outputs of any closed-form derivation or self-referential equation. No equations appear that define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation chain is therefore the standard training-and-evaluation pipeline; results remain externally falsifiable and do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of joint scaling and the new dataset; no new entities are postulated, but the approach assumes pre-trained components transfer well when scaled together.

free parameters (1)
  • ViT-e parameter count = 4B
    4 billion parameters chosen to balance with language model scale
axioms (1)
  • domain assumption: Pre-trained language models and ViTs transfer effectively when jointly scaled on large multilingual image-text data
    Invoked to justify capitalizing on existing models and the new training mix

pith-pipeline@v0.9.0 · 5646 in / 1335 out tokens · 42961 ms · 2026-05-16T09:25:21.335193+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  2. WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

    cs.CV 2026-03 unverdicted novelty 7.0

    WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

  3. Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

    cs.CV 2026-02 unverdicted novelty 7.0

    Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent g...

  4. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  5. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    cs.CV 2023-01 unverdicted novelty 7.0

    BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...

  6. Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

    cs.CV 2026-04 conditional novelty 6.0

    CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted...

  7. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  8. Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

    cs.CV 2026-02 conditional novelty 6.0

    Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.

  9. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  10. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  11. Capabilities of Gemini Models in Medicine

    cs.AI 2024-04 unverdicted novelty 6.0

    Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.

  12. Sigmoid Loss for Language Image Pre-Training

    cs.CV 2023-03 conditional novelty 6.0

    SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.

  13. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  14. BiCLIP: Domain Canonicalization via Structured Geometric Transformation

    cs.CV 2026-03 unverdicted novelty 5.0

    BiCLIP recovers a structured geometric transformation from few-shot anchors to canonicalize domain features in VLMs and reports state-of-the-art results on 11 benchmarks.

  15. On The Application of Linear Attention in Multimodal Transformers

    cs.CV 2026-04 unverdicted novelty 4.0

    Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.

  16. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

  17. PaliGemma 2: A Family of Versatile VLMs for Transfer

    cs.CV 2024-12 unverdicted novelty 4.0

    PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...

  18. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  19. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    cs.CV 2023-08 unverdicted novelty 4.0

    OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

Reference graph

Works this paper leans on

185 extracted references · 185 canonical work pages · cited by 19 Pith papers · 10 internal anchors

  1. [1]

    Tallyqa: Answering complex counting questions

    Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp.\ 8076--8084, 2019

  2. [2]

    nocaps : Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps : Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 8948--8957, 2019

  3. [3]

    Crossvqa: Scalably generating benchmarks for systematically testing vqa generalization

    Arjun Akula, Soravit Changpinyo, Boqing Gong, Piyush Sharma, Song-Chun Zhu, and Radu Soricut. Crossvqa: Scalably generating benchmarks for systematically testing vqa generalization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 2148--2166, 2021

  4. [5]

    On the cross-lingual transferability of monolingual representations

    Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 4623--4637, 2020

  5. [6]

    ObjectNet : a large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Joshua Tenenbaum, and Boris Katz. ObjectNet : a large-scale bias-controlled dataset for pushing the limits of object recognition models. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.\ 9453--9463, 2019

  6. [8]

    Big Vision

    Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Big Vision. https://github.com/google-research/big_vision, 2022

  7. [9]

    Scene text visual question answering

    Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, C.V. Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Scene text visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 4290--4300, 2019. doi:10.1109/ICCV.2019.00439

  8. [10]

    JAX: composable transformations of Python+NumPy programs, 2018

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax

  9. [11]

    Language models are few-shot learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp.\ 1877--1901, 2020

  10. [12]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3558--3568, 2021

  11. [13]

    All you may need for VQA are image captions

    Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, and Radu Soricut. All you may need for VQA are image captions. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 1947--1963, Jul 2022 a

  12. [14]

    MaXM: Towards multilingual visual question answering

    Soravit Changpinyo, Linting Xue, Idan Szpektor, Ashish V. Thapliyal, Julien Amelot, Michal Yarom, Xi Chen, and Radu Soricut. MaXM: Towards multilingual visual question answering. arXiv preprint arXiv:2209.05401, 2022b

  13. [18]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pp.\ 104--120, 2020

  14. [19]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, pp.\ 1931--1942, 2021

  15. [21]

    TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages

    Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454--470, 2020

  16. [22]

    XNLI : Evaluating cross-lingual sentence representations

    Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI : Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.\ 2475--2485, 2018

  17. [23]

    ImageNet : A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet : A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255, 2009

  18. [24]

    Long-term recurrent convolutional networks for visual recognition and description

    Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 2625--2634, 2015

  19. [25]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, 2021

  20. [26]

    GLaM : Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM : Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp.\ 5547--5569, 2022

  21. [27]

    Datasheets for datasets

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86--92, 2021

  22. [28]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org

  23. [29]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6904--6913, 2017

  24. [30]

    KAT : A knowledge augmented transformer for vision-and-language

    Liangke Gui, Borui Wang, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, and Jianfeng Gao. KAT : A knowledge augmented transformer for vision-and-language. arXiv preprint arXiv:2112.08614, 2021

  25. [31]

    VizWiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3608--3617, 2018

  26. [32]

    Captioning images taken by people who are blind

    Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In European Conference on Computer Vision, pp.\ 417--434. Springer, 2020

  27. [33]

    Flax: A neural network library and ecosystem for JAX, 2020

    Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2020. URL http://github.com/google/flax

  28. [34]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 8340--8349, 2021 a

  29. [35]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15262--15271, 2021 b

  30. [37]

    XTREME : A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation

    Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. XTREME : A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pp.\ 4411--4421, 2020

  31. [38]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 17980--17989, 2022

  32. [39]

    GPipe : efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe : efficient training of giant neural networks using pipeline parallelism. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.\ 103--112, 2019

  33. [40]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp.\ 4904--4916, 2021

  34. [41]

    A domain-specific supercomputer for training deep neural networks

    Norman P Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7):67--78, 2020

  35. [44]

    Deep visual-semantic alignments for generating image descriptions

    Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3128--3137, 2015

  36. [45]

    PreSTU : Pre-training for scene-text understanding

    Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. PreSTU : Pre-training for scene-text understanding. arXiv preprint arXiv:2209.05534, 2022

  37. [46]

    Big Transfer (BiT) : General visual representation learning

    Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT) : General visual representation learning. Lecture Notes in Computer Science, pp.\ 491--507, 2020. ISSN 1611-3349

  38. [47]

    Visual Genome : Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32--73, 2017

  39. [48]

    The Open Images dataset v4

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images dataset v4. International Journal of Computer Vision, 128(7):1956--1981, 2020

  40. [50]

    Exploring the limits of weakly supervised pretraining

    Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pp.\ 181--196, 2018

  41. [51]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pp.\ 220--229, 2019

  42. [52]

    xGQA: Cross-lingual visual question answering

    Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin Steitz, Stefan Roth, Ivan Vulić, and Iryna Gurevych. xGQA: Cross-lingual visual question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pp.\ 2497--2511, 2022

  43. [54]

    Pre-training image-language transformers for open-vocabulary tasks

    AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Pre-training image-language transformers for open-vocabulary tasks. In T4V: Transformers for Vision Workshop, Conference on Computer Vision and Pattern Recognition, 2022 a

  44. [55]

    Answer-Me : Multi-task learning for generalization to many question-answering tasks

    AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, and Anelia Angelova. Answer-Me : Multi-task learning for generalization to many question-answering tasks. arXiv preprint arXiv:2205.00949, 2022 b

  45. [56]

    Data Cards : Purposeful and transparent dataset documentation for responsible AI

    Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards : Purposeful and transparent dataset documentation for responsible AI . In FAccT '22: 2022 ACM Conference on Fairness, Accountability, and Transparency , pp.\ 1776--1826, 2022

  46. [57]

    Winner team Mia at TextVQA challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model

    Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li, Xianbiao Qi, Peng Gao, and Guotong Xie. Winner team Mia at TextVQA challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model. arXiv preprint arXiv:2106.15332, 2021

  47. [58]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp.\ 8748--8763, 2021

  48. [59]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67, 2020

  49. [60]

    Do ImageNet classifiers generalize to ImageNet ? In International Conference on Machine Learning, pp.\ 5389--5400, 2019

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet ? In International Conference on Machine Learning, pp.\ 5389--5400, 2019

  50. [61]

    Faster R-CNN : Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN : Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015

  51. [62]

    Scaling vision with sparse mixture of experts

    Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583--8595, 2021

  52. [63]

    Scaling up models and data with t5x and seqio

    Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189, 2022

  53. [64]

    A step toward more inclusive people annotations for fairness

    Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru. A step toward more inclusive people annotations for fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES '21, pp.\ 916--925, 2021. ISBN 9781450384735

  54. [65]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 8430--8439, 2019

  55. [66]

    Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp.\ 2556--2565, 2018

  56. [67]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp.\ 4596--4604, 2018

  57. [68]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  58. [69]

    TextCaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: a dataset for image captioning with reading comprehension. In European conference on computer vision, pp.\ 742--758, 2020

  59. [70]

    Towards VQA models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 8317--8326, 2019

  60. [71]

    LXMERT : Learning cross-modality encoder representations from transformers

    Hao Tan and Mohit Bansal. LXMERT : Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019

  61. [73]

    MLP-Mixer: An all-MLP architecture for vision

    Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34:24261--24272, 2021

  62. [74]

    Fixing the train-test resolution discrepancy

    Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution discrepancy. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.\ 8252--8262, 2019

  63. [75]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.\ 6000--6010, 2017

  64. [76]

    CIDEr : Consensus-based image description evaluation

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr : Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 4566--4575, 2015

  65. [77]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3156--3164, 2015

  66. [78]

    SuperGLUE : a stickier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. SuperGLUE : a stickier benchmark for general-purpose language understanding systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.\ 3266--3280, 2019 a

  67. [79]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pp.\ 10506--10518, 2019 b

  68. [80]

    GIT: A generative image-to-text transformer for vision and language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT : A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022 a

  69. [82]

    Image as a foreign language: BEiT pretraining for all vision and vision-language tasks

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022 c

  70. [83]

    SimVLM: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021

  71. [84]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp.\ 23965--23998, 2022

  72. [85]

    mT5: A massively multilingual pre-trained text-to-text transformer

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 483--498, Jun 2021

  73. [86]

    ByT5: Towards a token-free future with pre-trained byte-to-byte models

    Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291--306, 2022

  74. [87]

    TAP : Text-aware pre-training for Text-VQA and Text-Caption

    Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. TAP : Text-aware pre-training for Text-VQA and Text-Caption . In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 8751--8761, 2021

  75. [88]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa : Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022

  76. [90]

    Merlot: Multimodal neural script knowledge models

    Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634--23651, 2021

  77. [92]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 12104--12113, 2022 a

  78. [93]

    LiT : Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT : Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18123--18133, 2022 b

  79. [94]

    VinVL : Revisiting visual representations in vision-language models

    Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. VinVL : Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 5579--5588, 2021

  80. [95]

    Deep Learning

Showing first 80 references.