{"paper":{"title":"PaLI: A Jointly-Scaled Multilingual Language-Image Model","license":"http://creativecommons.org/licenses/by/4.0/","headline":"PaLI jointly scales a 4-billion-parameter vision transformer with a language model on a 10B multilingual image-text set to reach state-of-the-art on captioning, VQA and scene-text tasks.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Adam Grycner, AJ Piergiovanni, Alexander Kolesnikov, Andreas Steiner, Anelia Angelova, Ashish Thapliyal, Basil Mustafa, Burcu Karagol Ayan, Carlos Riquelme, Chao Jia, Daniel Salz, Gaurav Mishra, Hassan Akbari, James Bradbury, Joan Puigcerver, Keran Rong, Linting Xue, Lucas Beyer, Mojtaba Seyedhosseini, Nan Ding, Neil Houlsby, Piotr Padlewski, Radu Soricut, Sebastian Goodman, Soravit Changpinyo, Weicheng Kuo, Xiaohua Zhai, Xiao Wang, Xi Chen","submitted_at":"2022-09-14T17:24:07Z","abstract_excerpt":"Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of tra"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That joint scaling of the vision and language components on the new 10B multilingual dataset will produce the claimed performance gains without major issues from data quality, language imbalance, or overfitting.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"PaLI jointly scales a 4-billion-parameter vision transformer with a language model on a 10B multilingual image-text set to reach state-of-the-art on captioning, VQA and scene-text tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"403798ecb94f7cc80e113e1a6d067177d84ed9dceb0ac53306ffc5d55b618eae"},"source":{"id":"2209.06794","kind":"arxiv","version":4},"verdict":{"id":"4e01e5e3-5592-48cb-ad4d-82aa25360fc7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T09:25:21.335193Z","strongest_claim":"PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.","one_line_summary":"PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That joint scaling of the vision and language components on the new 10B multilingual dataset will produce the claimed performance gains without major issues from data quality, language imbalance, or overfitting.","pith_extraction_headline":"PaLI jointly scales a 4-billion-parameter vision transformer with a language model on a 10B multilingual image-text set to reach state-of-the-art on captioning, VQA and scene-text tasks."},"references":{"count":185,"sample":[{"doi":"","year":2019,"title":"Tallyqa: Answering complex counting questions","work_id":"70f8998a-d9b0-40d5-b598-a9c726ba4c8e","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"nocaps : Novel object captioning at scale","work_id":"ec4e26ac-4d58-40fc-ac24-6dfaecc71822","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Crossvqa: Scalably generating benchmarks for systematically testing vqa generalization","work_id":"11d18ed5-36cb-4baa-9f46-6fa140666ecd","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"On the cross-lingual transferability of monolingual representations","work_id":"0b1b73b4-170f-4bcb-83fd-61f3cab7db23","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"ObjectNet : a large-scale bias-controlled dataset for pushing the limits of object recognition models","work_id":"a9a36037-ee68-4859-ba58-22f9e38dd293","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":185,"snapshot_sha256":"2e7f278ed328831f0c03dfa4cc8f32ab6bcf58a31ec615c3c377584f629855eb","internal_anchors":12},"formal_canon":{"evidence_count":1,"snapshot_sha256":"5473604ba031f2fd17202c1d2e5c4f9638d8cfacf0abbb3df412dc5f84ef3b64"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}