{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:W4TPLI53LVIHYZ6UCVBRL3443S","short_pith_number":"pith:W4TPLI53","schema_version":"1.0","canonical_sha256":"b726f5a3bb5d507c67d4154315ef9cdc823a705c1f6de9de4a167c79eed8008a","source":{"kind":"arxiv","id":"2305.18565","version":1},"attestation_state":"computed","paper":{"title":"PaLI-X: On Scaling up a Multilingual Vision and Language Model","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Scaling up PaLI-X sets new state-of-the-art on most vision and language benchmarks and shows emergent capabilities.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"AJ Piergiovanni, Alexander Kolesnikov, Andreas Peter Steiner, Anelia Angelova, Anurag Arnab, Arsha Nagrani, Austin Waters, Basil Mustafa, Bo Pang, Carlos Riquelme Ruiz, Ceslee Montgomery, Daniel Keysers, Daniel Salz, Filip Pavetic, Gang Li, Hexiang Hu, Ibrahim Alabdulmohsin, Jialin Wu, Josip Djolonga, Julien Amelot, Kenton Lee, Keran Rong, Lucas Beyer, Mandar Joshi, Mario Lucic, Marvin Ritter, Matthias Minderer, Michael Tschannen, Mojtaba Seyedhosseini, Mostafa Dehghani, Neil Houlsby, Paulina Pietrzyk, Piotr Padlewski, Radu Soricut, Sebastian Goodman, Siamak Shakeri, Soravit Changpinyo, Xiaohua Zhai, Xiao Wang, Xi Chen, Yang Li, Yi Tay, Yuanzhong Xu","submitted_at":"2023-05-29T18:58:38Z","abstract_excerpt":"We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of th"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2305.18565","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2023-05-29T18:58:38Z","cross_cats_sorted":["cs.CL","cs.LG"],"title_canon_sha256":"3c7150397808e13ff72bab5cc440587de27d0854e43b1fc23a199ea026595a24","abstract_canon_sha256":"94e18e05584083e5b1cbbe254aba4a9c4cbb4c6b8dc63d01ec220ef24117dc7a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.815083Z","signature_b64":"yHvzZQO8iLx5fyEV/8vyH13T4gwrKkV2Mm9wudT5kVc/svW27RbF5XJ+OfPzxUkeYle0vawi65nhE+UWYmQ2Bw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"b726f5a3bb5d507c67d4154315ef9cdc823a705c1f6de9de4a167c79eed8008a","last_reissued_at":"2026-05-17T23:38:13.814348Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.814348Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"PaLI-X: On Scaling up a Multilingual Vision and Language Model","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Scaling up PaLI-X sets new state-of-the-art on most vision and language benchmarks and shows emergent capabilities.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"AJ Piergiovanni, Alexander Kolesnikov, Andreas Peter Steiner, Anelia Angelova, Anurag Arnab, Arsha Nagrani, Austin Waters, Basil Mustafa, Bo Pang, Carlos Riquelme Ruiz, Ceslee Montgomery, Daniel Keysers, Daniel Salz, Filip Pavetic, Gang Li, Hexiang Hu, Ibrahim Alabdulmohsin, Jialin Wu, Josip Djolonga, Julien Amelot, Kenton Lee, Keran Rong, Lucas Beyer, Mandar Joshi, Mario Lucic, Marvin Ritter, Matthias Minderer, Michael Tschannen, Mojtaba Seyedhosseini, Mostafa Dehghani, Neil Houlsby, Paulina Pietrzyk, Piotr Padlewski, Radu Soricut, Sebastian Goodman, Siamak Shakeri, Soravit Changpinyo, Xiaohua Zhai, Xiao Wang, Xi Chen, Yang Li, Yi Tay, Yuanzhong Xu","submitted_at":"2023-05-29T18:58:38Z","abstract_excerpt":"We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of th"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them) and exhibits emerging capabilities such as complex counting and multilingual object detection.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That increasing model size and broadening the training task mixture will reliably produce both higher benchmark scores and the observed emergent behaviors without requiring task-specific fine-tuning or additional architectural changes.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Scaling up PaLI-X sets new state-of-the-art on most vision and language benchmarks and shows emergent capabilities.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9ad1cfef05ed1adf2e3fdf89e586509cc9603f986e096d420ec74015657cd5e6"},"source":{"id":"2305.18565","kind":"arxiv","version":1},"verdict":{"id":"264574c8-6565-47c7-8d01-927db57ff00a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T14:29:04.937460Z","strongest_claim":"PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them) and exhibits emerging capabilities such as complex counting and multilingual object detection.","one_line_summary":"Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That increasing model size and broadening the training task mixture will reliably produce both higher benchmark scores and the observed emergent behaviors without requiring task-specific fine-tuning or additional architectural changes.","pith_extraction_headline":"Scaling up PaLI-X sets new state-of-the-art on most vision and language benchmarks and shows emergent capabilities."},"references":{"count":99,"sample":[{"doi":"","year":2022,"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","ref_index":1,"cited_arxiv_id":"2204.02311","is_internal_anchor":true},{"doi":"","year":2020,"title":"Tom B. Brown, Benjamin Mann, Nick Ryder, Jared Kaplan Melanie Subbiah, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretche","work_id":"d5c09f2f-dcd2-437f-b8b9-cff70228c5b0","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"GLaM: Efficient scaling of language models with mixture-of-experts","work_id":"cc30ef45-7346-488a-9685-f77b6fa80ccc","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H","work_id":"2afad9fb-23ec-490f-bd19-fc208a680557","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"PaLI: A jointly-scaled multilingual language-image model","work_id":"8a12e27d-4f19-46bb-babd-b05a4c9e0fc8","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":99,"snapshot_sha256":"7f3c3e3bca5182bf876fd56ada3cb7974d5f89fa5a79eed1dfd9b8021c625fb3","internal_anchors":4},"formal_canon":{"evidence_count":2,"snapshot_sha256":"9114177088637e7dacb004206e60967cf021ef82e55e271b66073aa58b9c86c5"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2305.18565","created_at":"2026-05-17T23:38:13.814469+00:00"},{"alias_kind":"arxiv_version","alias_value":"2305.18565v1","created_at":"2026-05-17T23:38:13.814469+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2305.18565","created_at":"2026-05-17T23:38:13.814469+00:00"},{"alias_kind":"pith_short_12","alias_value":"W4TPLI53LVIH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"W4TPLI53LVIHYZ6U","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"W4TPLI53","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":23,"internal_anchor_count":23,"sample":[{"citing_arxiv_id":"2312.11805","citing_title":"Gemini: A Family of Highly Capable Multimodal Models","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2406.05410","citing_title":"ChatSR: Multimodal Large Language Models for Scientific Formula Discovery","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21919","citing_title":"SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16409","citing_title":"Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22123","citing_title":"Multilingual Vision-Language Models, A Survey","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2505.03233","citing_title":"GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01925","citing_title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","ref_index":270,"is_internal_anchor":true},{"citing_arxiv_id":"2507.15493","citing_title":"GR-3 Technical Report","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2403.01823","citing_title":"RT-H: Action Hierarchies Using Language","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2311.01378","citing_title":"Vision-Language Foundation Models as Effective Robot Imitators","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2403.09611","citing_title":"MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2310.06114","citing_title":"Learning Interactive Real-World Simulators","ref_index":244,"is_internal_anchor":true},{"citing_arxiv_id":"2309.14525","citing_title":"Aligning Large Multimodal Models with Factually Augmented RLHF","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2410.17247","citing_title":"PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2311.16502","citing_title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2305.03726","citing_title":"Otter: A Multi-Modal Model with In-Context Instruction Tuning","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2312.14238","citing_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2310.03744","citing_title":"Improved Baselines with Visual Instruction Tuning","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2411.19650","citing_title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21326","citing_title":"MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2407.07726","citing_title":"PaliGemma: A versatile 3B VLM for transfer","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17876","citing_title":"OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19902","citing_title":"MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings","ref_index":2,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/W4TPLI53LVIHYZ6UCVBRL3443S","json":"https://pith.science/pith/W4TPLI53LVIHYZ6UCVBRL3443S.json","graph_json":"https://pith.science/api/pith-number/W4TPLI53LVIHYZ6UCVBRL3443S/graph.json","events_json":"https://pith.science/api/pith-number/W4TPLI53LVIHYZ6UCVBRL3443S/events.json","paper":"https://pith.science/paper/W4TPLI53"},"agent_actions":{"view_html":"https://pith.science/pith/W4TPLI53LVIHYZ6UCVBRL3443S","download_json":"https://pith.science/pith/W4TPLI53LVIHYZ6UCVBRL3443S.json","view_paper":"https://pith.science/paper/W4TPLI53","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2305.18565&json=true","fetch_graph":"https://pith.science/api/pith-number/W4TPLI53LVIHYZ6UCVBRL3443S/graph.json","fetch_events":"https://pith.science/api/pith-number/W4TPLI53LVIHYZ6UCVBRL3443S/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/W4TPLI53LVIHYZ6UCVBRL3443S/action/timestamp_anchor","attest_storage":"https://pith.science/pith/W4TPLI53LVIHYZ6UCVBRL3443S/action/storage_attestation","attest_author":"https://pith.science/pith/W4TPLI53LVIHYZ6UCVBRL3443S/action/author_attestation","sign_citation":"https://pith.science/pith/W4TPLI53LVIHYZ6UCVBRL3443S/action/citation_signature","submit_replication":"https://pith.science/pith/W4TPLI53LVIHYZ6UCVBRL3443S/action/replication_record"}},"created_at":"2026-05-17T23:38:13.814469+00:00","updated_at":"2026-05-17T23:38:13.814469+00:00"}