{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:DULKBTOBJA4XIQFVFZQSJCUWTM","short_pith_number":"pith:DULKBTOB","schema_version":"1.0","canonical_sha256":"1d16a0cdc148397440b52e61248a969b15fac2d9e1570c0a7906224e956eda27","source":{"kind":"arxiv","id":"2412.03555","version":1},"attestation_state":"computed","paper":{"title":"PaliGemma 2: A Family of Versatile VLMs for Transfer","license":"http://creativecommons.org/licenses/by/4.0/","headline":"PaliGemma 2 pairs Gemma 2 language models with SigLIP encoders and trains them at multiple resolutions to achieve strong transfer on OCR and captioning tasks.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Alexey Gritsenko, Andreas Steiner, Andr\\'e Susano Pinto, Anthony Sherbondy, Daniel Keysers, Emanuele Bugliarello, Ibrahim Alabdulmohsin, Lucas Beyer, Matthias Minderer, Michael Tschannen, Reeve Ingle, Sahar Kazemzadeh, Shangbang Long, Siyang Qin, Thomas Mesnard, Xiaohua Zhai, Xiao Wang, Yonatan Bitton","submitted_at":"2024-12-04T18:50:42Z","abstract_excerpt":"PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. 
The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as le"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2412.03555","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2024-12-04T18:50:42Z","cross_cats_sorted":[],"title_canon_sha256":"db490a13d82e857cc0961af6be15e96b9c4e49e1742d092153959eaaaf28eacf","abstract_canon_sha256":"501a62ef5aaa69b608bba60f579b7b53b44da05a7c37f19dc796242d5c2856e2"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.926803Z","signature_b64":"77mg6oFjm6mdmV3hLtwRRCm4wMxvMaevSiWtKXBOSOh58nlis6Ra8kOlaWhLtv8/4XK3joK1YrRpgW+q3s+LBg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"1d16a0cdc148397440b52e61248a969b15fac2d9e1570c0a7906224e956eda27","last_reissued_at":"2026-05-17T23:38:52.926181Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.926181Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"PaliGemma 2: A Family of Versatile VLMs for Transfer","license":"http://creativecommons.org/licenses/by/4.0/","headline":"PaliGemma 2 pairs Gemma 2 language models with SigLIP encoders and trains them at multiple resolutions to achieve strong transfer on OCR and captioning tasks.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Alexey Gritsenko, Andreas Steiner, Andr\\'e Susano Pinto, Anthony Sherbondy, Daniel 
Keysers, Emanuele Bugliarello, Ibrahim Alabdulmohsin, Lucas Beyer, Matthias Minderer, Michael Tschannen, Reeve Ingle, Sahar Kazemzadeh, Shangbang Long, Siyang Qin, Thomas Mesnard, Xiaohua Zhai, Xiao Wang, Yonatan Bitton","submitted_at":"2024-12-04T18:50:42Z","abstract_excerpt":"PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as le"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"PaliGemma 2 obtains state-of-the-art results on different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That multi-stage training at multiple resolutions equips the models with broad transferable knowledge; the abstract provides no controlled ablations or details on how this is verified versus simpler baselines.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at various sizes and 
resolutions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"PaliGemma 2 pairs Gemma 2 language models with SigLIP encoders and trains them at multiple resolutions to achieve strong transfer on OCR and captioning tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4755c3abfb93bd345343458493b5a814eaee1c0bbc36412813d07e56ec8000bd"},"source":{"id":"2412.03555","kind":"arxiv","version":1},"verdict":{"id":"8151d1fc-65c0-4075-858e-030a31b7c61b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T09:10:03.840284Z","strongest_claim":"PaliGemma 2 obtains state-of-the-art results on different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation.","one_line_summary":"PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at various sizes and resolutions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That multi-stage training at multiple resolutions equips the models with broad transferable knowledge; the abstract provides no controlled ablations or details on how this is verified versus simpler baselines.","pith_extraction_headline":"PaliGemma 2 pairs Gemma 2 language models with SigLIP encoders and trains them at multiple resolutions to achieve strong transfer on OCR and captioning tasks."},"references":{"count":113,"sample":[{"doi":"","year":2019,"title":"M. Acharya, K. Kafle, and C. Kanan. Tal- lyQA: Answering complex counting ques- tions. 
In AAAI, 2019","work_id":"abd3cdf7-6b70-45ad-af36-e39c3e976935","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson. NoCaps: Novel object captioning at scale. In ICCV, 2019","work_id":"9c92bc55-33f1-436d-be7d-c34462bead7f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"I. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer. Getting vit in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023","work_id":"032c297a-aead-419b-a05e-2dc732acd9b1","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick","work_id":"dcd6eeb3-2800-427a-a51b-14d521d049e0","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and 
Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":5,"cited_arxiv_id":"2308.12966","is_internal_anchor":true}],"resolved_work":113,"snapshot_sha256":"a57dd2d510ec25ed2c6898eee33ffd97403e1f2d854e763515bd84be3e268509","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"74b3134658523acba323f9b21695676acc3743d9fdef6a89637a3ea3354a1607"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2412.03555","created_at":"2026-05-17T23:38:52.926297+00:00"},{"alias_kind":"arxiv_version","alias_value":"2412.03555v1","created_at":"2026-05-17T23:38:52.926297+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2412.03555","created_at":"2026-05-17T23:38:52.926297+00:00"},{"alias_kind":"pith_short_12","alias_value":"DULKBTOBJA4X","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"DULKBTOBJA4XIQFV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"DULKBTOB","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2505.07813","citing_title":"DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24681","citing_title":"Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22255","citing_title":"Direct content-based retrieval from music scores images","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08167","citing_title":"Self-Supervised Bootstrapping of Action-Predictive Embodied 
Reasoning","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2510.08278","citing_title":"A Multimodal Depth-Aware Method For Embodied Reference Understanding","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2511.17411","citing_title":"SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14125","citing_title":"HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09948","citing_title":"LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2501.15830","citing_title":"SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10761","citing_title":"RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24681","citing_title":"Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01191","citing_title":"Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01911","citing_title":"SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00078","citing_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21061","citing_title":"InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural 
Language","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19105","citing_title":"EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10787","citing_title":"When Meaning Isn't Literal: Exploring Idiomatic Meaning Across Languages and Modalities","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12966","citing_title":"Boosting Visual Instruction Tuning with Self-Supervised Guidance","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11490","citing_title":"Anthropogenic Regional Adaptation in Multimodal Vision-Language Model","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10432","citing_title":"AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09330","citing_title":"VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07593","citing_title":"TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2502.14786","citing_title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14125","citing_title":"HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17800","citing_title":"ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided 
Fine-Tuning","ref_index":32,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/DULKBTOBJA4XIQFVFZQSJCUWTM","json":"https://pith.science/pith/DULKBTOBJA4XIQFVFZQSJCUWTM.json","graph_json":"https://pith.science/api/pith-number/DULKBTOBJA4XIQFVFZQSJCUWTM/graph.json","events_json":"https://pith.science/api/pith-number/DULKBTOBJA4XIQFVFZQSJCUWTM/events.json","paper":"https://pith.science/paper/DULKBTOB"},"agent_actions":{"view_html":"https://pith.science/pith/DULKBTOBJA4XIQFVFZQSJCUWTM","download_json":"https://pith.science/pith/DULKBTOBJA4XIQFVFZQSJCUWTM.json","view_paper":"https://pith.science/paper/DULKBTOB","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2412.03555&json=true","fetch_graph":"https://pith.science/api/pith-number/DULKBTOBJA4XIQFVFZQSJCUWTM/graph.json","fetch_events":"https://pith.science/api/pith-number/DULKBTOBJA4XIQFVFZQSJCUWTM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/DULKBTOBJA4XIQFVFZQSJCUWTM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/DULKBTOBJA4XIQFVFZQSJCUWTM/action/storage_attestation","attest_author":"https://pith.science/pith/DULKBTOBJA4XIQFVFZQSJCUWTM/action/author_attestation","sign_citation":"https://pith.science/pith/DULKBTOBJA4XIQFVFZQSJCUWTM/action/citation_signature","submit_replication":"https://pith.science/pith/DULKBTOBJA4XIQFVFZQSJCUWTM/action/replication_record"}},"created_at":"2026-05-17T23:38:52.926297+00:00","updated_at":"2026-05-17T23:38:52.926297+00:00"}