{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:CFJHD4PYK63OGQJLKSW2TFT7GC","short_pith_number":"pith:CFJHD4PY","schema_version":"1.0","canonical_sha256":"115271f1f857b6e3412b54ada9967f30b6a8d5075d42d20717a0ad84e9a593f3","source":{"kind":"arxiv","id":"2406.16860","version":2},"attestation_state":"computed","paper":{"title":"Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Cambrian-1 shows vision-centric design across twenty encoders plus new benchmarks produces stronger sensory grounding in multimodal LLMs.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Adithya Iyer, Ellis Brown, Jihan Yang, Manoj Middepogu, Penghao Wu, Rob Fergus, Sai Charitha Akula, Saining Xie, Sanghyun Woo, Shengbang Tong, Shusheng Yang, Xichen Pan, Yann LeCun, Ziteng Wang","submitted_at":"2024-06-24T17:59:42Z","abstract_excerpt":"We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combina"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2406.16860","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2024-06-24T17:59:42Z","cross_cats_sorted":[],"title_canon_sha256":"daabba9038aa88b85882b2e1edc2f3cac3d8d26c8aae3905f3290a834b08ea4b","abstract_canon_sha256":"e75c7a5db2e5a7c0cdc8282f2d5b275416c8812e7c4b61ef0c742e12d554aa97"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.168055Z","signature_b64":"z4qdKF6FE13eHrZK7jNlZbmfY5O9PwguW0xrP1R+W+3ep+5TvC2UOi6bS+LVMmWxtLMATzaYHEwyww5/YEFFAA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"115271f1f857b6e3412b54ada9967f30b6a8d5075d42d20717a0ad84e9a593f3","last_reissued_at":"2026-05-17T23:38:46.167487Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.167487Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Cambrian-1 shows vision-centric design across twenty encoders plus new benchmarks produces stronger sensory grounding in multimodal LLMs.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Adithya Iyer, Ellis Brown, Jihan Yang, Manoj Middepogu, Penghao Wu, Rob Fergus, Sai Charitha Akula, Saining Xie, Sanghyun Woo, Shengbang Tong, Shusheng Yang, Xichen Pan, Yann LeCun, Ziteng Wang","submitted_at":"2024-06-24T17:59:42Z","abstract_excerpt":"We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combina"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That existing MLLM benchmarks are insufficiently vision-centric and that the new CV-Bench plus SVA will provide more accurate measurement of sensory grounding without introducing their own selection or interpretation biases.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Cambrian-1 shows vision-centric design across twenty encoders plus new benchmarks produces stronger sensory grounding in multimodal LLMs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4986064182dc74d001e3d130106d41790483f5cf29d33fa2477eec188ac08c9b"},"source":{"id":"2406.16860","kind":"arxiv","version":2},"verdict":{"id":"047c6a27-dfc9-4ef1-9442-587b182a3755","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T23:59:20.630148Z","strongest_claim":"Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs.","one_line_summary":"Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That existing MLLM benchmarks are insufficiently vision-centric and that the new CV-Bench plus SVA will provide more accurate measurement of sensory grounding without introducing their own selection or interpretation biases.","pith_extraction_headline":"Cambrian-1 shows vision-centric design across twenty encoders plus new benchmarks produces stronger sensory grounding in multimodal LLMs."},"references":{"count":163,"sample":[{"doi":"","year":2019,"title":"TallyQA: Answering complex counting questions","work_id":"51a62773-14a6-483a-801b-0983a5934624","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Don’t just assume; look and answer: Overcoming priors for visual question answering","work_id":"a35e5014-61a8-4aa1-8a95-197732aebd95","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations","work_id":"adfd9a68-2d08-4eb3-a0a8-1329f2ea8839","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Llama 3 Model Card","work_id":"2d5b7141-13fd-4e75-aaca-ed55e101d11e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"arXiv preprint arXiv:2402.05128 , year=","work_id":"0ddddc2f-3b79-40a6-bead-4616eb646f62","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":163,"snapshot_sha256":"d56be8a0f9aaa8740d6297695e4b08b620d0a20116881e13ac071c99d21b7a6c","internal_anchors":22},"formal_canon":{"evidence_count":2,"snapshot_sha256":"560449231982ea7d46c61c3b86d02bd53767ee3ddd49d32b450e62e0b2621447"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2406.16860","created_at":"2026-05-17T23:38:46.167574+00:00"},{"alias_kind":"arxiv_version","alias_value":"2406.16860v2","created_at":"2026-05-17T23:38:46.167574+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2406.16860","created_at":"2026-05-17T23:38:46.167574+00:00"},{"alias_kind":"pith_short_12","alias_value":"CFJHD4PYK63O","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"CFJHD4PYK63OGQJL","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"CFJHD4PY","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":23,"internal_anchor_count":23,"sample":[{"citing_arxiv_id":"2605.18160","citing_title":"Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2408.04840","citing_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models","ref_index":245,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01955","citing_title":"How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2501.01957","citing_title":"VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2502.04326","citing_title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2512.10362","citing_title":"Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2503.12937","citing_title":"R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2410.17434","citing_title":"LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2408.13257","citing_title":"MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2412.03555","citing_title":"PaliGemma 2: A Family of Versatile VLMs for Transfer","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2409.02813","citing_title":"MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2408.12528","citing_title":"Show-o: One Single Transformer to Unify Multimodal Understanding and Generation","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2407.07726","citing_title":"PaliGemma: A versatile 3B VLM for transfer","ref_index":129,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12148","citing_title":"ViLL-E: Video LLM Embeddings for Retrieval","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2501.13106","citing_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04780","citing_title":"CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05079","citing_title":"SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2408.01800","citing_title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","ref_index":99,"is_internal_anchor":true},{"citing_arxiv_id":"2502.14786","citing_title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2408.03326","citing_title":"LLaVA-OneVision: Easy Visual Task Transfer","ref_index":133,"is_internal_anchor":true},{"citing_arxiv_id":"2504.10479","citing_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","ref_index":117,"is_internal_anchor":true},{"citing_arxiv_id":"2412.05271","citing_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","ref_index":235,"is_internal_anchor":true},{"citing_arxiv_id":"2508.18265","citing_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","ref_index":132,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/CFJHD4PYK63OGQJLKSW2TFT7GC","json":"https://pith.science/pith/CFJHD4PYK63OGQJLKSW2TFT7GC.json","graph_json":"https://pith.science/api/pith-number/CFJHD4PYK63OGQJLKSW2TFT7GC/graph.json","events_json":"https://pith.science/api/pith-number/CFJHD4PYK63OGQJLKSW2TFT7GC/events.json","paper":"https://pith.science/paper/CFJHD4PY"},"agent_actions":{"view_html":"https://pith.science/pith/CFJHD4PYK63OGQJLKSW2TFT7GC","download_json":"https://pith.science/pith/CFJHD4PYK63OGQJLKSW2TFT7GC.json","view_paper":"https://pith.science/paper/CFJHD4PY","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2406.16860&json=true","fetch_graph":"https://pith.science/api/pith-number/CFJHD4PYK63OGQJLKSW2TFT7GC/graph.json","fetch_events":"https://pith.science/api/pith-number/CFJHD4PYK63OGQJLKSW2TFT7GC/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/CFJHD4PYK63OGQJLKSW2TFT7GC/action/timestamp_anchor","attest_storage":"https://pith.science/pith/CFJHD4PYK63OGQJLKSW2TFT7GC/action/storage_attestation","attest_author":"https://pith.science/pith/CFJHD4PYK63OGQJLKSW2TFT7GC/action/author_attestation","sign_citation":"https://pith.science/pith/CFJHD4PYK63OGQJLKSW2TFT7GC/action/citation_signature","submit_replication":"https://pith.science/pith/CFJHD4PYK63OGQJLKSW2TFT7GC/action/replication_record"}},"created_at":"2026-05-17T23:38:46.167574+00:00","updated_at":"2026-05-17T23:38:46.167574+00:00"}