{"paper":{"title":"Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Cambrian-1 shows vision-centric design across twenty encoders plus new benchmarks produces stronger sensory grounding in multimodal LLMs.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Adithya Iyer, Ellis Brown, Jihan Yang, Manoj Middepogu, Penghao Wu, Rob Fergus, Sai Charitha Akula, Saining Xie, Sanghyun Woo, Shengbang Tong, Shusheng Yang, Xichen Pan, Yann LeCun, Ziteng Wang","submitted_at":"2024-06-24T17:59:42Z","abstract_excerpt":"We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combina"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That existing MLLM benchmarks are insufficiently vision-centric and that the new CV-Bench plus SVA will provide more accurate measurement of sensory grounding without introducing their own selection or interpretation biases.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Cambrian-1 shows vision-centric design across twenty encoders plus new benchmarks produces stronger sensory grounding in multimodal LLMs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4986064182dc74d001e3d130106d41790483f5cf29d33fa2477eec188ac08c9b"},"source":{"id":"2406.16860","kind":"arxiv","version":2},"verdict":{"id":"047c6a27-dfc9-4ef1-9442-587b182a3755","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T23:59:20.630148Z","strongest_claim":"Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs.","one_line_summary":"Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That existing MLLM benchmarks are insufficiently vision-centric and that the new CV-Bench plus SVA will provide more accurate measurement of sensory grounding without introducing their own selection or interpretation biases.","pith_extraction_headline":"Cambrian-1 shows vision-centric design across twenty encoders plus new benchmarks produces stronger sensory grounding in multimodal LLMs."},"references":{"count":163,"sample":[{"doi":"","year":2019,"title":"TallyQA: Answering complex counting questions","work_id":"51a62773-14a6-483a-801b-0983a5934624","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Don’t just assume; look and answer: Overcoming priors for visual question answering","work_id":"a35e5014-61a8-4aa1-8a95-197732aebd95","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations","work_id":"adfd9a68-2d08-4eb3-a0a8-1329f2ea8839","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Llama 3 Model Card","work_id":"2d5b7141-13fd-4e75-aaca-ed55e101d11e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"arXiv preprint arXiv:2402.05128 , year=","work_id":"0ddddc2f-3b79-40a6-bead-4616eb646f62","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":163,"snapshot_sha256":"d56be8a0f9aaa8740d6297695e4b08b620d0a20116881e13ac071c99d21b7a6c","internal_anchors":22},"formal_canon":{"evidence_count":2,"snapshot_sha256":"560449231982ea7d46c61c3b86d02bd53767ee3ddd49d32b450e62e0b2621447"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}