Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Pith reviewed 2026-05-16 23:59 UTC · model grok-4.3
The pith
Cambrian-1 shows that vision-centric design, evaluated across twenty vision encoders and paired with new benchmarks, produces stronger sensory grounding in multimodal LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cambrian-1 achieves state-of-the-art performance on multimodal tasks through a vision-centric approach: evaluating more than twenty vision encoders, introducing CV-Bench to measure visual capabilities more directly, and employing the Spatial Vision Aggregator to integrate high-resolution visual features while reducing token count. The work also details the curation of instruction-tuning data and releases all components openly as a cookbook for future MLLM development.
What carries the argument
The Spatial Vision Aggregator, a dynamic spatially-aware connector that fuses high-resolution vision features with an LLM while cutting token count.
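Because this page does not reproduce the SVA architecture in detail, the following is only a minimal PyTorch sketch of how a spatially-aware, token-reducing connector of this kind can be built, assuming cross-attention from a fixed grid of learnable queries to the concatenated patch features of several encoders. The class name SpatialAggregatorSketch, the grid size, and all dimensions are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class SpatialAggregatorSketch(nn.Module):
    """Illustrative spatially-aware connector (not the paper's exact SVA).

    A fixed G x G grid of learnable queries cross-attends to the concatenated
    patch features of several vision encoders, so the LLM receives G*G vision
    tokens regardless of how many patch tokens the encoders produce.
    """

    def __init__(self, vision_dims, llm_dim=4096, grid_size=24, num_heads=8):
        super().__init__()
        # One linear projection per encoder, mapping into the LLM embedding space.
        self.projections = nn.ModuleList(
            [nn.Linear(d, llm_dim) for d in vision_dims]
        )
        # Learnable query tokens arranged on a G x G spatial grid.
        self.queries = nn.Parameter(torch.randn(grid_size * grid_size, llm_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, feature_maps):
        """feature_maps: list of (batch, N_i, D_i) patch features, one per encoder."""
        projected = [proj(f) for proj, f in zip(self.projections, feature_maps)]
        keys = torch.cat(projected, dim=1)                # (batch, sum_i N_i, llm_dim)
        queries = self.queries.unsqueeze(0).expand(keys.shape[0], -1, -1)
        fused, _ = self.cross_attn(queries, keys, keys)   # (batch, G*G, llm_dim)
        return self.norm(fused + queries)                 # tokens handed to the LLM
```

Whatever the exact construction, the property this sketch captures is that the LLM-side token count is fixed by the query grid rather than by the sum of encoder patch counts; the dynamic and spatially-aware aspects the paper describes are not modeled here.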
If this is right
- Balanced selection of public visual instruction data improves model performance without new private datasets.
- Hybrid vision encoders outperform single-paradigm encoders when paired with the same LLM backbone.
- CV-Bench scores correlate more closely with real-world visual grounding than earlier multimodal suites.
- Releasing full weights, code, and tuning recipes allows direct reproduction and extension by other groups.
Where Pith is reading between the lines
- The same encoder-comparison method could be applied to test whether newer self-supervised vision models close the remaining gap to supervised ones.
- SVA-style connectors might be adapted to reduce token usage in other high-resolution multimodal pipelines beyond instruction tuning.
- Widespread adoption of the open cookbook could shift research focus from scaling language models to systematic visual-representation choices.
Load-bearing premise
Current MLLM benchmarks do not capture visual grounding accurately enough, so new tests like CV-Bench will give a truer picture without adding their own selection biases.
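One concrete way to probe that premise is a blind ablation: score a benchmark once with the real images and once with the image withheld, so that whatever accuracy survives without vision estimates how much the benchmark can be answered from language priors alone. The sketch below is a minimal version of that protocol; the answer_fn(question, image) interface and the exact-match scoring are hypothetical stand-ins, not CV-Bench's actual evaluation code.

```python
from typing import Callable, Iterable, Optional, Tuple


def blind_ablation_accuracy(
    answer_fn: Callable[[str, Optional[object]], str],
    examples: Iterable[Tuple[str, object, str]],
) -> Tuple[float, float]:
    """Compare accuracy with and without visual input.

    answer_fn(question, image) returns a predicted answer string; calling it
    with image=None yields a language-prior-only baseline. Returns
    (accuracy_with_image, accuracy_without_image); a small gap between the two
    suggests the benchmark is largely solvable without looking at the image.
    """
    with_img = without_img = total = 0
    for question, image, label in examples:
        total += 1
        if answer_fn(question, image).strip().lower() == label.strip().lower():
            with_img += 1
        if answer_fn(question, None).strip().lower() == label.strip().lower():
            without_img += 1
    if total == 0:
        raise ValueError("no examples provided")
    return with_img / total, without_img / total
```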
What would settle it
An experiment in which Cambrian-1 models underperform prior MLLMs on a held-out set of real-world tasks that demand fine-grained visual discrimination would refute the central claim.
Original abstract
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, address the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Cambrian-1, a family of vision-centric multimodal LLMs. It reports experiments with over 20 vision encoders (self-supervised, supervised, and combinations), proposes the Spatial Vision Aggregator (SVA) as a dynamic connector for high-resolution features, introduces CV-Bench as a new vision-centric benchmark to address limitations in existing MLLM evaluations, details curation and balancing of visual instruction-tuning data, and claims state-of-the-art performance while releasing models, code, datasets, and recipes as an open cookbook.
Significance. If the empirical claims hold under rigorous validation, the work is significant for systematically exploring under-studied vision components in MLLMs and for releasing a comprehensive open resource that could accelerate research on visual grounding and representation learning.
major comments (2)
- [CV-Bench] CV-Bench section: The claim that CV-Bench delivers more accurate measurement of sensory grounding than prior benchmarks rests on reduced interpretation biases, yet the manuscript provides no ablations on task selection criteria, inter-rater reliability, or explicit controls for language-prior leakage; without these, the benchmark's superiority for vision-centric evaluation remains unverified and load-bearing for the paper's central thesis.
- [Method and Experiments] SVA and vision-encoder experiments: The reported gains from combining encoders and using SVA for token-efficient integration lack detailed ablations on the free parameters (encoder selection, balancing ratios, SVA hyperparameters) and do not include statistical significance or error bars, which are required to substantiate the SOTA performance claims over baselines.
minor comments (2)
- [Experiments] Evaluation protocols: Add explicit details on data splits, exact scoring procedures, and statistical tests for all reported metrics to allow reproduction and assessment of robustness.
- [Figures and Notation] Notation and figures: Clarify the exact token-reduction formula for SVA and ensure all figures include axis labels, legends, and confidence intervals where applicable; an illustrative form of such a reduction relation is sketched below.
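As an illustration of the kind of relation the second minor comment asks for, written in an assumed form rather than taken from the paper: suppose the connector aggregates the patch features of D encoders, encoder i producing an H_i x W_i grid, into a fixed G x G grid of query tokens.

```latex
% Assumed illustrative form of an SVA-style token reduction (not the paper's formula):
% D encoders, encoder i yielding an H_i x W_i patch grid, aggregated into a
% G x G grid of learnable query tokens.
\[
  N_{\text{in}} = \sum_{i=1}^{D} H_i W_i,
  \qquad
  N_{\text{out}} = G^2,
  \qquad
  \text{reduction ratio} = \frac{\sum_{i=1}^{D} H_i W_i}{G^2}.
\]
```

For instance, fusing 24x24 and 32x32 patch grids onto a 24x24 query grid would pass 576 rather than 1600 vision tokens to the LLM, roughly a 2.8x reduction.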
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully considered each major comment and outline our responses below, including planned revisions to address the concerns raised while preserving the core contributions of Cambrian-1.
Point-by-point responses
- Referee: [CV-Bench] CV-Bench section: The claim that CV-Bench delivers more accurate measurement of sensory grounding than prior benchmarks rests on reduced interpretation biases, yet the manuscript provides no ablations on task selection criteria, inter-rater reliability, or explicit controls for language-prior leakage; without these, the benchmark's superiority for vision-centric evaluation remains unverified and load-bearing for the paper's central thesis.
  Authors: We appreciate the referee's emphasis on rigorous validation for CV-Bench. While the benchmark was designed with tasks that prioritize direct visual perception (e.g., spatial relations and object attributes) to reduce reliance on language priors compared to existing MLLM benchmarks, we acknowledge that explicit documentation of these design choices is needed. In the revised manuscript, we will expand the CV-Bench section to detail the task selection criteria, report inter-rater reliability scores from the annotation process, and include controls for language-prior leakage such as ablation studies comparing model performance with and without visual inputs. These additions will better substantiate the benchmark's utility for vision-centric evaluation. revision: yes
- Referee: [Method and Experiments] SVA and vision-encoder experiments: The reported gains from combining encoders and using SVA for token-efficient integration lack detailed ablations on the free parameters (encoder selection, balancing ratios, SVA hyperparameters) and do not include statistical significance or error bars, which are required to substantiate the SOTA performance claims over baselines.
  Authors: We thank the referee for underscoring the importance of comprehensive ablations and statistical rigor to support our empirical claims. Our experiments systematically evaluated over 20 vision encoders and their combinations, with SVA hyperparameters selected via validation performance and balancing ratios informed by data distribution analysis. To strengthen this, the revised version will incorporate additional ablation studies on encoder selection, data balancing ratios, and SVA hyperparameters in the main text and appendix. We will also report error bars from multiple random seeds and include statistical significance tests (e.g., paired t-tests; a minimal sketch of such a test follows these responses) for key comparisons against baselines to more robustly substantiate the performance gains. revision: yes
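As a hedged illustration of the paired test promised above, the sketch below compares two models' per-seed scores on one benchmark with scipy.stats.ttest_rel; the seed count and score values are invented for the example, not results from the paper.

```python
import numpy as np
from scipy import stats


def paired_seed_comparison(scores_a, scores_b, alpha=0.05):
    """Paired t-test over matched random seeds for two models on one benchmark.

    scores_a[i] and scores_b[i] are the scores of models A and B evaluated with
    the same seed i. Returns the mean difference, its standard error, the
    two-sided p-value, and whether it clears the chosen alpha.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diff = a - b
    t_stat, p_value = stats.ttest_rel(a, b)
    return {
        "mean_diff": diff.mean(),
        "stderr": diff.std(ddof=1) / np.sqrt(len(diff)),
        "t_stat": t_stat,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


# Invented example: five matched seeds per model.
print(paired_seed_comparison([71.2, 70.8, 71.5, 70.9, 71.1],
                             [69.9, 70.1, 70.4, 69.7, 70.2]))
```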
Circularity Check
No significant circularity: empirical contributions and new benchmarks stand independently
full rationale
The paper's central claims rest on new empirical evaluations across over 20 vision encoders, architectural proposals such as SVA, curation of instruction-tuning data with explicit balancing choices, and the introduction of CV-Bench to address benchmark limitations. None of these elements reduces, via the paper's own definitions, to previously fitted parameters, self-referential derivations, or load-bearing self-citations. No step matches the enumerated circularity patterns; the work is grounded in external benchmarks and remains falsifiable through reported model rankings and ablation-style experiments on vision components.
Axiom & Free-Parameter Ledger
free parameters (3)
- Vision encoder selection and combination
- Data source balancing ratios
- SVA architectural hyperparameters
axioms (2)
- domain assumption: Visual instruction tuning serves as a reliable interface to evaluate different visual representations
- domain assumption: Existing MLLM benchmarks have difficulties in consolidation and interpretation that a new vision-centric benchmark can address
invented entities (1)
- Spatial Vision Aggregator (SVA): no independent evidence
Forward citations
Cited by 19 Pith papers
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
  MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
  MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
- SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
  SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
  R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
  WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
- ViLL-E: Video LLM Embeddings for Retrieval
  ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
- CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
  CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
- Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
  Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
  LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
- LLaVA-OneVision: Easy Visual Task Transfer
  LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
  SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
- PaliGemma 2: A Family of Versatile VLMs for Transfer
  PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...
- PaliGemma: A versatile 3B VLM for transfer
  PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.