pith. machine review for the scientific record.

arxiv: 2406.16860 · v2 · submitted 2024-06-24 · 💻 cs.CV

Recognition: 2 theorem links

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal LLMs · vision-centric design · visual instruction tuning · CV-Bench · Spatial Vision Aggregator · vision encoders · open MLLM recipes

The pith

Cambrian-1 shows that vision-centric design, tested across more than twenty encoders and paired with new benchmarks, produces stronger sensory grounding in multimodal LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cambrian-1 introduces a family of multimodal LLMs built by prioritizing vision components over language scaling alone. The work treats visual instruction tuning as a testbed for more than twenty vision encoders drawn from self-supervised, supervised, and hybrid training regimes. Existing benchmarks are examined for weak visual coverage, prompting the creation of CV-Bench to measure grounding more directly. A Spatial Vision Aggregator is added to fuse high-resolution features without excessive tokens, while data curation rules emphasize source balance. The models reach state-of-the-art results, and every component is released as an open recipe for future instruction-tuned MLLMs.

Core claim

Cambrian-1 achieves state-of-the-art performance on multimodal tasks by using a vision-centric approach that includes evaluating multiple vision encoders, introducing the CV-Bench for better measurement of visual capabilities, and employing the Spatial Vision Aggregator to integrate features spatially. The work also details the curation of instruction-tuning data and releases all components openly as a cookbook for future MLLM development.

What carries the argument

The Spatial Vision Aggregator, a dynamic spatially-aware connector that fuses high-resolution vision features with an LLM while cutting token count.
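The connector's token-reduction idea can be made concrete with a rough sketch. This is a hypothetical illustration, not the paper's SVA (which uses learned, dynamic cross-attention): a high-resolution grid of vision features is aggregated into a coarser grid of spatially-anchored positions, so far fewer tokens reach the LLM. Here the "aggregation" is plain average pooling, the simplest stand-in.

```python
# Hypothetical sketch of SVA-style spatial aggregation (not the paper's code):
# a grid x grid map of high-resolution feature vectors is pooled into a coarser
# out_grid x out_grid map, cutting the number of tokens passed to the LLM.

def spatial_aggregate(features, grid, out_grid):
    """Average-pool a grid*grid list of feature vectors (row-major order)
    down to out_grid*out_grid vectors."""
    assert grid % out_grid == 0, "sketch assumes an integer pooling stride"
    stride = grid // out_grid
    dim = len(features[0])
    out = []
    for qr in range(out_grid):
        for qc in range(out_grid):
            # Each output position aggregates (here: uniformly averages) its
            # local stride x stride window of high-resolution features; the
            # real SVA instead attends over such windows with learned queries.
            acc = [0.0] * dim
            for r in range(qr * stride, (qr + 1) * stride):
                for c in range(qc * stride, (qc + 1) * stride):
                    acc = [a + x for a, x in zip(acc, features[r * grid + c])]
            n = stride * stride
            out.append([a / n for a in acc])
    return out

# 24x24 = 576 high-res tokens reduced to 8x8 = 64 tokens (a 9x reduction).
feats = [[float(i % 7)] * 4 for i in range(24 * 24)]
pooled = spatial_aggregate(feats, grid=24, out_grid=8)
print(len(pooled))  # 64
```

The grid sizes above are illustrative, not the paper's configuration; the point is only that spatial aggregation preserves a 2D layout while shrinking token count quadratically.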

If this is right

  • Balanced selection of public visual instruction data improves model performance without new private datasets.
  • Hybrid vision encoders outperform single-paradigm encoders when paired with the same LLM backbone.
  • CV-Bench scores correlate more closely with real-world visual grounding than earlier multimodal suites.
  • Releasing full weights, code, and tuning recipes allows direct reproduction and extension by other groups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same encoder-comparison method could be applied to test whether newer self-supervised vision models close the remaining gap to supervised ones.
  • SVA-style connectors might be adapted to reduce token usage in other high-resolution multimodal pipelines beyond instruction tuning.
  • Widespread adoption of the open cookbook could shift research focus from scaling language models to systematic visual-representation choices.

Load-bearing premise

Current MLLM benchmarks do not capture visual grounding accurately enough, so new tests like CV-Bench will give a truer picture without adding their own selection biases.
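One way to probe this premise empirically is a blind-evaluation control. The sketch below is a hypothetical harness, not CV-Bench's actual protocol: score a model on the benchmark twice, once with the image and once with it withheld; blind accuracy above chance bounds how much of the headline score language priors alone can explain.

```python
# Hypothetical sketch of a language-prior leakage check (not CV-Bench's
# protocol): compare accuracy with and without the image.

def leakage_report(with_image, without_image, chance=0.25):
    """with_image / without_image: lists of 0/1 correctness per question,
    for the same questions in the same order; chance is the guess rate
    (0.25 for 4-way multiple choice)."""
    n = len(with_image)
    acc_full = sum(with_image) / n
    acc_blind = sum(without_image) / n
    return {
        "acc_with_image": acc_full,
        "acc_blind": acc_blind,
        # Portion of the blind score not attributable to guessing: an upper
        # bound on how much language priors inflate the benchmark.
        "prior_leakage": max(0.0, acc_blind - chance),
    }

# Toy example: 8 four-way multiple-choice questions with invented outcomes.
report = leakage_report(
    with_image=[1, 1, 1, 1, 1, 1, 0, 1],
    without_image=[1, 0, 0, 1, 0, 0, 0, 1],
)
print(report["prior_leakage"])  # 0.125
```

A benchmark whose questions show near-chance blind accuracy is measuring grounding rather than priors; the numbers here are fabricated for illustration only.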

What would settle it

An experiment in which Cambrian-1 models underperform prior MLLMs on a held-out set of real-world tasks that demand fine-grained visual discrimination would refute the central claim.

read the original abstract

We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, address the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Cambrian-1, a family of vision-centric multimodal LLMs. It reports experiments with over 20 vision encoders (self-supervised, supervised, and combinations), proposes the Spatial Vision Aggregator (SVA) as a dynamic connector for high-resolution features, introduces CV-Bench as a new vision-centric benchmark to address limitations in existing MLLM evaluations, details curation and balancing of visual instruction-tuning data, and claims state-of-the-art performance while releasing models, code, datasets, and recipes as an open cookbook.

Significance. If the empirical claims hold under rigorous validation, the work is significant for systematically exploring under-studied vision components in MLLMs and for releasing a comprehensive open resource that could accelerate research on visual grounding and representation learning.

major comments (2)
  1. [CV-Bench] CV-Bench section: The claim that CV-Bench delivers more accurate measurement of sensory grounding than prior benchmarks rests on reduced interpretation biases, yet the manuscript provides no ablations on task selection criteria, inter-rater reliability, or explicit controls for language-prior leakage; without these, the benchmark's superiority for vision-centric evaluation remains unverified even though it is load-bearing for the paper's central thesis.
  2. [Method and Experiments] SVA and vision-encoder experiments: The reported gains from combining encoders and using SVA for token-efficient integration lack detailed ablations on the free parameters (encoder selection, balancing ratios, SVA hyperparameters) and do not include statistical significance or error bars, which are required to substantiate the SOTA performance claims over baselines.
minor comments (2)
  1. [Experiments] Evaluation protocols: Add explicit details on data splits, exact scoring procedures, and statistical tests for all reported metrics to allow reproduction and assessment of robustness.
  2. [Figures and Notation] Notation and figures: Clarify the exact token-reduction formula for SVA and ensure all figures include axis labels, legends, and confidence intervals where applicable.
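Minor comment 2's request is easy to state abstractly. One plausible form of the token-reduction arithmetic (an illustrative reconstruction; H, W, and G are generic symbols, not values taken from the manuscript): if the encoder emits an H × W feature grid and the aggregator condenses it into G × G spatially-anchored queries, then

```latex
N_{\text{in}} = H \cdot W, \qquad
N_{\text{out}} = G^2, \qquad
\text{reduction ratio} = \frac{H \cdot W}{G^2}
```

For example, a 24 × 24 grid aggregated to 8 × 8 queries passes 64 tokens to the LLM instead of 576, a 9× reduction. The referee's point is that the paper should pin down its own H, W, and G explicitly.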

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully considered each major comment and outline our responses below, including planned revisions to address the concerns raised while preserving the core contributions of Cambrian-1.

read point-by-point responses
  1. Referee: [CV-Bench] CV-Bench section: The claim that CV-Bench delivers more accurate measurement of sensory grounding than prior benchmarks rests on reduced interpretation biases, yet the manuscript provides no ablations on task selection criteria, inter-rater reliability, or explicit controls for language-prior leakage; without these, the benchmark's superiority for vision-centric evaluation remains unverified even though it is load-bearing for the paper's central thesis.

    Authors: We appreciate the referee's emphasis on rigorous validation for CV-Bench. While the benchmark was designed with tasks that prioritize direct visual perception (e.g., spatial relations and object attributes) to reduce reliance on language priors compared to existing MLLM benchmarks, we acknowledge that explicit documentation of these design choices is needed. In the revised manuscript, we will expand the CV-Bench section to detail the task selection criteria, report inter-rater reliability scores from the annotation process, and include controls for language-prior leakage such as ablation studies comparing model performance with and without visual inputs. These additions will better substantiate the benchmark's utility for vision-centric evaluation. revision: yes

  2. Referee: [Method and Experiments] SVA and vision-encoder experiments: The reported gains from combining encoders and using SVA for token-efficient integration lack detailed ablations on the free parameters (encoder selection, balancing ratios, SVA hyperparameters) and do not include statistical significance or error bars, which are required to substantiate the SOTA performance claims over baselines.

    Authors: We thank the referee for underscoring the importance of comprehensive ablations and statistical rigor to support our empirical claims. Our experiments systematically evaluated over 20 vision encoders and their combinations, with SVA hyperparameters selected via validation performance and balancing ratios informed by data distribution analysis. To strengthen this, the revised version will incorporate additional ablation studies on encoder selection, data balancing ratios, and SVA hyperparameters in the main text and appendix. We will also report error bars from multiple random seeds and include statistical significance tests (e.g., paired t-tests) for key comparisons against baselines to more robustly substantiate the performance gains. revision: yes
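The paired t-test the rebuttal promises is a few lines of arithmetic. A stdlib-only sketch (the per-benchmark scores below are invented for illustration, not taken from the paper):

```python
# Paired t-test sketch for comparing two models across the same benchmarks.
# Scores are hypothetical; with real data you would also look up the p-value
# for n-1 degrees of freedom.
import math

def paired_t(scores_a, scores_b):
    """Return the t-statistic for the paired differences (a_i - b_i)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-benchmark accuracies: candidate model vs. baseline.
candidate = [71.2, 64.5, 58.9, 80.1, 67.3, 73.8]
baseline  = [69.8, 63.0, 59.2, 78.5, 66.1, 72.4]
t = paired_t(candidate, baseline)
print(round(t, 2))
```

Pairing by benchmark is what makes the test appropriate here: both models are scored on the same task suite, so per-task difficulty cancels out of the differences.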

Circularity Check

0 steps flagged

No significant circularity: empirical contributions and new benchmarks stand independently

full rationale

The paper's central claims rest on new empirical evaluations across more than 20 vision encoders, architectural proposals such as SVA, curation of instruction-tuning data with explicit balancing choices, and the introduction of CV-Bench to address benchmark limitations. These elements do not reduce, via the paper's own equations or definitions, to previously fitted parameters, self-referential derivations, or load-bearing self-citations. No step matches the enumerated circularity patterns; the work is grounded in external benchmarks and falsifiable via reported model rankings and ablation-style experiments on vision components.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 1 invented entity

The paper is an empirical exploration rather than a first-principles derivation. Central claims depend on experimental validation of vision encoder choices, the design of CV-Bench, and the SVA module, with several hyperparameters and data balancing decisions made through trial rather than derived.

free parameters (3)
  • Vision encoder selection and combination
    Choice of which of the over 20 encoders to test and how to combine self-supervised and supervised ones is experimental.
  • Data source balancing ratios
    Distribution ratios for high-quality visual instruction-tuning data curated from public sources.
  • SVA architectural hyperparameters
    Parameters controlling dynamic spatial aggregation and token reduction in the new connector.
axioms (2)
  • domain assumption Visual instruction tuning serves as a reliable interface to evaluate different visual representations
    Used to compare self-supervised, strongly supervised, and combined encoders.
  • domain assumption Existing MLLM benchmarks have difficulties in consolidation and interpretation that a new vision-centric benchmark can address
    Motivation for introducing CV-Bench.
invented entities (1)
  • Spatial Vision Aggregator (SVA) no independent evidence
    purpose: Dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing token count
    Proposed to improve visual grounding in multimodal LLMs.

pith-pipeline@v0.9.0 · 5610 in / 1684 out tokens · 113616 ms · 2026-05-16T23:59:20.630148+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  3. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  4. R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    cs.AI 2025-03 conditional novelty 7.0

    R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

  5. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  6. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  7. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  8. Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 6.0

    Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.

  9. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  10. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  11. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  12. LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    cs.CV 2024-10 unverdicted novelty 6.0

    LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...

  13. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  14. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  15. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  16. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

  17. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  18. PaliGemma 2: A Family of Versatile VLMs for Transfer

    cs.CV 2024-12 unverdicted novelty 4.0

    PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...

  19. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

Reference graph

Works this paper leans on

163 extracted references · 163 canonical work pages · cited by 19 Pith papers · 18 internal anchors

  1. [1]

    TallyQA: Answering complex counting questions

    M. Acharya, K. Kafle, and C. Kanan. “TallyQA: Answering complex counting questions”. In: AAAI. 2019

  2. [2]

    Don’t just assume; look and answer: Overcoming priors for visual question answering

    A. Agrawal et al. “Don’t just assume; look and answer: Overcoming priors for visual question answering”. In: CVPR. 2018

  3. [3]

    Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations

    A. Ahmadyan et al. “Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations”. In: CVPR (2021)

  4. [4]

    Llama 3 Model Card

    AI@Meta. “Llama 3 Model Card”. In: (2024)

  5. [5]

    Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation

    H. A. Alawwad et al. “Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation”. In: arXiv preprint arXiv:2402.05128 (2024)

  6. [6]

    Flamingo: a visual language model for few-shot learning

    J.-B. Alayrac et al. “Flamingo: a visual language model for few-shot learning”. In: NeurIPS. 2022

  7. [7]

    T. Aquinas. Quaestiones Disputatae de Veritate. q.2 a.3 arg.19, 1259

  8. [8]

    Metaphysics

    Aristotle. Metaphysics. Trans. by W. D. Ross. The Internet Classics Archive, 350 BCE

  9. [9]

    Self-supervised learning from images with a joint-embedding predictive architecture

    M. Assran et al. “Self-supervised learning from images with a joint-embedding predictive architecture”. In: CVPR. 2023

  10. [10]

    Qwen Technical Report

    J. Bai et al. “Qwen Technical Report”. In: arXiv preprint arXiv:2309.16609 (2023)

  11. [11]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

    J. Bai et al. “Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond”. In: (2023)

  12. [12]

    Probing the 3D Awareness of Visual Foundation Models

    M. E. Banani et al. “Probing the 3D Awareness of Visual Foundation Models”. In: arXiv preprint arXiv:2404.08636 (2024)

  13. [13]

    ARKitScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data

    G. Baruch et al. “ARKitScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data”. In: NeurIPS Datasets and Benchmarks Track (Round 1). 2021

  14. [14]

    Automatikz: Text-guided synthesis of scientific vector graphics with tikz

    J. Belouadi, A. Lauscher, and S. Eger. “Automatikz: Text-guided synthesis of scientific vector graphics with tikz”. In: ICLR. 2024

  15. [15]

    MiDaS v3.1 – a model zoo for robust monocular relative depth estimation

    R. Birkl, D. Wofk, and M. Müller. “MiDaS v3.1 – a model zoo for robust monocular relative depth estimation”. In: arXiv preprint arXiv:2307.14460 (2023)

  16. [16]

    Latr: Layout-aware transformer for scene-text vqa

    A. F. Biten et al. “Latr: Layout-aware transformer for scene-text vqa”. In: CVPR. 2022

  17. [17]

    Scene text visual question answering

    A. F. Biten et al. “Scene text visual question answering”. In: ICCV. 2019

  18. [18]

    Omni3d: A large benchmark and model for 3d object detection in the wild

    G. Brazil et al. “Omni3d: A large benchmark and model for 3d object detection in the wild”. In: CVPR. 2023

  19. [19]

    J. Buchner. imagehash (fork). https://github.com/JohannesBuchner/imagehash. 2021

  20. [20]

    nuscenes: A multimodal dataset for autonomous driving

    H. Caesar et al. “nuscenes: A multimodal dataset for autonomous driving”. In: CVPR. 2020

  21. [21]

    Honeybee: Locality-enhanced projector for multimodal llm

    J. Cha et al. “Honeybee: Locality-enhanced projector for multimodal llm”. In: CVPR. 2024

  22. [22]

    Visually Dehallucinative Instruction Generation: Know What You Don’t Know

    S. Cha et al. “Visually Dehallucinative Instruction Generation: Know What You Don’t Know”. In: arXiv preprint arXiv:2402.09717 (2024)

  23. [23]

    Does Thought Require Sensory Grounding? From Pure Thinkers to Large Language Models

    D. J. Chalmers. “Does Thought Require Sensory Grounding? From Pure Thinkers to Large Language Models”. In: Proceedings and Addresses of the American Philosophical Association 97 (2023), pp. 22–45

  24. [24]

    A survey on evaluation of large language models

    Y. Chang et al. “A survey on evaluation of large language models”. In: ACM Transactions on Intelligent Systems and Technology 15.3 (2024), pp. 1–45

  25. [25]

    ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

    G. H. Chen et al. “ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model”. In: arXiv preprint arXiv:2402.11684 (2024)

  26. [26]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    L. Chen et al. “Are We on the Right Way for Evaluating Large Vision-Language Models?” In: arXiv preprint arXiv:2403.20330 (2024)

  27. [27]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    L. Chen et al. “Sharegpt4v: Improving large multi-modal models with better captions”. In: arXiv preprint arXiv:2311.12793 (2023)

  28. [28]

    Pali: A jointly-scaled multilingual language-image model

    X. Chen et al. “Pali: A jointly-scaled multilingual language-image model”. In: ICLR. 2023

  29. [29]

    An empirical study of training self-supervised vision transformers

    X. Chen, S. Xie, and K. He. “An empirical study of training self-supervised vision transformers”. In: ICCV. 2021

  30. [30]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Z. Chen et al. “How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites”. In: arXiv preprint arXiv:2404.16821 (2024)

  31. [31]

    Finqa: A dataset of numerical reasoning over financial data

    Z. Chen et al. “Finqa: A dataset of numerical reasoning over financial data”. In: EMNLP. 2021

  32. [32]

    HiTab: A hierarchical table dataset for question answering and natural language generation

    Z. Cheng et al. “HiTab: A hierarchical table dataset for question answering and natural language generation”. In: ACL. 2022

  33. [33]

    Reproducible scaling laws for contrastive language-image learning

    M. Cherti et al. “Reproducible scaling laws for contrastive language-image learning”. In: CVPR. 2023

  34. [34]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    W.-L. Chiang et al. “Chatbot arena: An open platform for evaluating llms by human preference”. In: arXiv preprint arXiv:2403.04132 (2024)

  35. [35]

    Mobilevlm v2: Faster and stronger baseline for vision language model

    X. Chu et al. “Mobilevlm v2: Faster and stronger baseline for vision language model”. In: arXiv preprint arXiv:2402.03766 (2024)

  36. [36]

    Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM

    M. Conover et al. Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM

  37. [37]

    URL: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm (visited on 06/30/2023)

  38. [38]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    W. Dai et al. “Instructblip: Towards general-purpose vision-language models with instruction tuning”. In: NeurIPS. 2024

  39. [39]

    Rlhf workflow: From reward modeling to online rlhf

    H. Dong et al. “Rlhf workflow: From reward modeling to online rlhf”. In: arXiv preprint arXiv:2405.07863 (2024)

  40. [40]

    An image is worth 16x16 words: Transformers for image recognition at scale

    A. Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale”. In: ICLR. 2021

  41. [41]

    Data filtering networks

    A. Fang et al. “Data filtering networks”. In: ICLR. 2024

  42. [42]

    BLINK: Multimodal Large Language Models Can See but Not Perceive

    X. Fu et al. “BLINK: Multimodal Large Language Models Can See but Not Perceive”. In: arXiv preprint arXiv:2404.12390 (2024)

  43. [43]

    Datacomp: In search of the next generation of multimodal datasets

    S. Y. Gadre et al. “Datacomp: In search of the next generation of multimodal datasets”. In: vol. 36. 2024

  44. [44]

    G-llava: Solving geometric problem with multi-modal large language model

    J. Gao et al. “G-llava: Solving geometric problem with multi-modal large language model”. In: arXiv preprint arXiv:2312.11370 (2023)

  45. [45]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    P. Gao et al. “LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model”. In: arXiv preprint arXiv:2304.15010 (2023)

  46. [46]

    SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

    P. Gao et al. “SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models”. In: arXiv preprint arXiv:2402.05935 (2024)

  47. [47]

    Planting a seed of vision in large language model

    Y. Ge et al. “Planting a seed of vision in large language model”. In: arXiv preprint arXiv:2307.08041 (2023)

  48. [48]

    Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite

    A. Geiger, P. Lenz, and R. Urtasun. “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite”. In: CVPR. 2012

  49. [49]

    Shortcut learning in deep neural networks

    R. Geirhos et al. “Shortcut learning in deep neural networks”. In: Nature Machine Intelli- gence (2020)

  50. [50]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    R. Girshick et al. “Rich feature hierarchies for accurate object detection and semantic segmentation”. In: CVPR. 2014

  51. [51]

    Google. Gemini. 2023

  52. [52]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Y. Goyal et al. “Making the v in vqa matter: Elevating the role of image understanding in visual question answering”. In: CVPR. 2017

  53. [53]

    Vizwiz grand challenge: Answering visual questions from blind people

    D. Gurari et al. “Vizwiz grand challenge: Answering visual questions from blind people”. In: CVPR. 2018

  54. [54]

    Masked autoencoders are scalable vision learners

    K. He et al. “Masked autoencoders are scalable vision learners”. In: CVPR. 2022

  55. [55]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    X. He et al. “PathVQA: 30000+ Questions for Medical Visual Question Answering”. In: CoRR abs/2003.10286 (2020)

  56. [56]

    AI2D-RST: A multimodal corpus of 1000 primary school science diagrams

    T. Hiippala et al. “AI2D-RST: A multimodal corpus of 1000 primary school science diagrams”. In: Language Resources and Evaluation 55 (2021), pp. 661–688

  57. [57]

    Training compute-optimal large language models

    J. Hoffmann et al. “Training compute-optimal large language models”. In: NeurIPS (2023)

  58. [58]

    Screenqa: Large-scale question-answer pairs over mobile app screenshots

    Y.-C. Hsiao, F. Zubach, M. Wang, et al. “Screenqa: Large-scale question-answer pairs over mobile app screenshots”. In: arXiv preprint arXiv:2209.08199 (2022)

  59. [59]

    GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

    D. A. Hudson and C. D. Manning. “GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering”. In: CVPR. 2019

  60. [60]

    Perceiver: General perception with iterative attention

    A. Jaegle et al. “Perceiver: General perception with iterative attention”. In: ICML. 2021

  61. [61]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    J. Johnson et al. “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning”. In: CVPR. 2017

  62. [62]

    Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

    N. Jouppi et al. “Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings”. In: Proceedings of the 50th Annual International Symposium on Computer Architecture. 2023

  63. [63]

    Dvqa: Understanding data visualizations via question answering

    K. Kafle et al. “Dvqa: Understanding data visualizations via question answering”. In: CVPR. 2018

  64. [64]

    Chart-to-text: A large-scale benchmark for chart summarization

    S. Kantharaj et al. “Chart-to-text: A large-scale benchmark for chart summarization”. In: ACL. 2022

  65. [65]

    Prismatic vlms: Investigating the design space of visually-conditioned language models

    S. Karamcheti et al. “Prismatic vlms: Investigating the design space of visually-conditioned language models”. In: arXiv preprint arXiv:2402.07865 (2024)

  66. [66]

    Geomverse: A systematic evaluation of large models for geometric reasoning

    M. Kazemi et al. “Geomverse: A systematic evaluation of large models for geometric reasoning”. In: 2023

  67. [67]

    A diagram is worth a dozen images

    A. Kembhavi et al. “A diagram is worth a dozen images”. In: ECCV. 2016

  68. [68]

    The hateful memes challenge: Detecting hate speech in multimodal memes

    D. Kiela et al. “The hateful memes challenge: Detecting hate speech in multimodal memes”. In: NeurIPS. 2020

  69. [69]

    Donut: Document understanding transformer without ocr

    G. Kim et al. “Donut: Document understanding transformer without ocr”. In: ECCV. 2022

  70. [70]

    Segment anything

    A. Kirillov et al. “Segment anything”. In: ICCV. 2023

  71. [71]

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    R. Krishna et al. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. In: IJCV (2016)

  72. [72]

    laion/gpt4v-dataset

    LAION. laion/gpt4v-dataset. 2023

  73. [73]

    Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

    H. Laurençon, L. Tronchon, and V. Sanh. “Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset”. In: arXiv preprint arXiv:2403.09029 (2024)

  74. [74]

    What matters when building vision-language models?

    H. Laurençon et al. “What matters when building vision-language models?” In: arXiv preprint arXiv:2405.02246 (2024)

  75. [75]

    Internet Explorer: Targeted Representation Learning on the Open Web

    A. C. Li et al. “Internet Explorer: Targeted Representation Learning on the Open Web”. In: ICML. 2023

  76. [76]

    Your diffusion model is secretly a zero-shot classifier

    A. C. Li et al. “Your diffusion model is secretly a zero-shot classifier”. In: ICCV. 2023

  77. [77]

    LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

    B. Li et al. LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild. 2024

  78. [78]

    Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

    L. Li et al. “Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models”. In: arXiv preprint arXiv:2403.00231 (2024)

  79. [79]

    Mini-gemini: Mining the potential of multi-modality vision language models

    Y. Li et al. “Mini-gemini: Mining the potential of multi-modality vision language models”. In: arXiv preprint arXiv:2403.18814 (2024)

  80. [80]

    OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces

    W. Lian et al. OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces. https://huggingface.co/Open-Orca/OpenOrca. 2023

Showing first 80 references.