pith. machine review for the scientific record.

arxiv: 2406.16860 · v2 · submitted 2024-06-24 · 💻 cs.CV

Recognition: 2 theorem links

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal LLMs · vision-centric design · visual instruction tuning · CV-Bench · Spatial Vision Aggregator · vision encoders · open MLLM recipes

The pith

Cambrian-1 shows that vision-centric design, tested across more than twenty encoders and paired with new benchmarks, produces stronger sensory grounding in multimodal LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cambrian-1 introduces a family of multimodal LLMs built by prioritizing vision components over language scaling alone. The work treats visual instruction tuning as a testbed for more than twenty vision encoders drawn from self-supervised, supervised, and hybrid training regimes. Existing benchmarks are examined for weak visual coverage, prompting the creation of CV-Bench to measure grounding more directly. A Spatial Vision Aggregator is added to fuse high-resolution features without excessive tokens, while data curation rules emphasize source balance. The models reach state-of-the-art results, and every component is released as an open recipe for future instruction-tuned MLLMs.

Core claim

Cambrian-1 achieves state-of-the-art performance on multimodal tasks by using a vision-centric approach that includes evaluating multiple vision encoders, introducing the CV-Bench for better measurement of visual capabilities, and employing the Spatial Vision Aggregator to integrate features spatially. The work also details the curation of instruction-tuning data and releases all components openly as a cookbook for future MLLM development.

What carries the argument

The Spatial Vision Aggregator, a dynamic spatially-aware connector that fuses high-resolution vision features with an LLM while cutting token count.
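The connector's token-reduction idea can be made concrete with a rough sketch. This is a hypothetical illustration, not the paper's SVA (which uses learned, dynamic cross-attention): a high-resolution grid of vision features is aggregated into a coarser grid of spatially-anchored positions, so far fewer tokens reach the LLM. Here the "aggregation" is plain average pooling, the simplest stand-in.

```python
# Hypothetical sketch of SVA-style spatial aggregation (not the paper's code):
# a grid x grid map of high-resolution feature vectors is pooled into a coarser
# out_grid x out_grid map, cutting the number of tokens passed to the LLM.

def spatial_aggregate(features, grid, out_grid):
    """Average-pool a grid*grid list of feature vectors (row-major order)
    down to out_grid*out_grid vectors."""
    assert grid % out_grid == 0, "sketch assumes an integer pooling stride"
    stride = grid // out_grid
    dim = len(features[0])
    out = []
    for qr in range(out_grid):
        for qc in range(out_grid):
            # Each output position aggregates (here: uniformly averages) its
            # local stride x stride window of high-resolution features; the
            # real SVA instead attends over such windows with learned queries.
            acc = [0.0] * dim
            for r in range(qr * stride, (qr + 1) * stride):
                for c in range(qc * stride, (qc + 1) * stride):
                    acc = [a + x for a, x in zip(acc, features[r * grid + c])]
            n = stride * stride
            out.append([a / n for a in acc])
    return out

# 24x24 = 576 high-res tokens reduced to 8x8 = 64 tokens (a 9x reduction).
feats = [[float(i % 7)] * 4 for i in range(24 * 24)]
pooled = spatial_aggregate(feats, grid=24, out_grid=8)
print(len(pooled))  # 64
```

The grid sizes above are illustrative, not the paper's configuration; the point is only that spatial aggregation preserves a 2D layout while shrinking token count quadratically.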

If this is right

  • Balanced selection of public visual instruction data improves model performance without new private datasets.
  • Hybrid vision encoders outperform single-paradigm encoders when paired with the same LLM backbone.
  • CV-Bench scores correlate more closely with real-world visual grounding than earlier multimodal suites.
  • Releasing full weights, code, and tuning recipes allows direct reproduction and extension by other groups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same encoder-comparison method could be applied to test whether newer self-supervised vision models close the remaining gap to supervised ones.
  • SVA-style connectors might be adapted to reduce token usage in other high-resolution multimodal pipelines beyond instruction tuning.
  • Widespread adoption of the open cookbook could shift research focus from scaling language models to systematic visual-representation choices.

Load-bearing premise

Current MLLM benchmarks do not capture visual grounding accurately enough, so new tests like CV-Bench will give a truer picture without adding their own selection biases.
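One way to probe this premise empirically is a blind-evaluation control. The sketch below is a hypothetical harness, not CV-Bench's actual protocol: score a model on the benchmark twice, once with the image and once with it withheld; blind accuracy above chance bounds how much of the headline score language priors alone can explain.

```python
# Hypothetical sketch of a language-prior leakage check (not CV-Bench's
# protocol): compare accuracy with and without the image.

def leakage_report(with_image, without_image, chance=0.25):
    """with_image / without_image: lists of 0/1 correctness per question,
    for the same questions in the same order; chance is the guess rate
    (0.25 for 4-way multiple choice)."""
    n = len(with_image)
    acc_full = sum(with_image) / n
    acc_blind = sum(without_image) / n
    return {
        "acc_with_image": acc_full,
        "acc_blind": acc_blind,
        # Portion of the blind score not attributable to guessing: an upper
        # bound on how much language priors inflate the benchmark.
        "prior_leakage": max(0.0, acc_blind - chance),
    }

# Toy example: 8 four-way multiple-choice questions with invented outcomes.
report = leakage_report(
    with_image=[1, 1, 1, 1, 1, 1, 0, 1],
    without_image=[1, 0, 0, 1, 0, 0, 0, 1],
)
print(report["prior_leakage"])  # 0.125
```

A benchmark whose questions show near-chance blind accuracy is measuring grounding rather than priors; the numbers here are fabricated for illustration only.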

What would settle it

An experiment in which Cambrian-1 models underperform prior MLLMs on a held-out set of real-world tasks that demand fine-grained visual discrimination would refute the central claim.

read the original abstract

We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, address the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Cambrian-1, a family of vision-centric multimodal LLMs. It reports experiments with over 20 vision encoders (self-supervised, supervised, and combinations), proposes the Spatial Vision Aggregator (SVA) as a dynamic connector for high-resolution features, introduces CV-Bench as a new vision-centric benchmark to address limitations in existing MLLM evaluations, details curation and balancing of visual instruction-tuning data, and claims state-of-the-art performance while releasing models, code, datasets, and recipes as an open cookbook.

Significance. If the empirical claims hold under rigorous validation, the work is significant for systematically exploring under-studied vision components in MLLMs and for releasing a comprehensive open resource that could accelerate research on visual grounding and representation learning.

major comments (2)
  1. [CV-Bench] CV-Bench section: The claim that CV-Bench delivers more accurate measurement of sensory grounding than prior benchmarks rests on reduced interpretation biases, yet the manuscript provides no ablations on task selection criteria, inter-rater reliability, or explicit controls for language-prior leakage; without these, the benchmark's superiority for vision-centric evaluation remains unverified even though it is load-bearing for the paper's central thesis.
  2. [Method and Experiments] SVA and vision-encoder experiments: The reported gains from combining encoders and using SVA for token-efficient integration lack detailed ablations on the free parameters (encoder selection, balancing ratios, SVA hyperparameters) and do not include statistical significance or error bars, which are required to substantiate the SOTA performance claims over baselines.
minor comments (2)
  1. [Experiments] Evaluation protocols: Add explicit details on data splits, exact scoring procedures, and statistical tests for all reported metrics to allow reproduction and assessment of robustness.
  2. [Figures and Notation] Notation and figures: Clarify the exact token-reduction formula for SVA and ensure all figures include axis labels, legends, and confidence intervals where applicable.
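Minor comment 2's request is easy to state abstractly. One plausible form of the token-reduction arithmetic (an illustrative reconstruction; H, W, and G are generic symbols, not values taken from the manuscript): if the encoder emits an H × W feature grid and the aggregator condenses it into G × G spatially-anchored queries, then

```latex
N_{\text{in}} = H \cdot W, \qquad
N_{\text{out}} = G^2, \qquad
\text{reduction ratio} = \frac{H \cdot W}{G^2}
```

For example, a 24 × 24 grid aggregated to 8 × 8 queries passes 64 tokens to the LLM instead of 576, a 9× reduction. The referee's point is that the paper should pin down its own H, W, and G explicitly.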

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully considered each major comment and outline our responses below, including planned revisions to address the concerns raised while preserving the core contributions of Cambrian-1.

read point-by-point responses
  1. Referee: [CV-Bench] CV-Bench section: The claim that CV-Bench delivers more accurate measurement of sensory grounding than prior benchmarks rests on reduced interpretation biases, yet the manuscript provides no ablations on task selection criteria, inter-rater reliability, or explicit controls for language-prior leakage; without these, the benchmark's superiority for vision-centric evaluation remains unverified even though it is load-bearing for the paper's central thesis.

    Authors: We appreciate the referee's emphasis on rigorous validation for CV-Bench. While the benchmark was designed with tasks that prioritize direct visual perception (e.g., spatial relations and object attributes) to reduce reliance on language priors compared to existing MLLM benchmarks, we acknowledge that explicit documentation of these design choices is needed. In the revised manuscript, we will expand the CV-Bench section to detail the task selection criteria, report inter-rater reliability scores from the annotation process, and include controls for language-prior leakage such as ablation studies comparing model performance with and without visual inputs. These additions will better substantiate the benchmark's utility for vision-centric evaluation. revision: yes

  2. Referee: [Method and Experiments] SVA and vision-encoder experiments: The reported gains from combining encoders and using SVA for token-efficient integration lack detailed ablations on the free parameters (encoder selection, balancing ratios, SVA hyperparameters) and do not include statistical significance or error bars, which are required to substantiate the SOTA performance claims over baselines.

    Authors: We thank the referee for underscoring the importance of comprehensive ablations and statistical rigor to support our empirical claims. Our experiments systematically evaluated over 20 vision encoders and their combinations, with SVA hyperparameters selected via validation performance and balancing ratios informed by data distribution analysis. To strengthen this, the revised version will incorporate additional ablation studies on encoder selection, data balancing ratios, and SVA hyperparameters in the main text and appendix. We will also report error bars from multiple random seeds and include statistical significance tests (e.g., paired t-tests) for key comparisons against baselines to more robustly substantiate the performance gains. revision: yes
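The paired t-test the rebuttal promises is a few lines of arithmetic. A stdlib-only sketch (the per-benchmark scores below are invented for illustration, not taken from the paper):

```python
# Paired t-test sketch for comparing two models across the same benchmarks.
# Scores are hypothetical; with real data you would also look up the p-value
# for n-1 degrees of freedom.
import math

def paired_t(scores_a, scores_b):
    """Return the t-statistic for the paired differences (a_i - b_i)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-benchmark accuracies: candidate model vs. baseline.
candidate = [71.2, 64.5, 58.9, 80.1, 67.3, 73.8]
baseline  = [69.8, 63.0, 59.2, 78.5, 66.1, 72.4]
t = paired_t(candidate, baseline)
print(round(t, 2))
```

Pairing by benchmark is what makes the test appropriate here: both models are scored on the same task suite, so per-task difficulty cancels out of the differences.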

Circularity Check

0 steps flagged

No significant circularity: empirical contributions and new benchmarks stand independently

full rationale

The paper's central claims rest on new empirical evaluations across more than 20 vision encoders, architectural proposals such as SVA, curation of instruction-tuning data with explicit balancing choices, and the introduction of CV-Bench to address benchmark limitations. These elements do not reduce, via the paper's own equations or definitions, to previously fitted parameters, self-referential derivations, or load-bearing self-citations. No step matches the enumerated circularity patterns; the work is grounded in external benchmarks and falsifiable via reported model rankings and ablation-style experiments on vision components.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 1 invented entity

The paper is an empirical exploration rather than a first-principles derivation. Central claims depend on experimental validation of vision encoder choices, the design of CV-Bench, and the SVA module, with several hyperparameters and data balancing decisions made through trial rather than derived.

free parameters (3)
  • Vision encoder selection and combination
    Choice of which of the over 20 encoders to test and how to combine self-supervised and supervised ones is experimental.
  • Data source balancing ratios
    Distribution ratios for high-quality visual instruction-tuning data curated from public sources.
  • SVA architectural hyperparameters
    Parameters controlling dynamic spatial aggregation and token reduction in the new connector.
axioms (2)
  • domain assumption Visual instruction tuning serves as a reliable interface to evaluate different visual representations
    Used to compare self-supervised, strongly supervised, and combined encoders.
  • domain assumption Existing MLLM benchmarks have difficulties in consolidation and interpretation that a new vision-centric benchmark can address
    Motivation for introducing CV-Bench.
invented entities (1)
  • Spatial Vision Aggregator (SVA) no independent evidence
    purpose: Dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing token count
    Proposed to improve visual grounding in multimodal LLMs.

pith-pipeline@v0.9.0 · 5610 in / 1684 out tokens · 113616 ms · 2026-05-16T23:59:20.630148+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  3. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  4. R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    cs.AI 2025-03 conditional novelty 7.0

    R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

  5. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  6. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  7. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  8. Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 6.0

    Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.

  9. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  10. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  11. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  12. LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    cs.CV 2024-10 unverdicted novelty 6.0

    LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...

  13. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  14. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  15. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  16. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

  17. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  18. PaliGemma 2: A Family of Versatile VLMs for Transfer

    cs.CV 2024-12 unverdicted novelty 4.0

    PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...

  19. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

Reference graph

Works this paper leans on

163 extracted references · 163 canonical work pages · cited by 19 Pith papers · 18 internal anchors

  1. [1]

    TallyQA: Answering complex counting questions

    M. Acharya, K. Kafle, and C. Kanan. “TallyQA: Answering complex counting questions”. In: AAAI. 2019

  2. [2]

    Don’t just assume; look and answer: Overcoming priors for visual question answering

    A. Agrawal et al. “Don’t just assume; look and answer: Overcoming priors for visual question answering”. In: CVPR. 2018

  3. [3]

    Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations

    A. Ahmadyan et al. “Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations”. In: CVPR (2021)

  4. [4]

    Llama 3 Model Card

    AI@Meta. “Llama 3 Model Card”. In: (2024)

  5. [5]

    Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation

    H. A. Alawwad et al. “Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation”. In: arXiv preprint arXiv:2402.05128 (2024)

  6. [6]

    Flamingo: a visual language model for few-shot learning

    J.-B. Alayrac et al. “Flamingo: a visual language model for few-shot learning”. In: NeurIPS. 2022

  7. [7]

    T. Aquinas. Quaestiones Disputatae de Veritate. q.2 a.3 arg.19, 1259

  8. [8]

    Metaphysics

    Aristotle. Metaphysics. Trans. by W. D. Ross. The Internet Classics Archive, 350 BCE

  9. [9]

    Self-supervised learning from images with a joint-embedding predictive architecture

    M. Assran et al. “Self-supervised learning from images with a joint-embedding predictive architecture”. In: CVPR. 2023

  10. [10]

    Qwen Technical Report

    J. Bai et al. “Qwen Technical Report”. In: arXiv preprint arXiv:2309.16609 (2023)

  11. [11]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

    J. Bai et al. “Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond”. In: (2023)

  12. [12]

    Probing the 3D Awareness of Visual Foundation Models

    M. E. Banani et al. “Probing the 3D Awareness of Visual Foundation Models”. In: arXiv preprint arXiv:2404.08636 (2024)

  13. [13]

    ARKitScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data

    G. Baruch et al. “ARKitScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data”. In: NeurIPS Datasets and Benchmarks Track (Round 1). 2021

  14. [14]

    Automatikz: Text-guided synthesis of scientific vector graphics with tikz

    J. Belouadi, A. Lauscher, and S. Eger. “Automatikz: Text-guided synthesis of scientific vector graphics with tikz”. In: ICLR. 2024

  15. [15]

    MiDaS v3.1 – a model zoo for robust monocular relative depth estimation

    R. Birkl, D. Wofk, and M. Müller. “MiDaS v3.1 – a model zoo for robust monocular relative depth estimation”. In: arXiv preprint arXiv:2307.14460 (2023)

  16. [16]

    Latr: Layout-aware transformer for scene-text vqa

    A. F. Biten et al. “Latr: Layout-aware transformer for scene-text vqa”. In: CVPR. 2022

  17. [17]

    Scene text visual question answering

    A. F. Biten et al. “Scene text visual question answering”. In: ICCV. 2019

  18. [18]

    Omni3d: A large benchmark and model for 3d object detection in the wild

    G. Brazil et al. “Omni3d: A large benchmark and model for 3d object detection in the wild”. In: CVPR. 2023

  19. [19]

    J. Buchner. imagehash (fork). https://github.com/JohannesBuchner/imagehash. 2021

  20. [20]

    nuscenes: A multimodal dataset for autonomous driving

    H. Caesar et al. “nuscenes: A multimodal dataset for autonomous driving”. In: CVPR. 2020

  21. [21]

    Honeybee: Locality-enhanced projector for multimodal llm

    J. Cha et al. “Honeybee: Locality-enhanced projector for multimodal llm”. In: CVPR. 2024

  22. [22]

    Visually Dehallucinative Instruction Generation: Know What You Don’t Know

    S. Cha et al. “Visually Dehallucinative Instruction Generation: Know What You Don’t Know”. In: arXiv preprint arXiv:2402.09717 (2024)

  23. [23]

    Does Thought Require Sensory Grounding? From Pure Thinkers to Large Language Models

    D. J. Chalmers. “Does Thought Require Sensory Grounding? From Pure Thinkers to Large Language Models”. In: Proceedings and Addresses of the American Philosophical Association 97 (2023), pp. 22–45

  24. [24]

    A survey on evaluation of large language models

    Y. Chang et al. “A survey on evaluation of large language models”. In: ACM Transactions on Intelligent Systems and Technology 15.3 (2024), pp. 1–45

  25. [25]

    ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

    G. H. Chen et al. “ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model”. In: arXiv preprint arXiv:2402.11684 (2024)

  26. [26]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    L. Chen et al. “Are We on the Right Way for Evaluating Large Vision-Language Models?” In: arXiv preprint arXiv:2403.20330 (2024)

  27. [27]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    L. Chen et al. “Sharegpt4v: Improving large multi-modal models with better captions”. In: arXiv preprint arXiv:2311.12793 (2023)

  28. [28]

    Pali: A jointly-scaled multilingual language-image model

    X. Chen et al. “Pali: A jointly-scaled multilingual language-image model”. In: ICLR. 2023

  29. [29]

    An empirical study of training self-supervised vision transformers

    X. Chen, S. Xie, and K. He. “An empirical study of training self-supervised vision transformers”. In: ICCV. 2021

  30. [30]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Z. Chen et al. “How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites”. In: arXiv preprint arXiv:2404.16821 (2024)

  31. [31]

    Finqa: A dataset of numerical reasoning over financial data

    Z. Chen et al. “Finqa: A dataset of numerical reasoning over financial data”. In: EMNLP. 2021

  32. [32]

    HiTab: A hierarchical table dataset for question answering and natural language generation

    Z. Cheng et al. “HiTab: A hierarchical table dataset for question answering and natural language generation”. In: ACL. 2022

  33. [33]

    Reproducible scaling laws for contrastive language-image learning

    M. Cherti et al. “Reproducible scaling laws for contrastive language-image learning”. In: CVPR. 2023

  34. [34]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    W.-L. Chiang et al. “Chatbot arena: An open platform for evaluating llms by human preference”. In: arXiv preprint arXiv:2403.04132 (2024)

  35. [35]

    Mobilevlm v2: Faster and stronger baseline for vision language model

    X. Chu et al. “Mobilevlm v2: Faster and stronger baseline for vision language model”. In: arXiv preprint arXiv:2402.03766 (2024)

  36. [36]

    Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM

    M. Conover et al. Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM

  37. [37]

    URL: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm (visited on 06/30/2023)

  38. [38]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    W. Dai et al. “Instructblip: Towards general-purpose vision-language models with instruction tuning”. In: NeurIPS. 2024

  39. [39]

    Rlhf workflow: From reward modeling to online rlhf

    H. Dong et al. “Rlhf workflow: From reward modeling to online rlhf”. In: arXiv preprint arXiv:2405.07863 (2024)

  40. [40]

    An image is worth 16x16 words: Transformers for image recognition at scale

    A. Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale”. In: ICLR. 2021

  41. [41]

    Data filtering networks

    A. Fang et al. “Data filtering networks”. In: ICLR. 2024

  42. [42]

    BLINK: Multimodal Large Language Models Can See but Not Perceive

    X. Fu et al. “BLINK: Multimodal Large Language Models Can See but Not Perceive”. In: arXiv preprint arXiv:2404.12390 (2024)

  43. [43]

    Datacomp: In search of the next generation of multimodal datasets

    S. Y. Gadre et al. “Datacomp: In search of the next generation of multimodal datasets”. In: vol. 36. 2024

  44. [44]

    G-llava: Solving geometric problem with multi-modal large language model

    J. Gao et al. “G-llava: Solving geometric problem with multi-modal large language model”. In: arXiv preprint arXiv:2312.11370 (2023)

  45. [45]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    P. Gao et al. “LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model”. In: arXiv preprint arXiv:2304.15010 (2023)

  46. [46]

    SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

    P. Gao et al. “SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models”. In: arXiv preprint arXiv:2402.05935 (2024)

  47. [47]

    Planting a seed of vision in large language model

    Y. Ge et al. “Planting a seed of vision in large language model”. In: arXiv preprint arXiv:2307.08041 (2023)

  48. [48]

    Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite

    A. Geiger, P. Lenz, and R. Urtasun. “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite”. In: CVPR. 2012

  49. [49]

    Shortcut learning in deep neural networks

    R. Geirhos et al. “Shortcut learning in deep neural networks”. In: Nature Machine Intelli- gence (2020)

  50. [50]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    R. Girshick et al. “Rich feature hierarchies for accurate object detection and semantic segmentation”. In: CVPR. 2014

  51. [51]

    Google. Gemini. 2023

  52. [52]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Y. Goyal et al. “Making the v in vqa matter: Elevating the role of image understanding in visual question answering”. In: CVPR. 2017

  53. [53]

    Vizwiz grand challenge: Answering visual questions from blind people

    D. Gurari et al. “Vizwiz grand challenge: Answering visual questions from blind people”. In: CVPR. 2018

  54. [54]

    Masked autoencoders are scalable vision learners

    K. He et al. “Masked autoencoders are scalable vision learners”. In: CVPR. 2022

  55. [55]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    X. He et al. “PathVQA: 30000+ Questions for Medical Visual Question Answering”. In: CoRR abs/2003.10286 (2020)

  56. [56]

    AI2D-RST: A multimodal corpus of 1000 primary school science diagrams

    T. Hiippala et al. “AI2D-RST: A multimodal corpus of 1000 primary school science diagrams”. In: Language Resources and Evaluation 55 (2021), pp. 661–688

  57. [57]

    Training compute-optimal large language models

    J. Hoffmann et al. “Training compute-optimal large language models”. In: NeurIPS (2023)

  58. [58]

    Screenqa: Large-scale question-answer pairs over mobile app screenshots

    Y.-C. Hsiao, F. Zubach, M. Wang, et al. “Screenqa: Large-scale question-answer pairs over mobile app screenshots”. In: arXiv preprint arXiv:2209.08199 (2022)

  59. [59]

    GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

    D. A. Hudson and C. D. Manning. “GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering”. In: CVPR. 2019

  60. [60]

    Perceiver: General perception with iterative attention

    A. Jaegle et al. “Perceiver: General perception with iterative attention”. In: ICML. 2021

  61. [61]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    J. Johnson et al. “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning”. In: CVPR. 2017

  62. [62]

    Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

    N. Jouppi et al. “Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings”. In: Proceedings of the 50th Annual International Symposium on Computer Architecture. 2023

  63. [63]

    Dvqa: Understanding data visualizations via question answering

    K. Kafle et al. “Dvqa: Understanding data visualizations via question answering”. In: CVPR. 2018

  64. [64]

    Chart-to-text: A large-scale benchmark for chart summarization

    S. Kantharaj et al. “Chart-to-text: A large-scale benchmark for chart summarization”. In: ACL. 2022

  65. [65]

    Prismatic vlms: Investigating the design space of visually-conditioned language models

    S. Karamcheti et al. “Prismatic vlms: Investigating the design space of visually-conditioned language models”. In: arXiv preprint arXiv:2402.07865 (2024)

  66. [66]

    Geomverse: A systematic evaluation of large models for geometric reasoning

    M. Kazemi et al. “Geomverse: A systematic evaluation of large models for geometric reasoning”. In: 2023

  67. [67]

    A diagram is worth a dozen images

    A. Kembhavi et al. “A diagram is worth a dozen images”. In: ECCV. 2016

  68. [68]

    The hateful memes challenge: Detecting hate speech in multimodal memes

    D. Kiela et al. “The hateful memes challenge: Detecting hate speech in multimodal memes”. In: NeurIPS. 2020

  69. [69]

    Donut: Document understanding transformer without ocr

    G. Kim et al. “Donut: Document understanding transformer without ocr”. In: ECCV. 2022

  70. [70]

    Segment anything

    A. Kirillov et al. “Segment anything”. In: ICCV. 2023

  71. [71]

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    R. Krishna et al. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. In: IJCV (2016)

  72. [72]

    laion/gpt4v-dataset

    LAION. laion/gpt4v-dataset. 2023

  73. [73]

    Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

    H. Laurençon, L. Tronchon, and V. Sanh. “Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset”. In: arXiv preprint arXiv:2403.09029 (2024)

  74. [74]

    What matters when building vision-language models?

    H. Laurençon et al. “What matters when building vision-language models?” In: arXiv preprint arXiv:2405.02246 (2024)

  75. [75]

    Internet Explorer: Targeted Representation Learning on the Open Web

    A. C. Li et al. “Internet Explorer: Targeted Representation Learning on the Open Web”. In: ICML. 2023

  76. [76]

    Your diffusion model is secretly a zero-shot classifier

    A. C. Li et al. “Your diffusion model is secretly a zero-shot classifier”. In: ICCV. 2023

  77. [77]

    LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

    B. Li et al. LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild. 2024

  78. [78]

    Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

    L. Li et al. “Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models”. In: arXiv preprint arXiv:2403.00231 (2024)

  79. [79]

    Mini-gemini: Mining the potential of multi-modality vision language models

    Y. Li et al. “Mini-gemini: Mining the potential of multi-modality vision language models”. In: arXiv preprint arXiv:2403.18814 (2024)

  80. [80]

    OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces

    W. Lian et al. OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces. https://huggingface.co/Open-Orca/OpenOrca. 2023

Showing first 80 references.