arxiv: 2409.17146 · v2 · submitted 2024-09-25 · 💻 cs.CV · cs.CL· cs.LG

Recognition: 2 theorem links

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke , Christopher Clark , Sangho Lee , Rohun Tripathi , Yue Yang , Jae Sung Park , Mohammadreza Salehi , Niklas Muennighoff

show 42 more authors

Kyle Lo Luca Soldaini Jiasen Lu Taira Anderson Erin Bransom Kiana Ehsani Huong Ngo YenSung Chen Ajay Patel Mark Yatskar Chris Callison-Burch Andrew Head Rose Hendrix Favyen Bastani Eli VanderBilt Nathan Lambert Yvonne Chou Arnavi Chheda Jenna Sparks Sam Skjonsberg Michael Schmitz Aaron Sarnat Byron Bischoff Pete Walsh Chris Newell Piper Wolters Tanmay Gupta Kuo-Hao Zeng Jon Borchardt Dirk Groeneveld Crystal Nam Sophie Lebrecht Caitlin Wittlif Carissa Schoenick Oscar Michel Ranjay Krishna Luca Weihs Noah A. Smith Hannaneh Hajishirzi Ross Girshick Ali Farhadi Aniruddha Kembhavi

Authors on Pith no claims yet

Pith reviewed 2026-05-15 01:50 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.LG

keywords vision-language modelsopen datasetsimage captioningvisual question answeringpointing datamultimodal learningopen weightsdata collection

0 comments

The pith

New independently collected datasets enable open vision-language models that outperform most proprietary alternatives on benchmarks and human evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that vision-language models can reach state-of-the-art performance in their openness class by training from scratch on freshly gathered data instead of distilling from closed systems. It introduces datasets of detailed image captions for pre-training, free-form question-answer pairs for fine-tuning, and 2D pointing annotations, all assembled without external model assistance. A sympathetic reader would care because this supplies the missing foundation for building capable models openly and reduces reliance on proprietary data pipelines. If the claim holds, the community obtains practical resources to replicate, scale, and extend high-performing systems independently. The largest model in the presented family demonstrates the outcome by surpassing other open models and several larger closed ones according to standard tests and a large-scale human study.

Core claim

The authors establish that a collection of human-gathered datasets allows construction of state-of-the-art open vision-language models without synthetic data from proprietary systems. These datasets comprise highly detailed image captions for pre-training, free-form image question-answer pairs for fine-tuning, and an innovative 2D pointing dataset. The resulting models, particularly the 72 billion parameter variant, achieve performance that exceeds other open-weight models and larger proprietary models including Claude 3.5 Sonnet and Gemini 1.5 variants, ranking second only to the leading closed model on both academic benchmarks and human evaluations.

What carries the argument

The PixMo datasets of highly detailed image captions, free-form image question-answer pairs, and 2D pointing data, all collected without the use of external vision-language models, which serve as the primary training resource enabling performance gains.

If this is right

Open-weight vision-language models can surpass several larger proprietary systems without depending on distillation from closed models.
Data collection focused on detailed captions, free-form questions, and pointing annotations yields measurable gains in both automated metrics and human judgments.
Full release of weights, datasets, and code enables direct replication and extension by the broader community.
Training pipelines that avoid external model-generated data restore independent development paths for multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same independent collection approach could be extended to create datasets for related tasks such as video or document understanding.
Full openness of both data and weights may encourage standardized evaluation practices that reduce hidden dependencies across the field.
Testing whether the 2D pointing component specifically improves spatial reasoning accuracy would isolate one data contribution.
Researchers could combine these datasets with alternative model scales or architectures to determine the relative importance of data versus design choices.

Load-bearing premise

The performance gains arise primarily from the quality and independence of the newly collected datasets rather than from specific undisclosed modeling choices or training pipeline details.

What would settle it

Retrain the same model architecture using only synthetic data generated by existing proprietary vision-language models in place of the new datasets and measure whether benchmark scores and human evaluation results drop to levels typical of prior open models.

read the original abstract

Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Molmo ships new independent PixMo datasets and open models with competitive results, but the paper does not isolate how much the data drives the gains versus modeling and pipeline choices.

read the letter

The main takeaway is that this work releases PixMo, a set of new datasets for detailed image captions, free-form QA, and 2D pointing that were collected without distilling from closed VLMs. That fills a practical gap for people who want to train VLMs from more independent sources. The 72B Molmo model is presented as outperforming other open-weight systems and some larger proprietary ones like Claude 3.5 Sonnet on both benchmarks and human evaluations, with weights, data, and code all made public.

Referee Report

1 major / 1 minor

Summary. The paper introduces Molmo, a family of open-weight vision-language models, and the PixMo datasets (highly detailed image captions for pre-training, free-form image QA for fine-tuning, and a novel 2D pointing dataset), all collected without using external VLMs. The authors claim that their best 72B model achieves state-of-the-art results among open-weight and open-data models, outperforming several larger proprietary models (Claude 3.5 Sonnet, Gemini 1.5 Pro, Gemini 1.5 Flash) and ranking second only to GPT-4o on academic benchmarks and a large-scale human evaluation. Model weights, datasets, and code are released at https://molmo.allenai.org/blog.

Significance. If the performance claims hold, the work is significant for supplying fully open weights, data, and code that enable the community to study and replicate strong VLMs without distilling from closed models. The independent collection of PixMo data (detailed captions, free-form QA, and 2D pointing) offers a concrete alternative to synthetic data pipelines and can support further research on data quality for vision-language modeling.

major comments (1)

[Abstract] Abstract: the central claim that PixMo data quality is 'most critically' responsible for the reported gains (outperforming open peers and several proprietary models) is not supported by isolating ablations. No experiments hold model architecture, optimizer schedule, data volume, and training pipeline fixed while swapping PixMo for standard open corpora or synthetic data from closed VLMs; therefore the causal attribution remains unisolated.

minor comments (1)

[Abstract] Abstract: benchmark results are described only qualitatively ('strong benchmark and human evaluation results'); adding a brief table or sentence with key metrics, baselines, and error bars would improve immediate verifiability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the concern about the strength of evidence for attributing performance gains to PixMo data quality below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that PixMo data quality is 'most critically' responsible for the reported gains (outperforming open peers and several proprietary models) is not supported by isolating ablations. No experiments hold model architecture, optimizer schedule, data volume, and training pipeline fixed while swapping PixMo for standard open corpora or synthetic data from closed VLMs; therefore the causal attribution remains unisolated.

Authors: We agree that the manuscript lacks a single, fully isolated ablation that holds model architecture, optimizer schedule, data volume, and training pipeline exactly fixed while only swapping the data source between PixMo and standard open corpora or synthetic data from closed VLMs. Performing such a controlled swap at the 72B scale is computationally prohibitive. That said, we provide supporting evidence via smaller-scale controlled ablations (reported in Section 4) that compare PixMo captions against LAION-5B and COCO while keeping architecture and training recipe fixed, as well as direct performance comparisons against open models trained on synthetic data from proprietary VLMs. Human preference studies further corroborate the higher quality of PixMo annotations. To address the referee's point, we will revise the abstract to replace 'most critically' with 'significantly' and add an explicit limitations paragraph acknowledging the absence of a full-scale isolating ablation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from new independent datasets

full rationale

The paper's claims rest on the creation of new PixMo datasets (detailed captions, free-form QA, 2D pointing) collected without external VLMs, followed by standard training of Molmo models using described architecture choices and pipeline tuning. Performance is evaluated on public benchmarks and human studies. No equations, derivations, or first-principles predictions are presented that reduce to fitted parameters or self-referential definitions. Self-citations, if present, support background context rather than load-bearing justification for the SOTA results. The derivation chain is self-contained empirical work with no reduction by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on standard deep learning training assumptions and the premise that human-collected high-quality data drives performance gains, with no new physical entities postulated.

free parameters (2)

training hyperparameters
A well-tuned training pipeline implies selection or fitting of learning rates, batch sizes, and optimization settings.
model architecture parameters
Careful modeling choices include selection of model size, layers, and vision encoder details.

axioms (2)

standard math Neural network optimization converges to useful minima under standard training regimes.
Implicit in any large-scale VLM training described.
domain assumption Human-collected image annotations provide higher quality signals than synthetic data from closed VLMs.
Core premise for why PixMo datasets enable superior performance.

pith-pipeline@v0.9.0 · 5757 in / 1417 out tokens · 49543 ms · 2026-05-15T01:50:34.877948+00:00 · methodology

discussion (0)

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 conditional novelty 6.0

Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 unverdicted novelty 6.0

Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
cs.AI 2026-04 unverdicted novelty 6.0

PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
cs.CV 2026-04 unverdicted novelty 6.0

ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Tran...
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
cs.RO 2026-04 unverdicted novelty 6.0

A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
cs.RO 2025-10 unverdicted novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
Perception Encoder: The best visual embeddings are not at the output of the network
cs.CV 2025-04 unverdicted novelty 6.0

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
SmolVLM: Redefining small and efficient multimodal models
cs.AI 2025-04 unverdicted novelty 6.0

SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Pixtral 12B
cs.CV 2024-10 unverdicted novelty 6.0

Pixtral-12B is a 12B multimodal LLM with a custom vision encoder that ingests images at native resolution and aspect ratio, achieving leading benchmark results among open models while preserving text capabilities.
Visibility-Aware Mobile Grasping in Dynamic Environments
cs.RO 2026-05 unverdicted novelty 5.0

A visibility-aware mobile grasping system with iterative whole-body planning and behavior-tree subgoal generation achieves 68.8% success in unknown static and 58% in dynamic environments, outperforming a baseline by 2...
UniMesh: Unifying 3D Mesh Understanding and Generation
cs.CV 2026-04 unverdicted novelty 5.0

UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
cs.AI 2026-04 unverdicted novelty 5.0

Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
cs.CL 2025-03 unverdicted novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
cs.CV 2024-12 accept novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
cs.CV 2026-05 unverdicted novelty 4.0

Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt augmentation and preprocessing offering only partial mitigation.
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
cs.CV 2026-05 unverdicted novelty 4.0

Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.
Visibility-Aware Mobile Grasping in Dynamic Environments
cs.RO 2026-05 unverdicted novelty 4.0

A unified visibility-aware mobile grasping system using whole-body planning, active perception, and behavior trees improves success rates in unknown static and dynamic environments.
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

137 extracted references · 137 canonical work pages · cited by 20 Pith papers · 26 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadal- lah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024. 6, 14, 20

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

TallyQA: Answering complex counting questions

Manoj Acharya, Kushal Kafle, and Christopher Kanan. TallyQA: Answering complex counting questions. In AAAI, 2019. 5

work page 2019
[3]

Pixtral 12B

Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, So- ham Ghosh, Am ´elie H´eliou, Paul Jacob, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024. 6, 14, 20

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Yi: Open Foundation Models by 01.AI

01. AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, and more. Yi: Open foundation models by 01.ai. arXiv preprint arXiv:2403.04652, 2024. 14

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

The Llama 3 Herd of Models

Meta AI. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6, 12, 14, 15, 17, 19, 20

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Mil- lican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022. 19

work page 2022
[7]

The claude 3 model family: Opus, sonnet, haiku, 2024

Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. 3, 6, 14, 18, 19

work page 2024
[8]

Layer normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In NeurIPS Deep Learning Symposium, 2016. 10

work page 2016
[9]

Fuyu-8b: A multimodal architecture for ai agents, 2023

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Au- gustus Odena, Arushi Somani, and Sa ˘gnak Tas ¸ırlar. Fuyu-8b: A multimodal architecture for ai agents, 2023. 19

work page 2023
[10]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, Andr ´e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias B...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Scene text visual question answering

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, 2019. 5

work page 2019
[12]

Honeybee: Locality-enhanced projector for multimodal llm

Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal llm. In CVPR, 2024. 19

work page 2024
[13]

H., Chen, S., Zhang, R., Chen, J., Wu, X., Zhang, Z., Chen, Z., Li, J., Wan, X., and Wang, B

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684 ,

work page arXiv
[14]

Evlm: An efficient vision-language model for visual understanding

Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, and Di Zhang. Evlm: An efficient vision-language model for visual understanding. arXiv preprint arXiv:2407.14177, 2024. 19

work page arXiv 2024
[15]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Ji- aqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improv- ing large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 1, 17, 20

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Hen- rique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavari...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 2, 4, 11

work page internal anchor Pith review Pith/arXiv arXiv 2015
[18]

Pali-3 vi- sion language models: Smaller, faster, stronger

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul V oigtlaender, Basil Mustafa, Sebastian Good- man, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. Pali-3 vi- sion language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023. 19

work page arXiv 2023
[19]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xi- angchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Worts- man, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023. 18

work page 2023
[21]

Chatbot arena: An open platform for evaluating LLMs by human preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. In ICML, 2024. 6, 13

work page 2024
[22]

Mobilevlm : A fast, strong and open vision language assistant for mobile devices

Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, and Chunhua Shen. Mobilevlm : A fast, strong and open vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023. 19

work page arXiv 2023
[23]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sab- harwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. 16

work page internal anchor Pith review Pith/arXiv arXiv 2018
[24]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Ja- cob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schul- man. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 16

work page internal anchor Pith review Pith/arXiv arXiv 2021
[25]

On implementing 2d rectangular assignment algo- rithms

David F Crouse. On implementing 2d rectangular assignment algo- rithms. IEEE Transactions on Aerospace and Electronic Systems ,

work page
[26]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. In NeurIPS, 2023. 18

work page 2023
[27]

NVLM: Open frontier-class multimodal LLMs

Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NVLM: Open frontier-class multimodal LLMs. arXiv preprint arXiv:2409.11402, 2024. 5

work page arXiv 2024
[28]

FlashAttention-2: Faster attention with better parallelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024. 10 26

work page 2024
[29]

Fu, Stefano Ermon, Atri Rudra, and Christopher R´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022. 10

work page 2022
[30]

InternLM- XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. InternLM- XComposer2-4KHD: A Pioneering Large Vision-Language Mod...

work page arXiv 2024
[31]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa De- hghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 3

work page 2021
[32]

Vila2: Vila augmented vila

Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, and Hongxu Yin. Vila2: Vila augmented vila. arXiv preprint arXiv:2407.17453, 2024

work page arXiv 2024
[33]

Devise: A deep visual-semantic embedding model

Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. NeurIPS, 2013. 18

work page 2013
[34]

Vita: Towards open-source interactive omni multimodal llm

Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, et al. Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211, 2024. 19

work page arXiv 2024
[35]

Scaling synthetic data creation with 1,000,000,000 personas.arXiv:2406.20094, 2024

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024. 18

work page arXiv 2024
[36]

Making the V in VQA matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 5, 6

work page 2017
[37]

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rod- ney Kinney, Oyvind Tafjord, A. Jha, Hamish Ivison, Ian Magnus- son, Yizhong Wang, Shane Arora, David Atkinson, Russell Au- thur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Mer- rill, Jacob Daniel Morrison, Niklas Muennighoff, ...

work page 2024
[38]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021. 16

work page 2021
[39]

Mea- suring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Mea- suring mathematical problem solving with the math dataset. In NeurIPS Track on Datasets and Benchmarks, 2021. 16

work page 2021
[40]

Accu- mulated gradient normalization

Joeri R Hermans, Gerasimos Spanakis, and Rico M ¨ockel. Accu- mulated gradient normalization. In Asian Conference on Machine Learning, pages 439–454. PMLR, 2017. 10

work page 2017
[41]

Cogvlm2: Visual language models for image and video un- derstanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video un- derstanding. arXiv preprint arXiv:2408.16500, 2024. 18

work page arXiv 2024
[42]

mplug-docowl 1.5: Unified structure learning for ocr-free document understanding

Anwen Hu, Haiyang Xu, Jiabo Ye, Mingshi Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. In Findings of EMNLP, 2024

work page 2024
[43]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 18

work page 2021
[44]

Mantis: Interleaved multi-image instruction tuning

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024. 20

work page arXiv 2024
[45]

A shortest augmenting path algo- rithm for dense and sparse linear assignment problems.Computing,

Roy Jonker and Ton V olgenant. A shortest augmenting path algo- rithm for dense and sparse linear assignment problems.Computing,

work page
[46]

DVQA: Understanding data visualizations via question answering

Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. In CVPR, 2018. 5, 11, 20

work page 2018
[47]

FigureQA: An Annotated Figure Dataset for Visual Reasoning

Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, ´Akos K ´ad´ar, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017. 5, 20

work page internal anchor Pith review Pith/arXiv arXiv 2017
[48]

Prismatic vlms: Inves- tigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Inves- tigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865, 2024. 5, 19

work page arXiv 2024
[49]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 5, 6

work page 2016
[50]

Adam: A method for stochastic optimization

Diederik P Kingma. Adam: A method for stochastic optimization. In ICLR, 2015. 5, 9

work page 2015
[51]

Berg, Wan-Yen Lo, Piotr Doll´ar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexan- der C. Berg, Wan-Yen Lo, Piotr Doll´ar, and Ross Girshick. Segment anything. In ICCV, 2023. 7, 14

work page 2023
[52]

Shamma, Michael S

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Vi- sual genome: Connecting language and vision using crowdsourced dense image annotations.International Journal of Computer Vision, 123:32 – 73, 2016. 11, 20

work page 2016
[53]

The open images dataset v4: Unified image classification, object detection, and visual relation- ship detection at scale

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relation- ship detection at scale. IJCV, 2020. 13

work page 2020
[54]

Building and better understanding vision-language models: insights and future directions

Hugo Laurenc ¸on, Andr ´es Marafioti, Victor Sanh, and L ´eo Tron- chon. Building and better understanding vision-language models: insights and future directions. arXiv preprint arXiv:2408.12637 ,

work page arXiv
[55]

Unlocking the conversion of web screenshots into HTML code with the websight dataset

Hugo Laurenc ¸on, L´eo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the websight dataset. arXiv preprint arXiv:2403.09029, 2024. 20

work page arXiv 2024
[56]

Otterhd: A high-resolution multi-modality model

Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219, 2023. 19

work page arXiv 2023
[57]

Mimic-it: Multi-modal in-context instruction tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, C. Li, and Ziwei Liu. Mimic-it: Multi-modal in- context instruction tuning. arXiv preprint arXiv:2306.05425, 2023. 20

work page arXiv 2023
[58]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 19

work page internal anchor Pith review arXiv 2023
[59]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaV A-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 5, 6, 14, 18, 20

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Blip: Boot- strapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Boot- strapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 27

work page 2022
[61]

Covlm: Composing visual entities and relationships in large language models via communicative de- coding

Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, and Chuang Gan. Covlm: Composing visual entities and relationships in large language models via communicative de- coding. In ICLR, 2024. 20

work page 2024
[62]

On the effects of data scale on computer control agents.arXiv preprint arXiv:2406.03679, 2024

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on computer control agents. arXiv preprint arXiv:2406.03679, 2024. 5, 7

work page arXiv 2024
[63]

Monkey: Image resolution and text label are important things for large multi-modal models

Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. CVPR, 2024

work page 2024
[64]

Moe-llava: Mix- ture of experts for large vision-language models

Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. Moe-llava: Mix- ture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024. 19

work page arXiv 2024
[65]

Mi- crosoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Mi- crosoft coco: Common objects in context. In ECCV, 2014. 13

work page 2014
[66]

GRES: Generalized referring expression segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In CVPR, 2023. 18, 20

work page 2023
[67]

SPHINX-x: Scaling data and parameters for a family of multi- modal large language models

Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Yu Qiao, Hongsheng Li, and Peng Gao. SPHINX-x: Scaling data and parameters for a family of multi- modal large language models. In ICML, 2024. 19

work page 2024
[68]

Mmc: Ad- vancing multimodal chart understanding with large-scale instruc- tion tuning

Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. Mmc: Ad- vancing multimodal chart understanding with large-scale instruc- tion tuning. arXiv preprint arXiv:2311.10774, 2023. 20

work page arXiv 2023
[69]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 3, 5, 6, 14, 18, 19, 20

work page 2023
[70]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024. 3

work page 2024
[71]

Llava-next: Improved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. 12, 18, 20

work page 2024
[72]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 9

work page internal anchor Pith review Pith/arXiv arXiv 2016
[73]

Decoupled weight decay regu- larization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regu- larization. In ICLR, 2019. 5, 9

work page 2019
[74]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024. 19

work page internal anchor Pith review Pith/arXiv arXiv 2024
[75]

Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, 2024. 20

work page 2024
[76]

Learn to explain: Multimodal reasoning via thought chains for sci- ence question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for sci- ence question answering. In NeurIPS, 2022. 5

work page 2022
[77]

Dy- namic prompt learning via policy gradient for semi-structured math- ematical reasoning

Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dy- namic prompt learning via policy gradient for semi-structured math- ematical reasoning. In ICLR, 2023. 5

work page 2023
[78]

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Han- naneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024. 6

work page 2024
[79]

Cheap and quick: Efficient vision-language in- struction tuning for large language models

Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language in- struction tuning for large language models. In NeurIPS, 2023. 19

work page 2023
[80]

ExpertQA: Expert-curated questions and attributed answers

Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth. ExpertQA: Expert-curated questions and attributed answers. arXiv preprint arXiv:2309.07852, 2023. 20

work page arXiv 2023

Showing first 80 references.