Recognition: 2 theorem links
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Pith reviewed 2026-05-15 01:50 UTC · model grok-4.3
The pith
New independently collected datasets enable open vision-language models that outperform most proprietary alternatives on benchmarks and human evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that a collection of human-gathered datasets allows construction of state-of-the-art open vision-language models without synthetic data from proprietary systems. These datasets comprise highly detailed image captions for pre-training, free-form image question-answer pairs for fine-tuning, and an innovative 2D pointing dataset. The resulting models, particularly the 72 billion parameter variant, achieve performance that exceeds other open-weight models and larger proprietary models including Claude 3.5 Sonnet and Gemini 1.5 variants, ranking second only to the leading closed model on both academic benchmarks and human evaluations.
What carries the argument
The PixMo datasets of highly detailed image captions, free-form image question-answer pairs, and 2D pointing data, all collected without the use of external vision-language models, which serve as the primary training resource enabling performance gains.
If this is right
- Open-weight vision-language models can surpass several larger proprietary systems without depending on distillation from closed models.
- Data collection focused on detailed captions, free-form questions, and pointing annotations yields measurable gains in both automated metrics and human judgments.
- Full release of weights, datasets, and code enables direct replication and extension by the broader community.
- Training pipelines that avoid external model-generated data restore independent development paths for multimodal systems.
Where Pith is reading between the lines
- The same independent collection approach could be extended to create datasets for related tasks such as video or document understanding.
- Full openness of both data and weights may encourage standardized evaluation practices that reduce hidden dependencies across the field.
- Testing whether the 2D pointing component specifically improves spatial reasoning accuracy would isolate one data contribution.
- Researchers could combine these datasets with alternative model scales or architectures to determine the relative importance of data versus design choices.
Load-bearing premise
The performance gains arise primarily from the quality and independence of the newly collected datasets rather than from specific undisclosed modeling choices or training pipeline details.
What would settle it
Retrain the same model architecture using only synthetic data generated by existing proprietary vision-language models in place of the new datasets and measure whether benchmark scores and human evaluation results drop to levels typical of prior open models.
read the original abstract
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Molmo, a family of open-weight vision-language models, and the PixMo datasets (highly detailed image captions for pre-training, free-form image QA for fine-tuning, and a novel 2D pointing dataset), all collected without using external VLMs. The authors claim that their best 72B model achieves state-of-the-art results among open-weight and open-data models, outperforming several larger proprietary models (Claude 3.5 Sonnet, Gemini 1.5 Pro, Gemini 1.5 Flash) and ranking second only to GPT-4o on academic benchmarks and a large-scale human evaluation. Model weights, datasets, and code are released at https://molmo.allenai.org/blog.
Significance. If the performance claims hold, the work is significant for supplying fully open weights, data, and code that enable the community to study and replicate strong VLMs without distilling from closed models. The independent collection of PixMo data (detailed captions, free-form QA, and 2D pointing) offers a concrete alternative to synthetic data pipelines and can support further research on data quality for vision-language modeling.
major comments (1)
- [Abstract] Abstract: the central claim that PixMo data quality is 'most critically' responsible for the reported gains (outperforming open peers and several proprietary models) is not supported by isolating ablations. No experiments hold model architecture, optimizer schedule, data volume, and training pipeline fixed while swapping PixMo for standard open corpora or synthetic data from closed VLMs; therefore the causal attribution remains unisolated.
minor comments (1)
- [Abstract] Abstract: benchmark results are described only qualitatively ('strong benchmark and human evaluation results'); adding a brief table or sentence with key metrics, baselines, and error bars would improve immediate verifiability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address the concern about the strength of evidence for attributing performance gains to PixMo data quality below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that PixMo data quality is 'most critically' responsible for the reported gains (outperforming open peers and several proprietary models) is not supported by isolating ablations. No experiments hold model architecture, optimizer schedule, data volume, and training pipeline fixed while swapping PixMo for standard open corpora or synthetic data from closed VLMs; therefore the causal attribution remains unisolated.
Authors: We agree that the manuscript lacks a single, fully isolated ablation that holds model architecture, optimizer schedule, data volume, and training pipeline exactly fixed while only swapping the data source between PixMo and standard open corpora or synthetic data from closed VLMs. Performing such a controlled swap at the 72B scale is computationally prohibitive. That said, we provide supporting evidence via smaller-scale controlled ablations (reported in Section 4) that compare PixMo captions against LAION-5B and COCO while keeping architecture and training recipe fixed, as well as direct performance comparisons against open models trained on synthetic data from proprietary VLMs. Human preference studies further corroborate the higher quality of PixMo annotations. To address the referee's point, we will revise the abstract to replace 'most critically' with 'significantly' and add an explicit limitations paragraph acknowledging the absence of a full-scale isolating ablation. revision: yes
Circularity Check
No circularity: empirical results from new independent datasets
full rationale
The paper's claims rest on the creation of new PixMo datasets (detailed captions, free-form QA, 2D pointing) collected without external VLMs, followed by standard training of Molmo models using described architecture choices and pipeline tuning. Performance is evaluated on public benchmarks and human studies. No equations, derivations, or first-principles predictions are presented that reduce to fitted parameters or self-referential definitions. Self-citations, if present, support background context rather than load-bearing justification for the SOTA results. The derivation chain is self-contained empirical work with no reduction by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- training hyperparameters
- model architecture parameters
axioms (2)
- standard math Neural network optimization converges to useful minima under standard training regimes.
- domain assumption Human-collected image annotations provide higher quality signals than synthetic data from closed VLMs.
Forward citations
Cited by 23 Pith papers
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
-
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
-
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
-
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Tran...
-
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
-
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
Perception Encoder: The best visual embeddings are not at the output of the network
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
SmolVLM: Redefining small and efficient multimodal models
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Pixtral 12B
Pixtral-12B is a 12B multimodal LLM with a custom vision encoder that ingests images at native resolution and aspect ratio, achieving leading benchmark results among open models while preserving text capabilities.
-
Visibility-Aware Mobile Grasping in Dynamic Environments
A visibility-aware mobile grasping system with iterative whole-body planning and behavior-tree subgoal generation achieves 68.8% success in unknown static and 58% in dynamic environments, outperforming a baseline by 2...
-
UniMesh: Unifying 3D Mesh Understanding and Generation
UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
-
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
-
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt augmentation and preprocessing offering only partial mitigation.
-
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.
-
Visibility-Aware Mobile Grasping in Dynamic Environments
A unified visibility-aware mobile grasping system using whole-body planning, active perception, and behavior trees improves success rates in unknown static and dynamic environments.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
Reference graph
Works this paper leans on
-
[1]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadal- lah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024. 6, 14, 20
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
TallyQA: Answering complex counting questions
Manoj Acharya, Kushal Kafle, and Christopher Kanan. TallyQA: Answering complex counting questions. In AAAI, 2019. 5
work page 2019
-
[3]
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, So- ham Ghosh, Am ´elie H´eliou, Paul Jacob, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024. 6, 14, 20
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Yi: Open Foundation Models by 01.AI
01. AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, and more. Yi: Open foundation models by 01.ai. arXiv preprint arXiv:2403.04652, 2024. 14
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Meta AI. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6, 12, 14, 15, 17, 19, 20
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Mil- lican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022. 19
work page 2022
-
[7]
The claude 3 model family: Opus, sonnet, haiku, 2024
Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. 3, 6, 14, 18, 19
work page 2024
-
[8]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In NeurIPS Deep Learning Symposium, 2016. 10
work page 2016
-
[9]
Fuyu-8b: A multimodal architecture for ai agents, 2023
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Au- gustus Odena, Arushi Somani, and Sa ˘gnak Tas ¸ırlar. Fuyu-8b: A multimodal architecture for ai agents, 2023. 19
work page 2023
-
[10]
PaliGemma: A versatile 3B VLM for transfer
Lucas Beyer, Andreas Steiner, Andr ´e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias B...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Scene text visual question answering
Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, 2019. 5
work page 2019
-
[12]
Honeybee: Locality-enhanced projector for multimodal llm
Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal llm. In CVPR, 2024. 19
work page 2024
-
[13]
H., Chen, S., Zhang, R., Chen, J., Wu, X., Zhang, Z., Chen, Z., Li, J., Wan, X., and Wang, B
Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684 ,
-
[14]
Evlm: An efficient vision-language model for visual understanding
Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, and Di Zhang. Evlm: An efficient vision-language model for visual understanding. arXiv preprint arXiv:2407.14177, 2024. 19
-
[15]
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Ji- aqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improv- ing large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 1, 17, 20
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Hen- rique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavari...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[17]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 2, 4, 11
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[18]
Pali-3 vi- sion language models: Smaller, faster, stronger
Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul V oigtlaender, Basil Mustafa, Sebastian Good- man, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. Pali-3 vi- sion language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023. 19
-
[19]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xi- angchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Reproducible scaling laws for contrastive language-image learning
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Worts- man, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023. 18
work page 2023
-
[21]
Chatbot arena: An open platform for evaluating LLMs by human preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. In ICML, 2024. 6, 13
work page 2024
-
[22]
Mobilevlm : A fast, strong and open vision language assistant for mobile devices
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, and Chunhua Shen. Mobilevlm : A fast, strong and open vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023. 19
-
[23]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sab- harwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. 16
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[24]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Ja- cob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schul- man. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 16
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[25]
On implementing 2d rectangular assignment algo- rithms
David F Crouse. On implementing 2d rectangular assignment algo- rithms. IEEE Transactions on Aerospace and Electronic Systems ,
-
[26]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. In NeurIPS, 2023. 18
work page 2023
-
[27]
NVLM: Open frontier-class multimodal LLMs
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NVLM: Open frontier-class multimodal LLMs. arXiv preprint arXiv:2409.11402, 2024. 5
-
[28]
FlashAttention-2: Faster attention with better parallelism and work partitioning
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024. 10 26
work page 2024
-
[29]
Fu, Stefano Ermon, Atri Rudra, and Christopher R´e
Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022. 10
work page 2022
-
[30]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. InternLM- XComposer2-4KHD: A Pioneering Large Vision-Language Mod...
-
[31]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa De- hghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 3
work page 2021
-
[32]
Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, and Hongxu Yin. Vila2: Vila augmented vila. arXiv preprint arXiv:2407.17453, 2024
-
[33]
Devise: A deep visual-semantic embedding model
Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. NeurIPS, 2013. 18
work page 2013
-
[34]
Vita: Towards open-source interactive omni multimodal llm
Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, et al. Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211, 2024. 19
-
[35]
Scaling synthetic data creation with 1,000,000,000 personas.arXiv:2406.20094, 2024
Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024. 18
-
[36]
Making the V in VQA matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 5, 6
work page 2017
-
[37]
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rod- ney Kinney, Oyvind Tafjord, A. Jha, Hamish Ivison, Ian Magnus- son, Yizhong Wang, Shane Arora, David Atkinson, Russell Au- thur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Mer- rill, Jacob Daniel Morrison, Niklas Muennighoff, ...
work page 2024
-
[38]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021. 16
work page 2021
-
[39]
Mea- suring mathematical problem solving with the math dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Mea- suring mathematical problem solving with the math dataset. In NeurIPS Track on Datasets and Benchmarks, 2021. 16
work page 2021
-
[40]
Accu- mulated gradient normalization
Joeri R Hermans, Gerasimos Spanakis, and Rico M ¨ockel. Accu- mulated gradient normalization. In Asian Conference on Machine Learning, pages 439–454. PMLR, 2017. 10
work page 2017
-
[41]
Cogvlm2: Visual language models for image and video un- derstanding
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video un- derstanding. arXiv preprint arXiv:2408.16500, 2024. 18
-
[42]
mplug-docowl 1.5: Unified structure learning for ocr-free document understanding
Anwen Hu, Haiyang Xu, Jiabo Ye, Mingshi Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. In Findings of EMNLP, 2024
work page 2024
-
[43]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 18
work page 2021
-
[44]
Mantis: Interleaved multi-image instruction tuning
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024. 20
-
[45]
A shortest augmenting path algo- rithm for dense and sparse linear assignment problems.Computing,
Roy Jonker and Ton V olgenant. A shortest augmenting path algo- rithm for dense and sparse linear assignment problems.Computing,
-
[46]
DVQA: Understanding data visualizations via question answering
Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. In CVPR, 2018. 5, 11, 20
work page 2018
-
[47]
FigureQA: An Annotated Figure Dataset for Visual Reasoning
Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, ´Akos K ´ad´ar, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017. 5, 20
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[48]
Prismatic vlms: Inves- tigating the design space of visually-conditioned language models
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Inves- tigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865, 2024. 5, 19
-
[49]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 5, 6
work page 2016
-
[50]
Adam: A method for stochastic optimization
Diederik P Kingma. Adam: A method for stochastic optimization. In ICLR, 2015. 5, 9
work page 2015
-
[51]
Berg, Wan-Yen Lo, Piotr Doll´ar, and Ross Girshick
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexan- der C. Berg, Wan-Yen Lo, Piotr Doll´ar, and Ross Girshick. Segment anything. In ICCV, 2023. 7, 14
work page 2023
-
[52]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Vi- sual genome: Connecting language and vision using crowdsourced dense image annotations.International Journal of Computer Vision, 123:32 – 73, 2016. 11, 20
work page 2016
-
[53]
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relation- ship detection at scale. IJCV, 2020. 13
work page 2020
-
[54]
Building and better understanding vision-language models: insights and future directions
Hugo Laurenc ¸on, Andr ´es Marafioti, Victor Sanh, and L ´eo Tron- chon. Building and better understanding vision-language models: insights and future directions. arXiv preprint arXiv:2408.12637 ,
-
[55]
Unlocking the conversion of web screenshots into HTML code with the websight dataset
Hugo Laurenc ¸on, L´eo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the websight dataset. arXiv preprint arXiv:2403.09029, 2024. 20
-
[56]
Otterhd: A high-resolution multi-modality model
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219, 2023. 19
-
[57]
Mimic-it: Multi-modal in-context instruction tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, C. Li, and Ziwei Liu. Mimic-it: Multi-modal in- context instruction tuning. arXiv preprint arXiv:2306.05425, 2023. 20
-
[58]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 19
work page internal anchor Pith review arXiv 2023
-
[59]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaV A-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 5, 6, 14, 18, 20
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[60]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Boot- strapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 27
work page 2022
-
[61]
Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, and Chuang Gan. Covlm: Composing visual entities and relationships in large language models via communicative de- coding. In ICLR, 2024. 20
work page 2024
-
[62]
On the effects of data scale on computer control agents.arXiv preprint arXiv:2406.03679, 2024
Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on computer control agents. arXiv preprint arXiv:2406.03679, 2024. 5, 7
-
[63]
Monkey: Image resolution and text label are important things for large multi-modal models
Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. CVPR, 2024
work page 2024
-
[64]
Moe-llava: Mix- ture of experts for large vision-language models
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. Moe-llava: Mix- ture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024. 19
-
[65]
Mi- crosoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Mi- crosoft coco: Common objects in context. In ECCV, 2014. 13
work page 2014
-
[66]
GRES: Generalized referring expression segmentation
Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In CVPR, 2023. 18, 20
work page 2023
-
[67]
SPHINX-x: Scaling data and parameters for a family of multi- modal large language models
Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Yu Qiao, Hongsheng Li, and Peng Gao. SPHINX-x: Scaling data and parameters for a family of multi- modal large language models. In ICML, 2024. 19
work page 2024
-
[68]
Mmc: Ad- vancing multimodal chart understanding with large-scale instruc- tion tuning
Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. Mmc: Ad- vancing multimodal chart understanding with large-scale instruc- tion tuning. arXiv preprint arXiv:2311.10774, 2023. 20
-
[69]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 3, 5, 6, 14, 18, 19, 20
work page 2023
-
[70]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024. 3
work page 2024
-
[71]
Llava-next: Improved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. 12, 18, 20
work page 2024
-
[72]
SGDR: Stochastic Gradient Descent with Warm Restarts
Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 9
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[73]
Decoupled weight decay regu- larization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regu- larization. In ICLR, 2019. 5, 9
work page 2019
-
[74]
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024. 19
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[75]
Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action
Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR, 2024. 20
work page 2024
-
[76]
Learn to explain: Multimodal reasoning via thought chains for sci- ence question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for sci- ence question answering. In NeurIPS, 2022. 5
work page 2022
-
[77]
Dy- namic prompt learning via policy gradient for semi-structured math- ematical reasoning
Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dy- namic prompt learning via policy gradient for semi-structured math- ematical reasoning. In ICLR, 2023. 5
work page 2023
-
[78]
MathVista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Han- naneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024. 6
work page 2024
-
[79]
Cheap and quick: Efficient vision-language in- struction tuning for large language models
Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language in- struction tuning for large language models. In NeurIPS, 2023. 19
work page 2023
-
[80]
ExpertQA: Expert-curated questions and attributed answers
Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth. ExpertQA: Expert-curated questions and attributed answers. arXiv preprint arXiv:2309.07852, 2023. 20
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.