pith. machine review for the scientific record.

arxiv: 2605.11405 · v2 · submitted 2026-05-12 · 💻 cs.LG

Recognition: unknown

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:24 UTC · model grok-4.3

classification: 💻 cs.LG
keywords: data curation · vision-language models · VLM benchmarks · training efficiency · out-of-distribution generalization · inference cost · model reliability

The pith

Data curation alone raises VLM performance by over 11 points across 20 benchmarks while using far less training compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how far careful selection and filtering of training data can push vision-language models when the model architecture, training recipe, and total compute are held fixed. Applying the curation pipeline to the single-image portion of MAmmoTH-VL yields average gains of 11.7 percentage points on 20 public benchmarks that cover grounding, VQA, OCR, spatial reasoning, charts, math, and multi-image tasks, plus 11.3 points across the nine axes of their DatBench suite. The same models also become more consistent across random seeds, generalize better to out-of-distribution and multi-image inputs, give more honest and specific open-ended answers, and reach higher accuracy at lower inference cost than matched baselines.

Core claim

By applying a data curation pipeline to the MAmmoTH-VL single-image subset while keeping architecture and training fixed, the resulting models achieve +11.7pp average gain on 20 VLM benchmarks and +11.3pp on DatBench, surpass InternVL3.5-2B by 9.9pp at roughly 17 times less training compute, close the gap to Qwen3-VL-2B within 1.8pp at 87 times less compute, reduce per-capability variance by about 67 percent, improve OOD averages by 7.2pp, produce more honest and concise responses on open-ended queries, and deliver higher accuracy at lower response FLOPs at 1B, 2B, and 4B scales.

What carries the argument

The data curation pipeline that filters and selects high-quality single-image training examples from the MAmmoTH-VL dataset.
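
To make that concrete, the sketch below shows the general shape of a filter-and-select pass over image-text pairs. The quality signals named here (image-text CLIP similarity, caption length, OCR density) echo the metrics the simulated rebuttal mentions further down the page, but the thresholds, the `Example` fields, and the top-k selection step are illustrative assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Example:
    image_path: str
    caption: str
    clip_sim: float      # precomputed image-text cosine similarity
    ocr_density: float   # fraction of caption characters coming from OCR text

# Hypothetical thresholds -- stand-ins, not values reported by the paper.
MIN_CLIP_SIM = 0.28
MIN_CAPTION_TOKENS = 5
MAX_OCR_DENSITY = 0.9

def passes_filters(ex: Example) -> bool:
    """Cheap per-example quality gates applied before any ranking."""
    if ex.clip_sim < MIN_CLIP_SIM:                    # weakly aligned image-text pair
        return False
    if len(ex.caption.split()) < MIN_CAPTION_TOKENS:  # near-empty caption
        return False
    if ex.ocr_density > MAX_OCR_DENSITY:              # caption is mostly a raw OCR dump
        return False
    return True

def curate(pool: Iterable[Example], budget: int) -> List[Example]:
    """Filter the pool, then keep the highest-similarity examples up to a sample budget."""
    kept = [ex for ex in pool if passes_filters(ex)]
    kept.sort(key=lambda ex: ex.clip_sim, reverse=True)
    return kept[:budget]
```

In practice the per-example scores would come from model-based scorers run over the MAmmoTH-VL single-image subset; this sketch only fixes the control flow of filter-then-select.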

If this is right

  • Per-capability standard deviation across training seeds drops by roughly 67 percent and the gains persist across a 4k-to-16k context-length sweep (a toy version of this computation follows this list).
  • The nine-eval out-of-distribution average rises by 7.2pp and multi-image BLINK improves by 3.09pp despite single-image-only training.
  • Across roughly 1,100 open-ended queries the curated 2B model is more honest, specific, concise, and less refusal-prone than matched baselines.
  • At every tested scale the curated model raises accuracy while lowering response FLOPs relative to the matched-compute baseline.
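
For the first bullet, this is the computation being claimed: take the standard deviation of each capability's score across training seeds, average those standard deviations, and compare baseline to curated. The scores below are toy placeholders, not values from the paper.

```python
import statistics

# Hypothetical per-capability accuracies over three training seeds (toy numbers).
baseline = {"OCR": [61.0, 58.2, 63.5], "VQA": [55.1, 52.4, 57.9]}
curated  = {"OCR": [72.3, 71.0, 73.1], "VQA": [66.5, 65.2, 67.0]}

def mean_seed_std(runs: dict) -> float:
    """Average, over capabilities, of the standard deviation across training seeds."""
    return statistics.mean(statistics.stdev(scores) for scores in runs.values())

reduction = 1 - mean_seed_std(curated) / mean_seed_std(baseline)
print(f"per-capability seed std reduction: {reduction:.0%}")  # ~63% with these toy numbers
```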

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the curation method generalizes across datasets and scales, research attention may shift from raw model size toward systematic data quality work.
  • Similar pipelines could be applied to other VLM pretraining corpora to test whether comparable accuracy-compute trade-offs appear at larger parameter counts.
  • The combination of higher accuracy and lower inference FLOPs suggests curation can simultaneously improve capability and deployment cost (see the dominance check after this list).
  • The observed improvements in honesty and specificity on open-ended queries indicate curation can influence response style beyond benchmark accuracy.
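
The accuracy-versus-inference-cost point above reduces to a Pareto-dominance check: a curated model "wins" if it is at least as accurate and no more expensive per response, and strictly better on at least one axis. The operating points below are hypothetical, not figures from the paper.

```python
from typing import NamedTuple

class OperatingPoint(NamedTuple):
    accuracy: float        # average benchmark accuracy (higher is better)
    response_flops: float  # compute spent per response (lower is better)

def pareto_dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b on both axes and strictly better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.response_flops <= b.response_flops
    strictly_better = a.accuracy > b.accuracy or a.response_flops < b.response_flops
    return no_worse and strictly_better

# Hypothetical operating points for a curated vs. matched-compute baseline model.
curated_2b  = OperatingPoint(accuracy=62.4, response_flops=1.8e12)
baseline_2b = OperatingPoint(accuracy=55.0, response_flops=2.1e12)
assert pareto_dominates(curated_2b, baseline_2b)
```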

Load-bearing premise

That the reported gains are produced solely by the data curation pipeline and not by any unstated differences in training dynamics or evaluation protocols.

What would settle it

Retrain the exact baseline model on the original uncurated MAmmoTH-VL data using identical random seeds, hyperparameters, and evaluation code to check whether the performance gap disappears.
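
A minimal sketch of that control, under the assumption that training is driven by a single configuration object: the two runs share every hyperparameter, seed, and schedule, and differ only in which data manifest they read. The field names and values are hypothetical, not the paper's config schema.

```python
# Hypothetical configs for the two runs; the only field that differs is the data manifest.
base_config = {
    "model": "vlm-2b",
    "optimizer": "adamw",
    "lr_schedule": "cosine",
    "peak_lr": 1e-4,
    "global_batch_size": 512,
    "train_tokens": 100_000_000_000,  # compute matched by token count
    "seed": 1234,
}

baseline_run = {**base_config, "data_manifest": "mammoth_vl_single_image_full.jsonl"}
curated_run  = {**base_config, "data_manifest": "mammoth_vl_single_image_curated.jsonl"}

# Controlled-comparison invariant: every key except the data manifest must match.
diff = {k for k in baseline_run if baseline_run[k] != curated_run[k]}
assert diff == {"data_manifest"}, f"unexpected differences between runs: {diff}"
```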

read the original abstract

Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM performance, holding architecture, training recipe, and compute fixed and varying only the training data. Our pipeline, applied to the MAmmoTH-VL single-image subset, lifts performance by +11.7pp on average across 20 public VLM benchmarks (spanning grounding, VQA, OCR/documents, captioning, spatial/3D, counting, charts, math, brand-ID, and multi-image reasoning) and by +11.3pp on average across all nine capability axes of DatBench, our high-fidelity VLM eval suite. At 2B, our curated model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute and closes the gap to Qwen3-VL-2B to within 1.8pp at ~87x less compute, from pretraining alone. Beyond accuracy, curation delivers four further properties: (1) Reliability: per-capability std across training seeds drops by ~67% and the lift survives a 4k-to-16k context-length sweep; (2) OOD generalization: the 9-eval OOD average rises by +7.2pp, and multi-image BLINK rises by +3.09pp despite single-image-only training, with Visual Correspondence gaining +11.8pp; (3) Behavioral gains beyond benchmarks: across ~1,100 open-ended queries the curated 2B is more honest and more specific than the matched-compute baseline, and more concise and less refusal-prone than a frontier 2B reference; (4) Pareto-dominance on inference cost: at every scale (1B, 2B, 4B) the curated model raises accuracy while lowering response FLOPs vs. the matched-compute baseline, and the curated 4B matches near-frontier accuracy at 3.3x lower response FLOPs than Qwen3-VL-4B. Data curation is a high-leverage tool for building better VLMs, reaching near-frontier accuracy at up to ~150x less training compute.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that a data curation pipeline applied to the MAmmoTH-VL single-image subset, while holding VLM architecture, training recipe, and compute fixed, delivers +11.7pp average gains across 20 public benchmarks (grounding, VQA, OCR, captioning, spatial, counting, charts, math, brand-ID, multi-image) and +11.3pp across all nine DatBench axes. It further reports reduced seed-to-seed variance, improved OOD generalization (including +3.09pp on multi-image BLINK despite single-image training), more honest/specific/concise open-ended behavior, and Pareto improvements in accuracy vs. inference FLOPs at 1B/2B/4B scales, enabling near-frontier performance at up to ~150x lower training compute.

Significance. If the matched-compute attribution holds, the result is significant: it positions data curation as a high-leverage, compute-efficient lever for VLMs that can close much of the gap to frontier models without architectural or scale changes. The multi-axis evaluation (reliability, OOD, behavioral, inference cost) and consistent gains across diverse benchmarks strengthen the case beyond single-metric accuracy. The empirical design with fixed controls is a methodological strength that, if fully verified, would make the findings actionable for the field.

major comments (3)
  1. [Abstract and Methods] The central claim attributes all gains to data curation alone under identical architecture, recipe, and compute. However, the manuscript provides no explicit confirmation (e.g., hyperparameter tables, seed values, batch-construction logic, data-loading order, or optimizer details) that effective training dynamics were unchanged between baseline and curated runs. This verification is load-bearing for crediting the +11.7pp lift solely to curation rather than subtle implementation differences.
  2. [§3.2 Data Curation Pipeline] The curation criteria, quality metrics, filtering thresholds, and selection procedures are described at a high level without quantitative details, example filtered samples, or ablation on individual curation steps. This lack of specificity is load-bearing for reproducibility and for confirming that the reported gains generalize beyond the particular MAmmoTH-VL subset and are not artifacts of unstated implementation choices.
  3. [Results and Evaluation] While average improvements are highlighted, the manuscript lacks per-benchmark statistical significance tests, confidence intervals, or variance estimates across the 20 benchmarks and 9 DatBench axes. Given the breadth of capabilities tested, this weakens the ability to rule out that gains are driven by a subset of benchmarks or evaluation-protocol sensitivities.
minor comments (3)
  1. [Figures] Figure captions and Pareto plots should explicitly annotate the training compute (tokens or FLOPs) for each scale (1B/2B/4B) and reference model to facilitate direct comparison.
  2. [Abstract] The abstract introduces several acronyms (VLM, OOD, DatBench) without expansion on first use; a brief parenthetical definition would improve readability for a broad audience.
  3. [§3.2] Consider adding a short table summarizing the exact number of samples retained after each curation stage for transparency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The three major comments identify important areas where additional detail and rigor will strengthen the manuscript. We address each point below and will incorporate the suggested revisions in the next version.

read point-by-point responses
  1. Referee: [Abstract and Methods] The central claim attributes all gains to data curation alone under identical architecture, recipe, and compute. However, the manuscript provides no explicit confirmation (e.g., hyperparameter tables, seed values, batch-construction logic, data-loading order, or optimizer details) that effective training dynamics were unchanged between baseline and curated runs. This verification is load-bearing for crediting the +11.7pp lift solely to curation rather than subtle implementation differences.

    Authors: We agree that explicit verification of matched training dynamics is essential. The manuscript states in Section 3.1 that the same architecture, optimizer, learning-rate schedule, batch size, and random seeds were used for both the baseline and curated runs, with compute matched by token count. However, we did not include a consolidated hyperparameter table or low-level details such as data-loading order. In the revision we will add an appendix table listing all hyperparameters, seeds, batch-construction logic, and optimizer settings, together with a short statement confirming that the only controlled difference between runs was the training data subset. revision: yes

  2. Referee: [§3.2 Data Curation Pipeline] The curation criteria, quality metrics, filtering thresholds, and selection procedures are described at a high level without quantitative details, example filtered samples, or ablation on individual curation steps. This lack of specificity is load-bearing for reproducibility and for confirming that the reported gains generalize beyond the particular MAmmoTH-VL subset and are not artifacts of unstated implementation choices.

    Authors: We acknowledge that Section 3.2 currently presents the pipeline at a high level. To improve reproducibility we will expand the section with the exact quality metrics (CLIP similarity, caption length, OCR density, etc.), numerical filtering thresholds, and the precise selection procedure. We will also include representative examples of filtered-out and retained samples and add an ablation table quantifying the contribution of each curation step to the final gains. These additions will make the pipeline fully specified and allow readers to assess generalization beyond the MAmmoTH-VL subset. revision: yes

  3. Referee: [Results and Evaluation] While average improvements are highlighted, the manuscript lacks per-benchmark statistical significance tests, confidence intervals, or variance estimates across the 20 benchmarks and 9 DatBench axes. Given the breadth of capabilities tested, this weakens the ability to rule out that gains are driven by a subset of benchmarks or evaluation-protocol sensitivities.

    Authors: The manuscript reports average gains and notes reduced seed-to-seed variance, but does not provide per-benchmark confidence intervals or formal significance tests. In the revised version we will add per-benchmark standard deviations where multiple seeds were run, include 95% confidence intervals for the main averages, and perform paired t-tests (or Wilcoxon tests where appropriate) on the 20 benchmarks and 9 DatBench axes to establish statistical significance of the reported lifts. revision: yes
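
A sketch of the analysis the response commits to, assuming paired per-benchmark scores for the baseline and curated models: a paired t-test and a Wilcoxon signed-rank test over the 20 benchmark lifts, plus a 95% confidence interval on the mean lift. The per-benchmark numbers below are placeholders constructed to average roughly +11.7pp; they are not the paper's reported scores.

```python
import numpy as np
from scipy import stats

# Hypothetical per-benchmark accuracies for the same 20 benchmarks (paired by index).
baseline = np.array([48.1, 55.3, 60.2, 41.7, 52.9, 63.0, 44.5, 58.8, 50.1, 47.3,
                     61.4, 39.8, 56.6, 49.2, 53.7, 45.9, 62.1, 42.4, 57.5, 51.0])
curated  = baseline + np.array([12.3,  9.8, 11.0, 14.2, 10.5,  8.9, 13.1, 12.7, 11.9, 10.2,
                                 9.5, 15.0, 11.4, 12.0, 10.8, 13.6,  9.1, 14.4, 10.9, 12.5])

t_stat, t_p = stats.ttest_rel(curated, baseline)   # paired t-test on the per-benchmark lift
w_stat, w_p = stats.wilcoxon(curated, baseline)    # nonparametric alternative

# 95% confidence interval for the mean per-benchmark lift.
lift = curated - baseline
ci = stats.t.interval(0.95, df=len(lift) - 1, loc=lift.mean(), scale=stats.sem(lift))
print(f"mean lift {lift.mean():.1f}pp, 95% CI {ci}, paired-t p={t_p:.2g}, wilcoxon p={w_p:.2g}")
```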

Circularity Check

0 steps flagged

No circularity: purely empirical results from matched-compute runs

full rationale

The paper reports benchmark gains from applying a data curation pipeline to MAmmoTH-VL while holding architecture, training recipe, and compute fixed. No equations, derivations, fitted parameters, or predictions appear in the abstract or described claims. Performance lifts (+11.7pp average) are measured directly on external benchmarks rather than derived from self-referential definitions or self-citations. The central attribution to curation alone is an empirical claim open to verification via replication, not a reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the chosen benchmarks as measures of capability and on the assumption that the curation pipeline improves data quality in a generalizable way; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: the 20 public VLM benchmarks and DatBench accurately measure the intended capabilities without substantial bias or leakage.
    All reported lifts are measured using these benchmarks.

pith-pipeline@v0.9.0 · 5869 in / 1316 out tokens · 40060 ms · 2026-05-14T21:24:14.830513+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 14 internal anchors
