pith. machine review for the scientific record.

arxiv: 2605.11405 · v2 · submitted 2026-05-12 · 💻 cs.LG

Recognition: unknown

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:24 UTC · model grok-4.3

classification: 💻 cs.LG
keywords: data curation · vision-language models · VLM benchmarks · training efficiency · out-of-distribution generalization · inference cost · model reliability

The pith

Data curation alone raises VLM performance by over 11 points across 20 benchmarks while using far less training compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how far careful selection and filtering of training data can push vision-language models when the model architecture, training recipe, and total compute are held fixed. Applying the curation pipeline to the single-image portion of MAmmoTH-VL yields average gains of 11.7 percentage points on 20 public benchmarks that cover grounding, VQA, OCR, spatial reasoning, charts, math, and multi-image tasks, plus 11.3 points across the nine axes of their DatBench suite. The same models also become more consistent across random seeds, generalize better to out-of-distribution and multi-image inputs, give more honest and specific open-ended answers, and reach higher accuracy at lower inference cost than matched baselines.

Core claim

By applying a data curation pipeline to the MAmmoTH-VL single-image subset while keeping architecture and training fixed, the resulting models achieve +11.7pp average gain on 20 VLM benchmarks and +11.3pp on DatBench, surpass InternVL3.5-2B by 9.9pp at roughly 17 times less training compute, close the gap to Qwen3-VL-2B within 1.8pp at 87 times less compute, reduce per-capability variance by about 67 percent, improve OOD averages by 7.2pp, produce more honest and concise responses on open-ended queries, and deliver higher accuracy at lower response FLOPs at 1B, 2B, and 4B scales.

What carries the argument

The data curation pipeline that filters and selects high-quality single-image training examples from the MAmmoTH-VL dataset.
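
To make that concrete, the sketch below shows the general shape of a filter-and-select pass over image-text pairs. The quality signals named here (image-text CLIP similarity, caption length, OCR density) echo the metrics the simulated rebuttal mentions further down the page, but the thresholds, the `Example` fields, and the top-k selection step are illustrative assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Example:
    image_path: str
    caption: str
    clip_sim: float      # precomputed image-text cosine similarity
    ocr_density: float   # fraction of caption characters coming from OCR text

# Hypothetical thresholds -- stand-ins, not values reported by the paper.
MIN_CLIP_SIM = 0.28
MIN_CAPTION_TOKENS = 5
MAX_OCR_DENSITY = 0.9

def passes_filters(ex: Example) -> bool:
    """Cheap per-example quality gates applied before any ranking."""
    if ex.clip_sim < MIN_CLIP_SIM:                    # weakly aligned image-text pair
        return False
    if len(ex.caption.split()) < MIN_CAPTION_TOKENS:  # near-empty caption
        return False
    if ex.ocr_density > MAX_OCR_DENSITY:              # caption is mostly a raw OCR dump
        return False
    return True

def curate(pool: Iterable[Example], budget: int) -> List[Example]:
    """Filter the pool, then keep the highest-similarity examples up to a sample budget."""
    kept = [ex for ex in pool if passes_filters(ex)]
    kept.sort(key=lambda ex: ex.clip_sim, reverse=True)
    return kept[:budget]
```

In practice the per-example scores would come from model-based scorers run over the MAmmoTH-VL single-image subset; this sketch only fixes the control flow of filter-then-select.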

If this is right

  • Per-capability standard deviation across training seeds drops by roughly 67 percent and the gains persist across a 4k-to-16k context-length sweep (a toy version of this computation follows this list).
  • The nine-eval out-of-distribution average rises by 7.2pp and multi-image BLINK improves by 3.09pp despite single-image-only training.
  • Across roughly 1,100 open-ended queries the curated 2B model is more honest, specific, concise, and less refusal-prone than matched baselines.
  • At every tested scale the curated model raises accuracy while lowering response FLOPs relative to the matched-compute baseline.
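
For the first bullet, this is the computation being claimed: take the standard deviation of each capability's score across training seeds, average those standard deviations, and compare baseline to curated. The scores below are toy placeholders, not values from the paper.

```python
import statistics

# Hypothetical per-capability accuracies over three training seeds (toy numbers).
baseline = {"OCR": [61.0, 58.2, 63.5], "VQA": [55.1, 52.4, 57.9]}
curated  = {"OCR": [72.3, 71.0, 73.1], "VQA": [66.5, 65.2, 67.0]}

def mean_seed_std(runs: dict) -> float:
    """Average, over capabilities, of the standard deviation across training seeds."""
    return statistics.mean(statistics.stdev(scores) for scores in runs.values())

reduction = 1 - mean_seed_std(curated) / mean_seed_std(baseline)
print(f"per-capability seed std reduction: {reduction:.0%}")  # ~63% with these toy numbers
```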

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the curation method generalizes across datasets and scales, research attention may shift from raw model size toward systematic data quality work.
  • Similar pipelines could be applied to other VLM pretraining corpora to test whether comparable accuracy-compute trade-offs appear at larger parameter counts.
  • The combination of higher accuracy and lower inference FLOPs suggests curation can simultaneously improve capability and deployment cost (see the dominance check after this list).
  • The observed improvements in honesty and specificity on open-ended queries indicate curation can influence response style beyond benchmark accuracy.
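
The accuracy-versus-inference-cost point above reduces to a Pareto-dominance check: a curated model "wins" if it is at least as accurate and no more expensive per response, and strictly better on at least one axis. The operating points below are hypothetical, not figures from the paper.

```python
from typing import NamedTuple

class OperatingPoint(NamedTuple):
    accuracy: float        # average benchmark accuracy (higher is better)
    response_flops: float  # compute spent per response (lower is better)

def pareto_dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b on both axes and strictly better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.response_flops <= b.response_flops
    strictly_better = a.accuracy > b.accuracy or a.response_flops < b.response_flops
    return no_worse and strictly_better

# Hypothetical operating points for a curated vs. matched-compute baseline model.
curated_2b  = OperatingPoint(accuracy=62.4, response_flops=1.8e12)
baseline_2b = OperatingPoint(accuracy=55.0, response_flops=2.1e12)
assert pareto_dominates(curated_2b, baseline_2b)
```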

Load-bearing premise

That the reported gains are produced solely by the data curation pipeline and not by any unstated differences in training dynamics or evaluation protocols.

What would settle it

Retrain the exact baseline model on the original uncurated MAmmoTH-VL data using identical random seeds, hyperparameters, and evaluation code to check whether the performance gap disappears.
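
A minimal sketch of that control, under the assumption that training is driven by a single configuration object: the two runs share every hyperparameter, seed, and schedule, and differ only in which data manifest they read. The field names and values are hypothetical, not the paper's config schema.

```python
# Hypothetical configs for the two runs; the only field that differs is the data manifest.
base_config = {
    "model": "vlm-2b",
    "optimizer": "adamw",
    "lr_schedule": "cosine",
    "peak_lr": 1e-4,
    "global_batch_size": 512,
    "train_tokens": 100_000_000_000,  # compute matched by token count
    "seed": 1234,
}

baseline_run = {**base_config, "data_manifest": "mammoth_vl_single_image_full.jsonl"}
curated_run  = {**base_config, "data_manifest": "mammoth_vl_single_image_curated.jsonl"}

# Controlled-comparison invariant: every key except the data manifest must match.
diff = {k for k in baseline_run if baseline_run[k] != curated_run[k]}
assert diff == {"data_manifest"}, f"unexpected differences between runs: {diff}"
```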

read the original abstract

Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM performance, holding architecture, training recipe, and compute fixed and varying only the training data. Our pipeline, applied to the MAmmoTH-VL single-image subset, lifts performance by +11.7pp on average across 20 public VLM benchmarks (spanning grounding, VQA, OCR/documents, captioning, spatial/3D, counting, charts, math, brand-ID, and multi-image reasoning) and by +11.3pp on average across all nine capability axes of DatBench, our high-fidelity VLM eval suite. At 2B, our curated model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute and closes the gap to Qwen3-VL-2B to within 1.8pp at ~87x less compute, from pretraining alone. Beyond accuracy, curation delivers four further properties: (1) Reliability: per-capability std across training seeds drops by ~67% and the lift survives a 4k-to-16k context-length sweep; (2) OOD generalization: the 9-eval OOD average rises by +7.2pp, and multi-image BLINK rises by +3.09pp despite single-image-only training, with Visual Correspondence gaining +11.8pp; (3) Behavioral gains beyond benchmarks: across ~1,100 open-ended queries the curated 2B is more honest and more specific than the matched-compute baseline, and more concise and less refusal-prone than a frontier 2B reference; (4) Pareto-dominance on inference cost: at every scale (1B, 2B, 4B) the curated model raises accuracy while lowering response FLOPs vs. the matched-compute baseline, and the curated 4B matches near-frontier accuracy at 3.3x lower response FLOPs than Qwen3-VL-4B. Data curation is a high-leverage tool for building better VLMs, reaching near-frontier accuracy at up to ~150x less training compute.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that a data curation pipeline applied to the MAmmoTH-VL single-image subset, while holding VLM architecture, training recipe, and compute fixed, delivers +11.7pp average gains across 20 public benchmarks (grounding, VQA, OCR, captioning, spatial, counting, charts, math, brand-ID, multi-image) and +11.3pp across all nine DatBench axes. It further reports reduced seed-to-seed variance, improved OOD generalization (including +3.09pp on multi-image BLINK despite single-image training), more honest/specific/concise open-ended behavior, and Pareto improvements in accuracy vs. inference FLOPs at 1B/2B/4B scales, enabling near-frontier performance at up to ~150x lower training compute.

Significance. If the matched-compute attribution holds, the result is significant: it positions data curation as a high-leverage, compute-efficient lever for VLMs that can close much of the gap to frontier models without architectural or scale changes. The multi-axis evaluation (reliability, OOD, behavioral, inference cost) and consistent gains across diverse benchmarks strengthen the case beyond single-metric accuracy. The empirical design with fixed controls is a methodological strength that, if fully verified, would make the findings actionable for the field.

major comments (3)
  1. [Abstract and Methods] The central claim attributes all gains to data curation alone under identical architecture, recipe, and compute. However, the manuscript provides no explicit confirmation (e.g., hyperparameter tables, seed values, batch-construction logic, data-loading order, or optimizer details) that effective training dynamics were unchanged between baseline and curated runs. This verification is load-bearing for crediting the +11.7pp lift solely to curation rather than subtle implementation differences.
  2. [§3.2 Data Curation Pipeline] The curation criteria, quality metrics, filtering thresholds, and selection procedures are described at a high level without quantitative details, example filtered samples, or ablation on individual curation steps. This lack of specificity is load-bearing for reproducibility and for confirming that the reported gains generalize beyond the particular MAmmoTH-VL subset and are not artifacts of unstated implementation choices.
  3. [Results and Evaluation] While average improvements are highlighted, the manuscript lacks per-benchmark statistical significance tests, confidence intervals, or variance estimates across the 20 benchmarks and 9 DatBench axes. Given the breadth of capabilities tested, this weakens the ability to rule out that gains are driven by a subset of benchmarks or evaluation-protocol sensitivities.
minor comments (3)
  1. [Figures] Figure captions and Pareto plots should explicitly annotate the training compute (tokens or FLOPs) for each scale (1B/2B/4B) and reference model to facilitate direct comparison.
  2. [Abstract] The abstract introduces several acronyms (VLM, OOD, DatBench) without expansion on first use; a brief parenthetical definition would improve readability for a broad audience.
  3. [§3.2] Consider adding a short table summarizing the exact number of samples retained after each curation stage for transparency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The three major comments identify important areas where additional detail and rigor will strengthen the manuscript. We address each point below and will incorporate the suggested revisions in the next version.

read point-by-point responses
  1. Referee: [Abstract and Methods] The central claim attributes all gains to data curation alone under identical architecture, recipe, and compute. However, the manuscript provides no explicit confirmation (e.g., hyperparameter tables, seed values, batch-construction logic, data-loading order, or optimizer details) that effective training dynamics were unchanged between baseline and curated runs. This verification is load-bearing for crediting the +11.7pp lift solely to curation rather than subtle implementation differences.

    Authors: We agree that explicit verification of matched training dynamics is essential. The manuscript states in Section 3.1 that the same architecture, optimizer, learning-rate schedule, batch size, and random seeds were used for both the baseline and curated runs, with compute matched by token count. However, we did not include a consolidated hyperparameter table or low-level details such as data-loading order. In the revision we will add an appendix table listing all hyperparameters, seeds, batch-construction logic, and optimizer settings, together with a short statement confirming that the only controlled difference between runs was the training data subset. revision: yes

  2. Referee: [§3.2 Data Curation Pipeline] The curation criteria, quality metrics, filtering thresholds, and selection procedures are described at a high level without quantitative details, example filtered samples, or ablation on individual curation steps. This lack of specificity is load-bearing for reproducibility and for confirming that the reported gains generalize beyond the particular MAmmoTH-VL subset and are not artifacts of unstated implementation choices.

    Authors: We acknowledge that Section 3.2 currently presents the pipeline at a high level. To improve reproducibility we will expand the section with the exact quality metrics (CLIP similarity, caption length, OCR density, etc.), numerical filtering thresholds, and the precise selection procedure. We will also include representative examples of filtered-out and retained samples and add an ablation table quantifying the contribution of each curation step to the final gains. These additions will make the pipeline fully specified and allow readers to assess generalization beyond the MAmmoTH-VL subset. revision: yes

  3. Referee: [Results and Evaluation] While average improvements are highlighted, the manuscript lacks per-benchmark statistical significance tests, confidence intervals, or variance estimates across the 20 benchmarks and 9 DatBench axes. Given the breadth of capabilities tested, this weakens the ability to rule out that gains are driven by a subset of benchmarks or evaluation-protocol sensitivities.

    Authors: The manuscript reports average gains and notes reduced seed-to-seed variance, but does not provide per-benchmark confidence intervals or formal significance tests. In the revised version we will add per-benchmark standard deviations where multiple seeds were run, include 95% confidence intervals for the main averages, and perform paired t-tests (or Wilcoxon tests where appropriate) on the 20 benchmarks and 9 DatBench axes to establish statistical significance of the reported lifts. revision: yes
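
A sketch of the analysis the response commits to, assuming paired per-benchmark scores for the baseline and curated models: a paired t-test and a Wilcoxon signed-rank test over the 20 benchmark lifts, plus a 95% confidence interval on the mean lift. The per-benchmark numbers below are placeholders constructed to average roughly +11.7pp; they are not the paper's reported scores.

```python
import numpy as np
from scipy import stats

# Hypothetical per-benchmark accuracies for the same 20 benchmarks (paired by index).
baseline = np.array([48.1, 55.3, 60.2, 41.7, 52.9, 63.0, 44.5, 58.8, 50.1, 47.3,
                     61.4, 39.8, 56.6, 49.2, 53.7, 45.9, 62.1, 42.4, 57.5, 51.0])
curated  = baseline + np.array([12.3,  9.8, 11.0, 14.2, 10.5,  8.9, 13.1, 12.7, 11.9, 10.2,
                                 9.5, 15.0, 11.4, 12.0, 10.8, 13.6,  9.1, 14.4, 10.9, 12.5])

t_stat, t_p = stats.ttest_rel(curated, baseline)   # paired t-test on the per-benchmark lift
w_stat, w_p = stats.wilcoxon(curated, baseline)    # nonparametric alternative

# 95% confidence interval for the mean per-benchmark lift.
lift = curated - baseline
ci = stats.t.interval(0.95, df=len(lift) - 1, loc=lift.mean(), scale=stats.sem(lift))
print(f"mean lift {lift.mean():.1f}pp, 95% CI {ci}, paired-t p={t_p:.2g}, wilcoxon p={w_p:.2g}")
```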

Circularity Check

0 steps flagged

No circularity: purely empirical results from matched-compute runs

full rationale

The paper reports benchmark gains from applying a data curation pipeline to MAmmoTH-VL while holding architecture, training recipe, and compute fixed. No equations, derivations, fitted parameters, or predictions appear in the abstract or described claims. Performance lifts (+11.7pp average) are measured directly on external benchmarks rather than derived from self-referential definitions or self-citations. The central attribution to curation alone is an empirical claim open to verification via replication, not a reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the chosen benchmarks as measures of capability and on the assumption that the curation pipeline improves data quality in a generalizable way; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: the 20 public VLM benchmarks and DatBench accurately measure the intended capabilities without substantial bias or leakage.
    All reported lifts are measured using these benchmarks.

pith-pipeline@v0.9.0 · 5869 in / 1316 out tokens · 40060 ms · 2026-05-14T21:24:14.830513+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 14 internal anchors
