pith. machine review for the scientific record.

arxiv: 2604.14198 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL
keywords data mixture optimization · multimodal LLM · midtraining · proxy models · Gaussian process optimization · CLIP embeddings · task decomposition · uncertainty-aware search

The pith

MixAtlas optimizes multimodal LLM data mixtures by decomposing corpora into visual concepts and task types, improving 7B model performance by up to 17.6%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MixAtlas as a method to search for effective data mixtures during midtraining of multimodal large language models. It splits the corpus into ten image concept clusters identified from CLIP embeddings and five task supervision categories including captioning, OCR, grounding, detection, and visual question answering. A Gaussian-process surrogate paired with GP-UCB acquisition then explores this mixture space using only the training budget of 0.5B proxy models. When the resulting recipes are applied to 7B-scale models, they deliver higher average benchmark scores and reach equivalent training loss in up to half the steps compared with prior single-dimension tuning approaches. The discovered mixtures also transfer across different Qwen model families.
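To make the first axis concrete, here is a minimal sketch of how the ten visual-concept clusters could be derived from CLIP embeddings. The paper fixes the embedding model (CLIP) and the cluster count (10); the choice of k-means, the normalization step, and the helper names below are illustrative assumptions, not the authors' pipeline.

```python
# Sketch: deriving visual-concept clusters from CLIP image embeddings.
# Assumptions (not from the paper): k-means clustering, L2-normalized
# features, and random stand-in embeddings. Only the embedding model
# (CLIP) and the cluster count (10) come from the paper.
import numpy as np
from sklearn.cluster import KMeans

N_CLUSTERS = 10  # the paper's fixed number of visual-domain clusters

def cluster_corpus(clip_embeddings: np.ndarray) -> np.ndarray:
    """Assign each image to one of N_CLUSTERS visual concepts.

    clip_embeddings: (num_images, dim) array of CLIP image features.
    Returns one integer cluster id per image.
    """
    # Normalize so Euclidean k-means approximates cosine similarity,
    # the usual metric for CLIP features (an assumption, not stated).
    normed = clip_embeddings / np.linalg.norm(clip_embeddings, axis=1, keepdims=True)
    km = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0)
    return km.fit_predict(normed)

# Toy usage with random stand-in embeddings:
fake_embeddings = np.random.randn(1000, 512)
cluster_ids = cluster_corpus(fake_embeddings)
```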

Core claim

MixAtlas decomposes the multimodal training corpus along two axes: ten visual-domain clusters from CLIP embeddings and five objective types (captioning, OCR, grounding, detection, VQA). It then uses a Gaussian-process surrogate model with GP-UCB acquisition to search the resulting mixture space with the same proxy budget as regression baselines. The optimized mixtures, when scaled to Qwen2-7B and Qwen2.5-7B training, produce average performance gains of 8.5–17.6% and 1.0–3.3% respectively over the strongest baseline while reaching the same training loss in up to half as many steps; the recipes transfer across model families.

What carries the argument

Two-axis corpus decomposition into 10 CLIP-derived image concept clusters and 5 task objective types, searched via Gaussian-process surrogate with GP-UCB acquisition on 0.5B proxy models.
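As a sketch of that search loop (under assumptions, not the authors' code): mixtures are taken as simplex weights over the 10 × 5 = 50 concept-task cells, the proxy evaluation is a stand-in function, and the surrogate is scikit-learn's GP regressor. The GP-UCB rule itself, picking the candidate that maximizes predicted mean plus an exploration bonus proportional to predictive standard deviation, is standard.

```python
# Sketch of GP-UCB mixture search on proxy models. Assumptions: mixtures
# are simplex weights over the 10 x 5 = 50 concept-task cells, and
# `train_proxy_and_score` stands in for training a 0.5B proxy on a
# mixture and returning its validation score (here a synthetic objective
# so the sketch runs end to end).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

DIM = 10 * 5   # visual-concept clusters x task-supervision types
BETA = 2.0     # UCB exploration weight (an assumed value)

def train_proxy_and_score(mixture: np.ndarray) -> float:
    # Stand-in objective; in the paper this is a 0.5B proxy training run.
    target = np.full(DIM, 1.0 / DIM)
    return -float(np.sum((mixture - target) ** 2))

def sample_mixtures(n: int, rng: np.random.Generator) -> np.ndarray:
    # Uniform candidates on the simplex via a flat Dirichlet.
    return rng.dirichlet(np.ones(DIM), size=n)

def gp_ucb_search(n_init: int = 8, n_rounds: int = 24, seed: int = 0):
    rng = np.random.default_rng(seed)
    X = sample_mixtures(n_init, rng)                 # initial design
    y = np.array([train_proxy_and_score(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_rounds):
        gp.fit(X, y)
        cands = sample_mixtures(2048, rng)
        mu, sigma = gp.predict(cands, return_std=True)
        pick = cands[np.argmax(mu + BETA * sigma)]   # GP-UCB acquisition
        X = np.vstack([X, pick])
        y = np.append(y, train_proxy_and_score(pick))
    return X[int(np.argmax(y))], float(y.max())

best_mixture, best_score = gp_ucb_search()
```

In the paper each evaluation is a full 0.5B proxy training run, so the loop's sample efficiency, not its arithmetic, is what lets the search stay within the same proxy budget as the regression baselines.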

If this is right

  • Optimized mixtures reach baseline-equivalent training loss in up to half the steps.
  • Recipes discovered on 0.5B proxies transfer to 7B training across Qwen model families.
  • Performance gains appear across visual understanding, document reasoning, and multimodal reasoning benchmarks.
  • The two-axis decomposition enables inspection, adaptation, and reuse of data recipes on new corpora.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy-plus-surrogate strategy could be applied to other training stages such as pretraining or fine-tuning to reduce experimentation cost.
  • The discovered mixtures may highlight which visual concepts or task types contribute most to particular downstream capabilities.
  • Extending the decomposition to additional axes such as language or resolution could further refine mixture search in higher-dimensional spaces.

Load-bearing premise

That rankings and performance gains measured on 0.5B proxy models reliably predict the rankings and gains that appear when the same mixtures are used to train 7B-scale models.

What would settle it

Train a 7B model from scratch with the MixAtlas mixture and a strong baseline mixture for the same number of steps, then compare their final average scores across the ten evaluation benchmarks; if the MixAtlas mixture scores lower, the transfer assumption fails.
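Stated as code, the decision rule is simple. The arrays below are random placeholders, not numbers from the paper, and a real test would also report run-to-run variance (the referee's first objection below).

```python
# Sketch of the settling experiment's decision rule. Scores are random
# placeholders, NOT the paper's results: substitute per-benchmark scores
# from 7B runs trained for the same number of steps.
import numpy as np

rng = np.random.default_rng(0)
mixatlas_scores = rng.uniform(40, 80, size=10)  # 10 evaluation benchmarks
baseline_scores = rng.uniform(40, 80, size=10)

gap = mixatlas_scores.mean() - baseline_scores.mean()
print(f"average gap: {gap:+.2f} points")
if gap < 0:
    print("transfer assumption fails: MixAtlas mixture scores lower")
```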

Original abstract

Domain reweighting can improve sample efficiency and downstream generalization, but data-mixture optimization for multimodal midtraining remains largely unexplored. Current multimodal training recipes tune mixtures along a single dimension, typically data format or task type. We introduce MixAtlas, a method that produces benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora. MixAtlas decomposes the training corpus along two axes: image concepts (10 visual-domain clusters discovered via CLIP embeddings) and task supervision (5 objective types including captioning, OCR, grounding, detection, and VQA). Using small proxy models (Qwen2-0.5B) paired with a Gaussian-process surrogate and GP-UCB acquisition, MixAtlas searches the resulting mixture space with the same proxy budget as regression-based baselines but finds better-performing mixtures. We evaluate on 10 benchmarks spanning visual understanding, document reasoning, and multimodal reasoning. On Qwen2-7B, optimized mixtures improve average performance by 8.5%-17.6% over the strongest baseline; on Qwen2.5-7B, gains are 1.0%-3.3%. Both settings reach baseline-equivalent training loss in up to 2 times fewer steps. Recipes discovered on 0.5B proxies transfer to 7B-scale training across Qwen model families.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces MixAtlas, which decomposes multimodal training corpora into 10 visual-domain clusters (via CLIP) and 5 task-supervision types, then uses 0.5B proxy models with a Gaussian-process surrogate and GP-UCB acquisition to search the mixture space. It claims that the resulting recipes, when applied to Qwen2-7B and Qwen2.5-7B, deliver 8.5–17.6% and 1.0–3.3% average gains over the strongest baseline across 10 benchmarks, reach equivalent training loss in up to half as many steps, and transfer across Qwen families.

Significance. If the proxy-to-7B transfer is shown to preserve rankings and the reported gains are statistically reliable, the work would provide a practical, inspectable method for data-mixture optimization that improves sample efficiency in multimodal midtraining beyond single-axis tuning or regression baselines.

major comments (3)
  1. [Abstract] The reported 8.5%–17.6% and 1.0%–3.3% gains are stated without error bars, the number of runs, or any statistical test, so it is impossible to determine whether the differences are reliable or could arise from training variance.
  2. [Abstract] No results are shown establishing that the GP-UCB mixtures outperform regression baselines already at the 0.5B proxy scale; without this intermediate validation, the claim that proxy optimization adds value for 7B transfer rests on an untested assumption.
  3. [Abstract] The transfer assertion (recipes discovered on 0.5B proxies remain superior at 7B) is not accompanied by any cross-scale ranking correlation, ordering-preservation metric, or ablation that would confirm the proxy search succeeded in identifying transferable mixtures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on statistical reliability, proxy validation, and transfer evidence.

Point-by-point responses
  1. Referee: [Abstract] The reported 8.5%–17.6% and 1.0%–3.3% gains are stated without error bars, the number of runs, or any statistical test, so it is impossible to determine whether the differences are reliable or could arise from training variance.

    Authors: We agree that the abstract would benefit from additional context on result reliability. The full manuscript reports all main results as averages over three independent training runs, with standard deviations provided in the experimental tables. We will revise the abstract to note the number of runs and indicate that the reported gains exceed the observed run-to-run variance. revision: yes

  2. Referee: [Abstract] No results are shown establishing that the GP-UCB mixtures outperform regression baselines already at the 0.5B proxy scale; without this intermediate validation, the claim that proxy optimization adds value for 7B transfer rests on an untested assumption.

    Authors: Section 4.2 of the manuscript already compares GP-UCB against regression baselines at the 0.5B proxy scale under identical budgets, showing consistent outperformance on the proxy validation metric. To make this explicit in the abstract, we will add a concise statement referencing the proxy-scale superiority before discussing 7B transfer. revision: partial

  3. Referee: [Abstract] The transfer assertion (recipes discovered on 0.5B proxies remain superior at 7B) is not accompanied by any cross-scale ranking correlation, ordering-preservation metric, or ablation that would confirm the proxy search succeeded in identifying transferable mixtures.

    Authors: We acknowledge that an explicit cross-scale analysis would strengthen the transfer claim. While the main results demonstrate superior 7B performance from proxy-optimized mixtures, we will add a new ablation in the revised manuscript that reports rank correlation (e.g., Spearman) between proxy and 7B performance orderings across the evaluated mixtures. revision: yes
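The check promised in response 3 is easy to state precisely. A minimal sketch, assuming one score per candidate mixture at each scale (the values below are illustrative stand-ins, not reported numbers):

```python
# Sketch of the promised cross-scale transfer check: Spearman rank
# correlation between 0.5B-proxy and 7B orderings of the same candidate
# mixtures. Score values are illustrative stand-ins.
from scipy.stats import spearmanr

proxy_scores = [0.41, 0.47, 0.39, 0.52, 0.44]  # one per candidate mixture
full_scores  = [55.2, 58.9, 54.1, 60.3, 57.0]  # same mixtures at 7B scale

rho, pvalue = spearmanr(proxy_scores, full_scores)
print(f"Spearman rho = {rho:.2f} (p = {pvalue:.3f})")
# rho near 1 supports the load-bearing premise that proxy rankings
# predict 7B rankings; rho near 0 would undercut the transfer claim.
```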

Circularity Check

0 steps flagged

No significant circularity; empirical optimization with external validation

Full rationale

The paper describes a standard Bayesian optimization pipeline (GP surrogate + GP-UCB acquisition) run on 0.5B proxy models to search a discrete mixture space, followed by direct evaluation of the resulting recipes on 7B-scale models. No derivation chain, equations, or fitted parameters are presented that reduce a claimed prediction to an input by construction. Performance gains and loss curves are reported from held-out benchmark evaluations rather than from any self-referential fit. The proxy-to-target transfer is an empirical claim subject to external falsification, not a tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that the chosen 10-cluster plus 5-task decomposition spans the relevant variation in multimodal data, and that 0.5B proxy performance rankings transfer to 7B models. No new physical entities are postulated. The two numbers 10 and 5 function as chosen discretization parameters.

free parameters (2)
  • number of visual-domain clusters
    Chosen after CLIP embedding clustering; set to 10 in the reported experiments.
  • number of task supervision types
    Fixed at 5 (captioning, OCR, grounding, detection, VQA).
axioms (2)
  • domain assumption: Performance ordering on 0.5B proxies predicts ordering on 7B models for mixture optimization
    Required for the transfer claim from proxy search to full-scale training.
  • domain assumption: The 10 visual clusters and 5 task types form a sufficient basis for the mixture space
    Underpins the entire decomposition and search procedure.

pith-pipeline@v0.9.0 · 5563 in / 1637 out tokens · 60937 ms · 2026-05-13T20:03:33.749190+00:00 · methodology

discussion (0)

