OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3
The pith
Scalable distillation lets 4B multimodal models match or beat 8B baselines on reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniThoughtVis is a scalable data curation and distillation pipeline that generates structured CoT traces from teacher models, performs joint annotation of reasoning difficulty, answer quality, and semantic task tags, and combines rule-based filtering, difficulty-aware selection, and tag-based diversity sampling to create a controllable 1.8M-sample corpus. Distilling Qwen3-VL models from 2B to 8B parameters on this data produces consistent gains across scales, including up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model; notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks.
What carries the argument
The OmniThoughtVis pipeline, which generates structured chain-of-thought traces, adds multi-dimensional annotations, and applies staged filtering plus diversity sampling to produce high-quality, controllable training data for smaller multimodal models.
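The paper describes these stages only at a high level; to make the composition concrete, here is a minimal sketch of how the three curation stages could chain together. Every field name (`answer_verified`, `difficulty`, `tag`), threshold, and the per-tag cap below is a hypothetical stand-in, not the paper's actual configuration.

```python
from collections import defaultdict
import random

def curate(samples, target_size, difficulty_range=(2, 5), per_tag_cap=0.05):
    """Hypothetical three-stage curation: rule filter -> difficulty window
    -> tag-balanced diversity sampling. All thresholds are illustrative."""
    # Stage 1: rule-based filtering (e.g., malformed or trivial CoT,
    # unverifiable final answers).
    pool = [s for s in samples
            if s["answer_verified"] and s["cot"].strip() and len(s["cot"]) > 50]

    # Stage 2: difficulty-aware selection keeps a window of annotated
    # difficulty, dropping both trivial and effectively unsolvable items.
    lo, hi = difficulty_range
    pool = [s for s in pool if lo <= s["difficulty"] <= hi]

    # Stage 3: tag-based diversity sampling caps any semantic task tag's
    # share so no single domain dominates the corpus.
    by_tag = defaultdict(list)
    for s in pool:
        by_tag[s["tag"]].append(s)
    cap = max(1, int(target_size * per_tag_cap))
    curated = []
    for items in by_tag.values():
        random.shuffle(items)
        curated.extend(items[:cap])
    random.shuffle(curated)
    return curated[:target_size]
```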
If this is right
- Distilled models show consistent performance improvements across parameter scales from 2B to 8B.
- The 4B distilled model reaches or exceeds the 8B baseline on multiple multimodal reasoning benchmarks.
- The curated corpus supports controllable subset construction for different training needs.
- Reasoning distillation provides a route to high-performance models that fit within deployment resource limits.
Where Pith is reading between the lines
- If the pipeline maintains transferable reasoning quality, the same curation approach could be applied to other vision-language or language-only reasoning tasks.
- Lowering model size while preserving benchmark performance could reduce inference costs and latency in production multimodal systems.
- The results suggest data curation quality may matter more than raw model scale for certain multimodal reasoning capabilities.
Load-bearing premise
The chain-of-thought traces from the teacher models remain high-quality and free of systematic errors or benchmark leakage after the filtering and selection rules are applied.
What would settle it
Training a 4B model on the same teacher outputs but without the rule-based filtering, difficulty selection, and diversity sampling steps would settle it: if the ablated model no longer matches or exceeds the 8B baseline on the reported benchmarks, the curation steps are driving the gains; if it still does, the gains come from the raw teacher traces and the curation steps are not responsible.
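Concretely, that experiment is a small grid over the three curation switches, holding the teacher outputs, the 4B student, and the evaluation protocol fixed. A hypothetical configuration sketch (the flag names are stand-ins, not the paper's code):

```python
# Ablation arms isolating the curation steps; "full_pipeline" vs.
# "raw_teacher" measures the total curation effect, the middle arms
# attribute it to individual stages.
ABLATIONS = {
    "full_pipeline":  dict(rule_filter=True,  difficulty_select=True,  diversity_sample=True),
    "no_rule_filter": dict(rule_filter=False, difficulty_select=True,  diversity_sample=True),
    "no_difficulty":  dict(rule_filter=True,  difficulty_select=False, diversity_sample=True),
    "no_diversity":   dict(rule_filter=True,  difficulty_select=True,  diversity_sample=False),
    "raw_teacher":    dict(rule_filter=False, difficulty_select=False, diversity_sample=False),
}
```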
Original abstract
Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering, difficulty-aware selection, and tag-based diversity sampling, resulting in a curated corpus of 1.8M samples that supports controllable subset construction for downstream training. We use OmniThoughtVis to distill Qwen3-VL models from 2B to 8B parameters and evaluate them on nine multimodal reasoning benchmarks. The resulting distilled models show consistent gains across model scales, including improvements of up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model. Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks, highlighting the practical value of scalable reasoning distillation for deployment-oriented MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniThoughtVis, a scalable distillation pipeline that starts from an open-source multimodal seed pool, uses high-capacity teacher models to generate structured CoT traces with joint annotations for difficulty, answer quality, and semantic tags, applies rule-based filtering plus difficulty-aware and tag-based sampling to curate 1.8M samples, and then distills this data into Qwen3-VL models ranging from 2B to 8B parameters. On nine multimodal reasoning benchmarks, the distilled models show consistent gains, with the 4B model achieving up to +16.8 points on MathVerse and +5.6 on MMMU-Pro while matching or surpassing the undistilled 8B baseline on several tasks.
Significance. If the reported gains are not attributable to undetected benchmark leakage or low-quality CoT transfer, the work demonstrates a practical, controllable method for scaling high-quality multimodal reasoning supervision to deployment-friendly model sizes. The 1.8M-sample corpus and cross-scale improvements (particularly the 4B outperforming 8B on select tasks) would be a useful contribution to efficient MLLM deployment.
major comments (3)
- [§3 (Data Curation Pipeline), filtering and sampling subsection] The rule-based filtering combined with difficulty-aware and tag-based selection is presented as sufficient to remove low-quality data, but no quantitative analysis or thresholds are provided for detecting semantic or paraphrased overlap with test sets from benchmarks such as MathVerse and MMMU-Pro. Given that the teacher models are high-capacity and the seed pool is open-source, this leaves open the possibility that the +16.8 point gain on MathVerse reflects partial memorization rather than reasoning transfer.
- [§4 (Experiments and Results), baseline comparison paragraph] The claim that the distilled 4B model matches or surpasses the undistilled 8B baseline is load-bearing for the practical-value argument, yet the manuscript does not specify whether the 8B baseline was evaluated under identical prompting, decoding, or data conditions, nor whether it received any of the curated CoT data. This ambiguity affects interpretation of the cross-scale results.
- [§4 (Experiments), CoT quality validation] The pipeline relies on rule-based filtering and answer-quality annotation, but the manuscript provides no additional validation (e.g., human evaluation on a held-out subset, inter-annotator agreement, or comparison against gold CoT) to confirm that the 1.8M traces are high-quality and free of systematic teacher biases that could be transferred to the student models.
minor comments (2)
- [Abstract and §4] The abstract and §4 results tables would benefit from explicit listing of all nine benchmarks and the exact baseline configurations (model versions, prompting strategies) used for the undistilled 8B comparisons.
- [§3] Notation for the difficulty and tag annotations is introduced procedurally but lacks a concise formal definition or pseudocode that would aid reproducibility; a hypothetical sketch of what such pseudocode could look like follows this list.
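For illustration only, the joint annotation pass might be written as follows. Every interface here (`teacher.solve`, the judge calls, the 1-5 difficulty binning, and the tag set) is a hypothetical stand-in, since the paper does not specify them.

```python
def annotate(sample, teacher, judge, n_attempts: int = 8):
    """Hypothetical joint annotation pass. `teacher` and `judge` are
    stand-in model wrappers; the paper defines these steps only
    procedurally, so every interface and scale here is illustrative."""
    # Structured CoT trace from the high-capacity teacher.
    trace = teacher.solve(sample["image"], sample["question"])

    # Difficulty as (1 - teacher pass rate) binned to 1-5, a common proxy;
    # the paper's actual definition is not given.
    passes = sum(
        judge.matches(teacher.solve(sample["image"], sample["question"]),
                      sample["answer"])
        for _ in range(n_attempts)
    )
    difficulty = 1 + round(4 * (1 - passes / n_attempts))

    # Answer quality: judge-scored correctness/coherence of the trace.
    quality = judge.score(trace, sample["answer"])  # e.g., a value in [0, 1]

    # Semantic task tag: closed-set classification of the question type.
    tag = judge.classify(sample["question"],
                         labels=["chart", "geometry", "ocr", "science", "other"])

    return dict(cot=trace, difficulty=difficulty, quality=quality, tag=tag)
```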
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications and indicate the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [§3 (Data Curation Pipeline), filtering and sampling subsection] The rule-based filtering combined with difficulty-aware and tag-based selection is presented as sufficient to remove low-quality data, but no quantitative analysis or thresholds are provided for detecting semantic or paraphrased overlap with test sets from benchmarks such as MathVerse and MMMU-Pro. Given that the teacher models are high-capacity and the seed pool is open-source, this leaves open the possibility that the +16.8 point gain on MathVerse reflects partial memorization rather than reasoning transfer.
Authors: We appreciate this important concern regarding potential data leakage. While our pipeline includes rule-based filtering to remove samples with direct string matches to known test sets where possible, we acknowledge that we did not perform a comprehensive quantitative analysis such as embedding-based similarity search or n-gram overlap statistics across the entire 1.8M corpus and the benchmark test sets. In the revised manuscript, we will add a dedicated subsection under §3 detailing the filtering steps more explicitly, including any overlap detection methods applied, and report overlap statistics (e.g., percentage of samples with high semantic similarity to test examples). We believe the gains are primarily due to reasoning transfer, as improvements are observed across diverse benchmarks and model scales, but we will strengthen the manuscript with this analysis to address the possibility of memorization. revision: yes
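One inexpensive form of the promised overlap statistics is a long n-gram collision check between each training question and the benchmark test splits. A minimal sketch follows; the 13-gram window echoes common decontamination practice, and the removal rule is illustrative, not the paper's protocol. Note that string-level checks miss paraphrased or image-level leakage, which is what the promised embedding-based pass would have to cover.

```python
def ngram_overlap(candidate: str, test_questions: list[str], n: int = 13) -> float:
    """Fraction of the candidate's word n-grams that collide with any test
    question. Window size and the downstream removal rule are illustrative."""
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    test_grams = set()
    for q in test_questions:
        test_grams |= ngrams(q)
    cand = ngrams(candidate)
    return len(cand & test_grams) / len(cand) if cand else 0.0

# Flag a training sample for removal if it shares any long n-gram with a
# benchmark test question (hypothetical corpus/test-set variables):
# flagged = [s for s in corpus if ngram_overlap(s["question"], mathverse_test) > 0]
```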
- Referee: [§4 (Experiments and Results), baseline comparison paragraph] The claim that the distilled 4B model matches or surpasses the undistilled 8B baseline is load-bearing for the practical-value argument, yet the manuscript does not specify whether the 8B baseline was evaluated under identical prompting, decoding, or data conditions, nor whether it received any of the curated CoT data. This ambiguity affects interpretation of the cross-scale results.
Authors: We agree that clarity on the baseline evaluation is essential. The undistilled 8B baseline refers to the original Qwen3-VL-8B model without any fine-tuning on our curated OmniThoughtVis data. All models, including the baselines, were evaluated using the same prompting templates, decoding parameters (e.g., temperature=0 for deterministic outputs where applicable), and evaluation protocols as described in §4. We will revise the baseline comparison paragraph to explicitly state these details and confirm that the 8B model did not receive the distilled CoT data, thereby clarifying that the comparison highlights the efficiency of distillation. revision: yes
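As a sketch of what "identical conditions" means operationally, a single shared evaluation configuration would be applied to distilled and baseline models alike. The wrapper and parameter names below are illustrative (HF-style `generate()` kwargs), not the paper's released harness.

```python
# Shared evaluation configuration applied identically to distilled models
# and the undistilled baselines.
EVAL_CONFIG = dict(
    do_sample=False,        # greedy decoding, i.e., temperature effectively 0
    max_new_tokens=2048,
    prompt_template="{question}\nThink step by step, then give the final answer.",
)

def evaluate(model, benchmark, cfg=EVAL_CONFIG):
    """Accuracy under one fixed prompting/decoding protocol."""
    correct = 0
    for ex in benchmark:
        prompt = cfg["prompt_template"].format(question=ex["question"])
        pred = model.generate(ex["image"], prompt,
                              do_sample=cfg["do_sample"],
                              max_new_tokens=cfg["max_new_tokens"])
        correct += ex["check"](pred)  # benchmark-specific answer matcher
    return correct / len(benchmark)
```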
- Referee: [§4 (Experiments), CoT quality validation] The pipeline relies on rule-based filtering and answer-quality annotation, but the manuscript provides no additional validation (e.g., human evaluation on a held-out subset, inter-annotator agreement, or comparison against gold CoT) to confirm that the 1.8M traces are high-quality and free of systematic teacher biases that could be transferred to the student models.
Authors: Thank you for highlighting this gap in validation. Our pipeline uses automated answer-quality annotation based on teacher model confidence and rule-based checks for coherence, but we did not include human evaluation or inter-annotator agreement metrics in the original submission. In the revision, we will add a new subsection in §4 describing a human validation study on a random subset of 500 samples, where annotators assessed CoT quality, logical consistency, and absence of biases, reporting agreement scores. This will provide stronger evidence for the quality of the distilled traces. revision: yes
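The agreement statistic for such a study is standard to compute; a minimal sketch of Cohen's kappa over paired per-sample judgments (the binary "is this trace logically consistent?" scheme is a hypothetical example of what annotators might score):

```python
from itertools import combinations

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two annotators rating the same items, e.g. binary
    'is this CoT trace logically consistent?' judgments."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n                     # observed
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(annotations: dict[str, list[int]]) -> float:
    """Average pairwise kappa across annotators on the audited subset."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)
```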
Circularity Check
No circularity: empirical distillation pipeline with external benchmark evaluation
Full rationale
The paper describes a procedural pipeline for generating, filtering, and selecting CoT traces from teacher models, then trains smaller MLLMs and reports benchmark scores. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. The central claims (e.g., +16.8 on MathVerse for the 4B model) are direct empirical measurements on standard external benchmarks after training on the curated 1.8M samples; they do not reduce to quantities defined by the pipeline's own rules or inputs. The filtering and selection steps are heuristic but serve as preprocessing, not as tautological definitions of the final performance metrics.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: High-capacity teacher MLLMs generate reliable, structured multimodal CoT traces suitable for distillation.
- ad hoc to paper: Rule-based filtering combined with difficulty-aware and tag-based sampling removes low-quality data while preserving diversity and utility.