Pith · machine review for the scientific record

arxiv: 2411.10442 · v2 · submitted 2024-11-15 · 💻 cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:12 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords multimodal large language models · preference optimization · chain-of-thought reasoning · MathVista · distribution shifts · MMPR dataset · InternVL2

The pith

Mixed Preference Optimization lifts an 8B multimodal model to match a 76B model on MathVista reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that distribution shifts after pre-training and supervised fine-tuning limit chain-of-thought reasoning in multimodal large language models. The authors address this by building an automated pipeline that generates a large-scale preference dataset called MMPR and then applying a Mixed Preference Optimization step to the models. A sympathetic reader would care because the resulting InternVL2-8B-MPO reaches 67.0 accuracy on MathVista, closing most of the gap to a model ten times larger without extra parameters. The work demonstrates that preference optimization can be integrated directly with MLLMs to recover reasoning performance lost to distributional mismatch.

Core claim

We introduce an automated preference data construction pipeline that creates the MMPR dataset and a Mixed Preference Optimization (MPO) method that integrates preference optimization with MLLMs. This process enhances multimodal chain-of-thought performance, so that InternVL2-8B-MPO achieves an accuracy of 67.0 on MathVista, outperforming the base InternVL2-8B by 8.7 points and matching the 10x larger InternVL2-76B.

What carries the argument

Mixed Preference Optimization (MPO), a post-training method that combines preference optimization with MLLMs using the automatically constructed MMPR preference dataset to improve multimodal chain-of-thought reasoning.
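The review describes the mixed objective only at a high level. A minimal sketch of what such a mixture could look like, assuming it weights a DPO-style preference term, a BCO-style quality term, and an SFT-style generation term over sequence log-probabilities; the helper names, default weights, and scalar interface below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_term(logp_c, logp_r, ref_c, ref_r, beta=0.1):
    # DPO-style preference loss: -log sigmoid(beta * (policy margin - reference margin))
    margin = (logp_c - ref_c) - (logp_r - ref_r)
    return -math.log(sigmoid(beta * margin))

def bco_term(logp, ref, label, beta=0.1):
    # BCO-style quality loss: classify one response as chosen (1) or rejected (0)
    p = sigmoid(beta * (logp - ref))
    return -math.log(p if label == 1 else 1.0 - p)

def sft_term(logp_c, n_tokens):
    # generation loss: mean negative log-likelihood of the chosen response
    return -logp_c / n_tokens

def mixed_preference_loss(logp_c, logp_r, ref_c, ref_r, n_tokens,
                          w_p=0.8, w_q=0.1, w_g=0.1):
    # illustrative weighted sum of preference, quality, and generation terms
    l_p = dpo_term(logp_c, logp_r, ref_c, ref_r)
    l_q = 0.5 * (bco_term(logp_c, ref_c, 1) + bco_term(logp_r, ref_r, 0))
    l_g = sft_term(logp_c, n_tokens)
    return w_p * l_p + w_q * l_q + w_g * l_g
```

The point of the mixture is that the preference term alone only shapes relative margins; the generation term keeps the absolute likelihood of the chosen chain-of-thought from collapsing during optimization.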

Load-bearing premise

The automated preference data construction pipeline produces high-quality, unbiased multimodal reasoning examples that effectively mitigate distribution shifts without introducing new artifacts that degrade performance.

What would settle it

If applying MPO to InternVL2-8B produces no gain or a drop in MathVista accuracy relative to the base InternVL2-8B model, the central claim would be falsified.

Original abstract

Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach enhances the multimodal reasoning abilities of both InternVL2-8B and InternVL2-76B. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10$\times$ larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Mixed Preference Optimization (MPO) for multimodal large language models to address distribution shifts limiting Chain-of-Thought reasoning. It presents an automated pipeline for constructing the large-scale MMPR multimodal reasoning preference dataset by generating pairs from CoT trajectories via an LLM judge, then applies MPO to InternVL2 models. The central empirical claim is that InternVL2-8B-MPO reaches 67.0 accuracy on MathVista (+8.7 over the base InternVL2-8B) and matches the performance of the 10x larger InternVL2-76B, with code, data, and models released.
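The pair-generation step summarized above can be sketched as follows. This is a simplified stand-in for the paper's pipeline, assuming the correctness-based branch in which sampled CoT rollouts are scored against a known final answer; the answer-extraction heuristic and all names here are invented for illustration:

```python
def extract_answer(cot: str) -> str:
    # naive final-answer extraction: the text after the last "Answer:" marker
    marker = "Answer:"
    return cot.rsplit(marker, 1)[-1].strip() if marker in cot else ""

def build_preference_pairs(question, rollouts, gold_answer):
    # split sampled chain-of-thought rollouts into chosen (correct final
    # answer) and rejected (incorrect), then pair every chosen with every rejected
    chosen = [r for r in rollouts if extract_answer(r) == gold_answer]
    rejected = [r for r in rollouts if extract_answer(r) != gold_answer]
    return [
        {"question": question, "chosen": c, "rejected": r}
        for c in chosen for r in rejected
    ]

rollouts = [
    "The area is 3*4.\nAnswer: 12",
    "The area is 3+4.\nAnswer: 7",
    "Multiply the sides: 3*4.\nAnswer: 12",
]
# two correct rollouts x one incorrect rollout -> two preference pairs
pairs = build_preference_pairs("Area of a 3x4 rectangle?", rollouts, "12")
```

Cross-pairing every correct rollout with every incorrect one is one design choice; deduplication and length balancing would matter at scale but are omitted here.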

Significance. If the reported gains prove causal to MPO rather than data artifacts, the work offers a scalable, open-source route to stronger multimodal reasoning without model scaling. The public release of the MMPR dataset and MPO implementation is a concrete community asset that enables follow-up ablations and extensions in preference optimization for vision-language models.

major comments (3)
  1. [Section 3] Section 3 (data construction pipeline): the description of MMPR generation via LLM judge on CoT trajectories provides no quantitative checks for test-set overlap with MathVista, no human-verified error rate on chosen/rejected pairs, and no contamination analysis. This directly undermines the causal claim that the +8.7 point MathVista gain stems from MPO rather than leakage or label noise.
  2. [Experimental results] Experimental results section: no ablation isolating MPO from supervised fine-tuning on the same MMPR data is reported. Without this comparison, it remains unclear whether the performance delta is attributable to the mixed preference objective or simply to additional multimodal CoT training data.
  3. [Results] Results tables and text: benchmark numbers (e.g., 67.0 on MathVista) are presented without multiple-run statistics, standard deviations, or significance tests, leaving the robustness of the headline improvement only moderately supported.
minor comments (2)
  1. [Methods] Clarify the precise formulation of 'Mixed' Preference Optimization (e.g., how the mixing coefficient or loss terms are defined) in the methods section, as the abstract description is high-level.
  2. [Related Work] Add explicit references to prior multimodal preference optimization works in the related-work section to better situate MPO.
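Major comment 1 asks for contamination checks that the reviewed version does not report. A minimal version of such a check, normalized word n-gram overlap between training prompts and benchmark questions, could look like this; the n-gram length and the normalization are illustrative choices, and the sample strings are fabricated:

```python
import re

def ngrams(text: str, n: int = 8) -> set:
    # lowercase, strip punctuation, and collect word n-grams
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlaps(train_prompts, test_questions, n=8):
    # flag any training prompt sharing at least one n-gram with a test question
    test_grams = set()
    for q in test_questions:
        test_grams |= ngrams(q, n)
    return [p for p in train_prompts if ngrams(p, n) & test_grams]

test_qs = ["What is the value of x in the diagram if the angle is 40 degrees?"]
train = [
    "Compute the value of x in the diagram if the angle is 40 degrees, showing work.",
    "Describe the chart trend between 2010 and 2020.",
]
flagged = flag_overlaps(train, test_qs)  # only the first prompt is flagged
```

An exact-match or n-gram audit like this catches verbatim leakage; paraphrase-level contamination would need embedding similarity on top.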

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each of the major comments point-by-point below and have revised the paper accordingly to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (data construction pipeline): the description of MMPR generation via LLM judge on CoT trajectories provides no quantitative checks for test-set overlap with MathVista, no human-verified error rate on chosen/rejected pairs, and no contamination analysis. This directly undermines the causal claim that the +8.7 point MathVista gain stems from MPO rather than leakage or label noise.

    Authors: We appreciate the referee's concern regarding potential data leakage or noise in the MMPR dataset. The construction pipeline in Section 3 generates preference pairs from CoT trajectories using sources that are designed to be disjoint from MathVista. However, to directly address this, we have conducted additional analyses and will include in the revised manuscript: (1) explicit quantitative checks confirming zero overlap with the MathVista test set, (2) human verification results on a random sample of 100 chosen/rejected pairs showing an error rate below 5%, and (3) a contamination analysis. These additions will support the causal attribution to MPO. revision: yes

  2. Referee: [Experimental results] Experimental results section: no ablation isolating MPO from supervised fine-tuning on the same MMPR data is reported. Without this comparison, it remains unclear whether the performance delta is attributable to the mixed preference objective or simply to additional multimodal CoT training data.

    Authors: We agree that isolating the contribution of the MPO objective is crucial. In the revised version, we have added an ablation study in the Experimental results section that compares MPO directly against supervised fine-tuning (SFT) using the same MMPR dataset. The results show that MPO provides further improvements over SFT alone, confirming the benefit of the mixed preference optimization approach. revision: yes

  3. Referee: [Results] Results tables and text: benchmark numbers (e.g., 67.0 on MathVista) are presented without multiple-run statistics, standard deviations, or significance tests, leaving the robustness of the headline improvement only moderately supported.

    Authors: We acknowledge that reporting statistical measures would enhance the robustness of our results. We have rerun the key experiments across multiple random seeds and will update the results tables and text in the revised manuscript to include means, standard deviations, and appropriate significance tests for the reported improvements. revision: yes
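The statistical reporting promised in response 3 is cheap to produce once per-seed scores exist. A sketch using per-seed accuracies and a paired sign-flip permutation test; the accuracy numbers below are made up for illustration and are not the paper's results:

```python
import random
from statistics import mean, stdev

def paired_permutation_p(deltas, iters=10000, seed=0):
    # two-sided sign-flip test on per-seed score differences:
    # under the null, each delta's sign is equally likely to flip
    rng = random.Random(seed)
    observed = abs(mean(deltas))
    hits = 0
    for _ in range(iters):
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return hits / iters

base = [58.1, 58.6, 58.0, 58.4]   # hypothetical per-seed accuracies (base model)
mpo  = [66.8, 67.2, 66.9, 67.1]   # hypothetical per-seed accuracies (MPO model)
deltas = [m - b for m, b in zip(mpo, base)]
print(f"MPO: {mean(mpo):.1f} +/- {stdev(mpo):.2f}, "
      f"delta {mean(deltas):.1f}, p = {paired_permutation_p(deltas):.4f}")
```

With only four seeds the smallest attainable p-value from a sign-flip test is 2/2^4 = 0.125, which is itself an argument for running more seeds than the minimum.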

Circularity Check

0 steps flagged

No circularity; empirical gains on independent external benchmarks

full rationale

The paper's derivation consists of (1) an automated pipeline generating MMPR preference pairs from CoT trajectories judged by an LLM and (2) application of the MPO objective to fine-tune InternVL2 models. The headline result (67.0 on MathVista) is measured on a held-out public benchmark whose test set is not part of the MMPR construction process. No equation or claim reduces by definition to a fitted parameter, self-citation chain, or renamed input; the performance delta is an external measurement rather than a statistical tautology. Minor self-references to prior InternVL2 work exist but are not load-bearing for the MPO derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical effectiveness of preference optimization transferred to multimodal models and the assumption that the automated dataset construction yields useful training signals.

axioms (1)
  • domain assumption Preference optimization frameworks developed for text models transfer effectively to multimodal models when applied to reasoning tasks.
    The paper applies existing PO methods to MLLMs without providing new theoretical analysis of why the transfer succeeds.

pith-pipeline@v0.9.0 · 5561 in / 1213 out tokens · 42939 ms · 2026-05-16T09:12:22.032339+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

    cs.LG 2026-04 unverdicted novelty 7.0

    PND reduces object hallucination in VLMs via a dual-path contrast during decoding that amplifies visual features and penalizes linguistic priors, achieving reported SOTA results on POPE, MME, and CHAIR without retraining.

  2. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  3. Visual Preference Optimization with Rubric Rewards

    cs.CV 2026-04 unverdicted novelty 7.0

    rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

  4. Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...

  5. VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

    cs.CV 2026-04 unverdicted novelty 7.0

    VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...

  6. ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

    cs.RO 2026-02 unverdicted novelty 7.0

    ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

  7. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  8. Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    SPUR benchmark reveals that current multimodal large language models significantly underperform on expert-level perception, cross-panel understanding, and reasoning tasks with complex scientific experimental images.

  9. MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    MONETA is the first multimodal benchmark for industry classification using text and geographic sources, with MLLM baselines at 62-74% accuracy and up to 22.8% gains from multi-turn context enrichment and explanations.

  10. OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling

    cs.AI 2026-02 unverdicted novelty 6.0

    OOWM models the world as an explicit symbolic tuple with UML diagrams and trains via SFT plus GRPO to outperform text-based CoT on embodied planning benchmarks.

  11. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  12. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  13. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  14. Make Your LVLM KV Cache More Lightweight

    cs.CV 2026-05 unverdicted novelty 5.0

    LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

  15. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  16. DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

    cs.AI 2026-04 unverdicted novelty 5.0

    DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...

  17. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  18. MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

    cs.CY 2026-04 unverdicted novelty 4.0

    MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.

Reference graph

Works this paper leans on

195 extracted references · 195 canonical work pages · cited by 18 Pith papers · 35 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NIPS, 35:23716–23736, 2022. 3

  3. [3]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021. 8

  4. [4]

    A general theoretical paradigm to understand learning from human preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bi- lal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In Interna- tional Conference on Artificial Intelligence and Statistics , pages 4447–4455. PMLR, 2024. 3, 4, 7, 1

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 1, 3

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,

  7. [7]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 3

  8. [8]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In ICCV, pages 4291–4301, 2019. 4

  9. [9]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired compar- isons. Biometrika, 39(3/4):324–345, 1952. 3, 4

  10. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NIPS, 2020. 1

  11. [11]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024. 1, 3

  12. [12]

    An augmented benchmark dataset for geometric question answering through dual parallel text encoding

    Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In COLING, pages 1511–1520, 2022. 4

  13. [13]

    Mapqa: A dataset for question answering on choropleth maps

    Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler- Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022. 4

  14. [14]

    Noise contrastive alignment of language models with explicit re- wards

    Huayu Chen, Guande He, Hang Su, and Jun Zhu. Noise contrastive alignment of language models with explicit re- wards. arXiv preprint arXiv:2402.05369, 2024. 4

  15. [15]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Ed- wards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 8

  16. [16]

    M3cot: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought

    Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought. arXiv preprint arXiv:2405.16473, 2024. 4, 5, 6, 7, 2

  17. [17]

    The- oremqa: A theorem-driven question answering dataset

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. The- oremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7889–7901, 2023. 8

  18. [18]

    Dress: Instructing large vision-language models to align and interact with humans via natural lan- guage feedback

    Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. Dress: Instructing large vision-language models to align and interact with humans via natural lan- guage feedback. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition , pages 14239–14250, 2024. 2

  19. [19]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 3

  20. [20]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open- source suites. arXiv preprint arXiv:2404.16821, 2024. 1, 3, 6

  21. [21]

    Provably robust dpo: Aligning language models with noisy feedback

    Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natara- jan. Provably robust dpo: Aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409, 2024. 3, 4, 7, 1

  22. [22]

    Simple and effec- tive multi-paragraph reading comprehension

    Christopher Clark and Matt Gardner. Simple and effec- tive multi-paragraph reading comprehension. InACL, pages 845–855, 2018. 4

  23. [23]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Train- ing verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 8

  24. [24]

    Instructblip: Towards general- purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. NIPS, 36, 2024. 1

  25. [25]

    En- hancing large vision language models with self-training on image comprehension

    Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quan- quan Gu, James Zou, Kai-Wei Chang, and Wei Wang. En- hancing large vision language models with self-training on image comprehension. arXiv preprint arXiv:2405.19716 ,

  26. [26]

    Rlhf workflow: From reward mod- eling to online rlhf.arXiv preprint arXiv:2405.07863, 2024

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward mod- eling to online rlhf.arXiv preprint arXiv:2405.07863, 2024. 3

  27. [27]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  28. [28]

    G-llava: Solving geometric prob- lem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023

    Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wan- jun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric prob- lem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023. 3, 4

  29. [29]

    Learn your reference model for real good alignment

    Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656, 2024. 3, 7, 1

  30. [30]

    Making the v in vqa matter: El- evating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: El- evating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017. 4

  31. [31]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020. 8

  32. [32]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021. 8

  33. [33]

    ORPO: Monolithic Preference Optimization without Reference Model

    Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Mono- lithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2(4):5, 2024. 3, 4, 7, 1

  34. [34]

    Icdar2019 com- petition on scanned receipt ocr and information extraction

    Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthe- nis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 com- petition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019. 4

  35. [35]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019. 4

  36. [36]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly super- vised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017. 8

  37. [37]

    Binary classifier optimization for large language model alignment

    Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. arXiv preprint arXiv:2404.04656, 2024. 4, 7, 1

  38. [38]

    Dvqa: Understanding data visualizations via ques- tion answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via ques- tion answering. In CVPR, pages 5648–5656, 2018. 4

  39. [39]

    Geomverse: A systematic evaluation of large models for geometric reasoning

    Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023. 4

  40. [40]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016. 4

  41. [41]

    Natural questions: a benchmark for question answering re- search

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Ep- stein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering re- search. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 8

  42. [42]

    RACE: Large-scale ReAding Comprehension Dataset From Examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading com- prehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017. 8

  43. [43]

    Step-dpo: Step-wise preference op- timization for long-chain reasoning of llms

    Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xian- gru Peng, and Jiaya Jia. Step-dpo: Step-wise preference op- timization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024. 2, 3

  44. [44]

    Obelics: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurenc ¸on, Lucile Saulnier, L´eo Tronchon, Stas Bek- man, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. NIPS, 36, 2024. 1

  45. [45]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1, 6

  46. [46]

    Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In ICML, pages 12888–12900, 2022. 3

  47. [47]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023. 1, 3

  48. [48]

    Silkie: Preference distillation for large visual lan- guage models

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual lan- guage models. arXiv preprint arXiv:2312.10665, 2023. 2, 3

  49. [49]

    Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text

    Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024. 1, 3

  50. [50]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, pages 292–305,

  51. [51]

    Moma: Efficient early-fusion pre-training with mixture of modality-aware experts

    Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, and Armen Aghajanyan. Moma: Efficient early-fusion pre-training with mixture of modality-aware experts. arXiv preprint arXiv:2407.21770, 2024. 3

  52. [52]

    Clevr-math: A dataset for compositional language, visual and mathematical reasoning

    Adam Dahlgren Lindström and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022. 4

  53. [53]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 1, 6

  54. [54]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NIPS, 36, 2023. 1, 3

  55. [55]

    Statistical rejection sampling improves preference optimization

    Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657, 2023. 3, 4, 7, 1

  56. [56]

    Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity

    Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, et al. Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity. arXiv preprint arXiv:2407.15838, 2024. 1, 3

  57. [57]

    Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language

    Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 2023. 3

  58. [58]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 1

  59. [59]

    Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165,

  60. [60]

    Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 4

  61. [61]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NIPS, 35:2507–2521, 2022. 4

  62. [62]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 1, 2, 5, 6

  63. [63]

    Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training

    Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jifeng Dai, Yu Qiao, and Xizhou Zhu. Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. arXiv preprint arXiv:2410.08202, 2024. 3

  64. [64]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019. 4

  65. [65]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, pages 2263–2279, 2022. 4

  66. [66]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, pages 2200–2209, 2021. 4

  67. [67]

    Infographicvqa

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022. 4

  68. [68]

    Distributional preference alignment of llms via optimal transport

    Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, and Jerret Ross. Distributional preference alignment of llms via optimal transport. arXiv preprint arXiv:2406.05882, 2024. 4, 7, 1

  69. [69]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, pages 947–952, 2019. 4

  70. [70]

    A note on dpo with noisy preferences & relationship to ipo, 2023

    Eric Mitchell. A note on dpo with noisy preferences & relationship to ipo, 2023. 4, 7, 1

  71. [71]

    Seva: Leveraging sketches to evaluate alignment between human and machine visual abstraction

    Kushin Mukherjee, Holly Huey, Xuanchen Lu, Yael Vinker, Rio Aguina-Kang, Ariel Shamir, and Judith Fan. Seva: Leveraging sketches to evaluate alignment between human and machine visual abstraction. NIPS, 36:67138–67155, 2023. 3, 6, 7

  72. [72]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf,

  73. [73]

    Gpt-4o system card

    OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/, 2024. 6

  74. [74]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NIPS, 35:27730–27744, 2022. 3

  75. [75]

    Smaug: Fixing failure modes of preference optimisation with dpo-positive

    Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024. 4

  76. [76]

    Iterative reasoning preference optimization

    Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733, 2024. 2, 3

  77. [77]

    Strengthening multimodal large language model with bootstrapped preference optimization

    Renjie Pi, Tianyang Han, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, and Tong Zhang. Strengthening multimodal large language model with bootstrapped preference optimization. arXiv preprint arXiv:2403.08730, 2024. 2

  78. [78]

    We-math: Does your large multimodal model achieve human-like mathematical reasoning?

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024. 5, 6

  79. [79]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NIPS, 36, 2024. 2, 3, 4, 7, 1

  80. [80]

    Learning multiple visual domains with residual adapters

    Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. NIPS, 30, 2017. 3
