Pith · machine review for the scientific record

arxiv: 2411.10442 · v2 · submitted 2024-11-15 · 💻 cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:12 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords multimodal large language models · preference optimization · chain-of-thought reasoning · MathVista · distribution shifts · MMPR dataset · InternVL2

The pith

Mixed Preference Optimization lifts an 8B multimodal model to match a 76B model on MathVista reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that distribution shifts after pre-training and supervised fine-tuning limit chain-of-thought reasoning in multimodal large language models. The authors address this by building an automated pipeline that generates a large-scale preference dataset called MMPR and then applying a Mixed Preference Optimization step to the models. A sympathetic reader would care because the resulting InternVL2-8B-MPO reaches 67.0 accuracy on MathVista, closing most of the gap to a model ten times larger without extra parameters. The work demonstrates that preference optimization can be integrated directly with MLLMs to recover reasoning performance lost to distributional mismatch.

Core claim

We introduce an automated preference data construction pipeline that creates the MMPR dataset and a Mixed Preference Optimization (MPO) method that integrates preference optimization with MLLMs. This process enhances multimodal chain-of-thought performance, so that InternVL2-8B-MPO achieves an accuracy of 67.0 on MathVista, outperforming the base InternVL2-8B by 8.7 points and matching the 10x larger InternVL2-76B.

What carries the argument

Mixed Preference Optimization (MPO), a post-training method that combines preference optimization with MLLMs using the automatically constructed MMPR preference dataset to improve multimodal chain-of-thought reasoning.
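The review describes the mixed objective only at a high level. A minimal sketch of what such a mixture could look like, assuming it weights a DPO-style preference term, a BCO-style quality term, and an SFT-style generation term over sequence log-probabilities; the helper names, default weights, and scalar interface below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_term(logp_c, logp_r, ref_c, ref_r, beta=0.1):
    # DPO-style preference loss: -log sigmoid(beta * (policy margin - reference margin))
    margin = (logp_c - ref_c) - (logp_r - ref_r)
    return -math.log(sigmoid(beta * margin))

def bco_term(logp, ref, label, beta=0.1):
    # BCO-style quality loss: classify one response as chosen (1) or rejected (0)
    p = sigmoid(beta * (logp - ref))
    return -math.log(p if label == 1 else 1.0 - p)

def sft_term(logp_c, n_tokens):
    # generation loss: mean negative log-likelihood of the chosen response
    return -logp_c / n_tokens

def mixed_preference_loss(logp_c, logp_r, ref_c, ref_r, n_tokens,
                          w_p=0.8, w_q=0.1, w_g=0.1):
    # illustrative weighted sum of preference, quality, and generation terms
    l_p = dpo_term(logp_c, logp_r, ref_c, ref_r)
    l_q = 0.5 * (bco_term(logp_c, ref_c, 1) + bco_term(logp_r, ref_r, 0))
    l_g = sft_term(logp_c, n_tokens)
    return w_p * l_p + w_q * l_q + w_g * l_g
```

The point of the mixture is that the preference term alone only shapes relative margins; the generation term keeps the absolute likelihood of the chosen chain-of-thought from collapsing during optimization.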

Load-bearing premise

The automated preference data construction pipeline produces high-quality, unbiased multimodal reasoning examples that effectively mitigate distribution shifts without introducing new artifacts that degrade performance.

What would settle it

If applying MPO to InternVL2-8B produces no gain or a drop in MathVista accuracy relative to the base InternVL2-8B model, the central claim would be falsified.

Original abstract

Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach enhances the multimodal reasoning abilities of both InternVL2-8B and InternVL2-76B. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10$\times$ larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Mixed Preference Optimization (MPO) for multimodal large language models to address distribution shifts limiting Chain-of-Thought reasoning. It presents an automated pipeline for constructing the large-scale MMPR multimodal reasoning preference dataset by generating pairs from CoT trajectories via an LLM judge, then applies MPO to InternVL2 models. The central empirical claim is that InternVL2-8B-MPO reaches 67.0 accuracy on MathVista (+8.7 over the base InternVL2-8B) and matches the performance of the 10x larger InternVL2-76B, with code, data, and models released.
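The pair-generation step summarized above can be sketched as follows. This is a simplified stand-in for the paper's pipeline, assuming the correctness-based branch in which sampled CoT rollouts are scored against a known final answer; the answer-extraction heuristic and all names here are invented for illustration:

```python
def extract_answer(cot: str) -> str:
    # naive final-answer extraction: the text after the last "Answer:" marker
    marker = "Answer:"
    return cot.rsplit(marker, 1)[-1].strip() if marker in cot else ""

def build_preference_pairs(question, rollouts, gold_answer):
    # split sampled chain-of-thought rollouts into chosen (correct final
    # answer) and rejected (incorrect), then pair every chosen with every rejected
    chosen = [r for r in rollouts if extract_answer(r) == gold_answer]
    rejected = [r for r in rollouts if extract_answer(r) != gold_answer]
    return [
        {"question": question, "chosen": c, "rejected": r}
        for c in chosen for r in rejected
    ]

rollouts = [
    "The area is 3*4.\nAnswer: 12",
    "The area is 3+4.\nAnswer: 7",
    "Multiply the sides: 3*4.\nAnswer: 12",
]
# two correct rollouts x one incorrect rollout -> two preference pairs
pairs = build_preference_pairs("Area of a 3x4 rectangle?", rollouts, "12")
```

Cross-pairing every correct rollout with every incorrect one is one design choice; deduplication and length balancing would matter at scale but are omitted here.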

Significance. If the reported gains prove causal to MPO rather than data artifacts, the work offers a scalable, open-source route to stronger multimodal reasoning without model scaling. The public release of the MMPR dataset and MPO implementation is a concrete community asset that enables follow-up ablations and extensions in preference optimization for vision-language models.

major comments (3)
  1. [Section 3] Section 3 (data construction pipeline): the description of MMPR generation via LLM judge on CoT trajectories provides no quantitative checks for test-set overlap with MathVista, no human-verified error rate on chosen/rejected pairs, and no contamination analysis. This directly undermines the causal claim that the +8.7 point MathVista gain stems from MPO rather than leakage or label noise.
  2. [Experimental results] Experimental results section: no ablation isolating MPO from supervised fine-tuning on the same MMPR data is reported. Without this comparison, it remains unclear whether the performance delta is attributable to the mixed preference objective or simply to additional multimodal CoT training data.
  3. [Results] Results tables and text: benchmark numbers (e.g., 67.0 on MathVista) are presented without multiple-run statistics, standard deviations, or significance tests, leaving the robustness of the headline improvement only moderately supported.
minor comments (2)
  1. [Methods] Clarify the precise formulation of 'Mixed' Preference Optimization (e.g., how the mixing coefficient or loss terms are defined) in the methods section, as the abstract description is high-level.
  2. [Related Work] Add explicit references to prior multimodal preference optimization works in the related-work section to better situate MPO.
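Major comment 1 asks for contamination checks that the reviewed version does not report. A minimal version of such a check, normalized word n-gram overlap between training prompts and benchmark questions, could look like this; the n-gram length and the normalization are illustrative choices, and the sample strings are fabricated:

```python
import re

def ngrams(text: str, n: int = 8) -> set:
    # lowercase, strip punctuation, and collect word n-grams
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlaps(train_prompts, test_questions, n=8):
    # flag any training prompt sharing at least one n-gram with a test question
    test_grams = set()
    for q in test_questions:
        test_grams |= ngrams(q, n)
    return [p for p in train_prompts if ngrams(p, n) & test_grams]

test_qs = ["What is the value of x in the diagram if the angle is 40 degrees?"]
train = [
    "Compute the value of x in the diagram if the angle is 40 degrees, showing work.",
    "Describe the chart trend between 2010 and 2020.",
]
flagged = flag_overlaps(train, test_qs)  # only the first prompt is flagged
```

An exact-match or n-gram audit like this catches verbatim leakage; paraphrase-level contamination would need embedding similarity on top.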

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each of the major comments point-by-point below and have revised the paper accordingly to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (data construction pipeline): the description of MMPR generation via LLM judge on CoT trajectories provides no quantitative checks for test-set overlap with MathVista, no human-verified error rate on chosen/rejected pairs, and no contamination analysis. This directly undermines the causal claim that the +8.7 point MathVista gain stems from MPO rather than leakage or label noise.

    Authors: We appreciate the referee's concern regarding potential data leakage or noise in the MMPR dataset. The construction pipeline in Section 3 generates preference pairs from CoT trajectories using sources that are designed to be disjoint from MathVista. However, to directly address this, we have conducted additional analyses and will include in the revised manuscript: (1) explicit quantitative checks confirming zero overlap with the MathVista test set, (2) human verification results on a random sample of 100 chosen/rejected pairs showing an error rate below 5%, and (3) a contamination analysis. These additions will support the causal attribution to MPO. revision: yes

  2. Referee: [Experimental results] Experimental results section: no ablation isolating MPO from supervised fine-tuning on the same MMPR data is reported. Without this comparison, it remains unclear whether the performance delta is attributable to the mixed preference objective or simply to additional multimodal CoT training data.

    Authors: We agree that isolating the contribution of the MPO objective is crucial. In the revised version, we have added an ablation study in the Experimental results section that compares MPO directly against supervised fine-tuning (SFT) using the same MMPR dataset. The results show that MPO provides further improvements over SFT alone, confirming the benefit of the mixed preference optimization approach. revision: yes

  3. Referee: [Results] Results tables and text: benchmark numbers (e.g., 67.0 on MathVista) are presented without multiple-run statistics, standard deviations, or significance tests, leaving the robustness of the headline improvement only moderately supported.

    Authors: We acknowledge that reporting statistical measures would enhance the robustness of our results. We have rerun the key experiments across multiple random seeds and will update the results tables and text in the revised manuscript to include means, standard deviations, and appropriate significance tests for the reported improvements. revision: yes
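The statistical reporting promised in response 3 is cheap to produce once per-seed scores exist. A sketch using per-seed accuracies and a paired sign-flip permutation test; the accuracy numbers below are made up for illustration and are not the paper's results:

```python
import random
from statistics import mean, stdev

def paired_permutation_p(deltas, iters=10000, seed=0):
    # two-sided sign-flip test on per-seed score differences:
    # under the null, each delta's sign is equally likely to flip
    rng = random.Random(seed)
    observed = abs(mean(deltas))
    hits = 0
    for _ in range(iters):
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return hits / iters

base = [58.1, 58.6, 58.0, 58.4]   # hypothetical per-seed accuracies (base model)
mpo  = [66.8, 67.2, 66.9, 67.1]   # hypothetical per-seed accuracies (MPO model)
deltas = [m - b for m, b in zip(mpo, base)]
print(f"MPO: {mean(mpo):.1f} +/- {stdev(mpo):.2f}, "
      f"delta {mean(deltas):.1f}, p = {paired_permutation_p(deltas):.4f}")
```

With only four seeds the smallest attainable p-value from a sign-flip test is 2/2^4 = 0.125, which is itself an argument for running more seeds than the minimum.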

Circularity Check

0 steps flagged

No circularity; empirical gains on independent external benchmarks

full rationale

The paper's derivation consists of (1) an automated pipeline generating MMPR preference pairs from CoT trajectories judged by an LLM and (2) application of the MPO objective to fine-tune InternVL2 models. The headline result (67.0 on MathVista) is measured on a held-out public benchmark whose test set is not part of the MMPR construction process. No equation or claim reduces by definition to a fitted parameter, self-citation chain, or renamed input; the performance delta is an external measurement rather than a statistical tautology. Minor self-references to prior InternVL2 work exist but are not load-bearing for the MPO derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical effectiveness of preference optimization transferred to multimodal models and the assumption that the automated dataset construction yields useful training signals.

axioms (1)
  • domain assumption Preference optimization frameworks developed for text models transfer effectively to multimodal models when applied to reasoning tasks.
    The paper applies existing PO methods to MLLMs without providing new theoretical analysis of why the transfer succeeds.

pith-pipeline@v0.9.0 · 5561 in / 1213 out tokens · 42939 ms · 2026-05-16T09:12:22.032339+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

    cs.LG 2026-04 unverdicted novelty 7.0

    PND reduces object hallucination in VLMs via a dual-path contrast during decoding that amplifies visual features and penalizes linguistic priors, achieving reported SOTA results on POPE, MME, and CHAIR without retraining.

  2. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  3. Visual Preference Optimization with Rubric Rewards

    cs.CV 2026-04 unverdicted novelty 7.0

    rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

  4. Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...

  5. VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

    cs.CV 2026-04 unverdicted novelty 7.0

    VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...

  6. ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

    cs.RO 2026-02 unverdicted novelty 7.0

    ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

  7. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  8. Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    SPUR benchmark reveals that current multimodal large language models significantly underperform on expert-level perception, cross-panel understanding, and reasoning tasks with complex scientific experimental images.

  9. MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    MONETA is the first multimodal benchmark for industry classification using text and geographic sources, with MLLM baselines at 62-74% accuracy and up to 22.8% gains from multi-turn context enrichment and explanations.

  10. OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling

    cs.AI 2026-02 unverdicted novelty 6.0

    OOWM models the world as an explicit symbolic tuple with UML diagrams and trains via SFT plus GRPO to outperform text-based CoT on embodied planning benchmarks.

  11. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  12. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  13. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  14. Make Your LVLM KV Cache More Lightweight

    cs.CV 2026-05 unverdicted novelty 5.0

    LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

  15. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  16. DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

    cs.AI 2026-04 unverdicted novelty 5.0

    DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...

  17. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  18. MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

    cs.CY 2026-04 unverdicted novelty 4.0

    MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.

Reference graph

Works this paper leans on

195 extracted references · 195 canonical work pages · cited by 18 Pith papers · 35 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NIPS, 35:23716–23736, 2022. 3

  3. [3]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021. 8

  4. [4]

    A general theoretical paradigm to understand learning from human preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bi- lal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In Interna- tional Conference on Artificial Intelligence and Statistics , pages 4447–4455. PMLR, 2024. 3, 4, 7, 1

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 1, 3

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,

  7. [7]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 3

  8. [8]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In ICCV, pages 4291–4301, 2019. 4

  9. [9]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired compar- isons. Biometrika, 39(3/4):324–345, 1952. 3, 4

  10. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NIPS, 2020. 1

  11. [11]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024. 1, 3

  12. [12]

    An augmented benchmark dataset for geometric question answering through dual parallel text encoding

    Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In COLING, pages 1511–1520, 2022. 4

  13. [13]

    Mapqa: A dataset for question answering on choropleth maps

    Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler- Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022. 4

  14. [14]

    Noise contrastive alignment of language models with explicit re- wards

    Huayu Chen, Guande He, Hang Su, and Jun Zhu. Noise contrastive alignment of language models with explicit re- wards. arXiv preprint arXiv:2402.05369, 2024. 4

  15. [15]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Ed- wards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 8

  16. [16]

    M3cot: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought

    Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought. arXiv preprint arXiv:2405.16473, 2024. 4, 5, 6, 7, 2

  17. [17]

    The- oremqa: A theorem-driven question answering dataset

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. The- oremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7889–7901, 2023. 8

  18. [18]

    Dress: Instructing large vision-language models to align and interact with humans via natural lan- guage feedback

    Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. Dress: Instructing large vision-language models to align and interact with humans via natural lan- guage feedback. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition , pages 14239–14250, 2024. 2

  19. [19]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 3

  20. [20]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open- source suites. arXiv preprint arXiv:2404.16821, 2024. 1, 3, 6

  21. [21]

    Provably robust dpo: Aligning language models with noisy feedback

    Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natara- jan. Provably robust dpo: Aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409, 2024. 3, 4, 7, 1

  22. [22]

    Simple and effec- tive multi-paragraph reading comprehension

    Christopher Clark and Matt Gardner. Simple and effec- tive multi-paragraph reading comprehension. InACL, pages 845–855, 2018. 4

  23. [23]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Train- ing verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 8

  24. [24]

    Instructblip: Towards general- purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. NIPS, 36, 2024. 1

  25. [25]

    En- hancing large vision language models with self-training on image comprehension

    Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quan- quan Gu, James Zou, Kai-Wei Chang, and Wei Wang. En- hancing large vision language models with self-training on image comprehension. arXiv preprint arXiv:2405.19716 ,

  26. [26]

    Rlhf workflow: From reward mod- eling to online rlhf.arXiv preprint arXiv:2405.07863, 2024

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward mod- eling to online rlhf.arXiv preprint arXiv:2405.07863, 2024. 3

  27. [27]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  28. [28]

    G-llava: Solving geometric prob- lem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023

    Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wan- jun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric prob- lem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023. 3, 4

  29. [29]

    Learn your reference model for real good alignment

    Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656, 2024. 3, 7, 1

  30. [30]

    Making the v in vqa matter: El- evating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: El- evating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017. 4

  31. [31]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020. 8

  32. [32]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021. 8

  33. [33]

    ORPO: Monolithic Preference Optimization without Reference Model

    Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Mono- lithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2(4):5, 2024. 3, 4, 7, 1

  34. [34]

    Icdar2019 com- petition on scanned receipt ocr and information extraction

    Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthe- nis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 com- petition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019. 4

  35. [35]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019. 4

  36. [36]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly super- vised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017. 8

  37. [37]

    Binary classifier optimization for large language model alignment

    Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. arXiv preprint arXiv:2404.04656, 2024. 4, 7, 1

  38. [38]

    Dvqa: Understanding data visualizations via ques- tion answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via ques- tion answering. In CVPR, pages 5648–5656, 2018. 4

  39. [39]

    Geomverse: A systematic evaluation of large models for geometric reasoning

    Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023. 4

  40. [40]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016. 4

  41. [41]

    Natural questions: a benchmark for question answering re- search

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Ep- stein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering re- search. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 8

  42. [42]

    RACE: Large-scale ReAding Comprehension Dataset From Examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading com- prehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017. 8

  43. [43]

    Step-dpo: Step-wise preference op- timization for long-chain reasoning of llms

    Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xian- gru Peng, and Jiaya Jia. Step-dpo: Step-wise preference op- timization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024. 2, 3

  44. [44]

    Obelics: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurenc ¸on, Lucile Saulnier, L´eo Tronchon, Stas Bek- man, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. NIPS, 36, 2024. 1

  45. [45]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1, 6

  46. [46]

    Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In ICML, pages 12888–12900, 2022. 3

  47. [47]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023. 1, 3

  48. [48]

    Silkie: Preference distillation for large visual lan- guage models

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual lan- guage models. arXiv preprint arXiv:2312.10665, 2023. 2, 3

  49. [49]

    Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text

    Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024. 1, 3

  50. [50]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, pages 292–305,

  51. [51]

    Moma: Efficient early-fusion pre-training with mixture of modality-aware experts

    Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, and Armen Aghajanyan. Moma: Efficient early-fusion pre-training with mixture of modality-aware experts. arXiv preprint arXiv:2407.21770, 2024. 3

  52. [52]

    Clevr-math: A dataset for compositional language, visual and mathematical reasoning

    Adam Dahlgren Lindström and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022. 4

  53. [53]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 1, 6

  54. [54]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NIPS, 36, 2023. 1, 3

  55. [55]

    Statistical rejection sampling improves preference optimization

    Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657, 2023. 3, 4, 7, 1

  56. [56]

    Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity

    Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, et al. Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity. arXiv preprint arXiv:2407.15838, 2024. 1, 3

  57. [57]

    Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language

    Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 2023. 3

  58. [58]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 1

  59. [59]

    Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165,

  60. [60]

    Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 4

  61. [61]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NIPS, 35:2507–2521, 2022. 4

  62. [62]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 1, 2, 5, 6

  63. [63]

    Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training

    Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jifeng Dai, Yu Qiao, and Xizhou Zhu. Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. arXiv preprint arXiv:2410.08202, 2024. 3

  64. [64]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019. 4

  65. [65]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, pages 2263–2279, 2022. 4

  66. [66]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, pages 2200–2209, 2021. 4

  67. [67]

    Infographicvqa

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022. 4

  68. [68]

    Distributional preference alignment of llms via optimal transport

    Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, and Jerret Ross. Distributional preference alignment of llms via optimal transport. arXiv preprint arXiv:2406.05882, 2024. 4, 7, 1

  69. [69]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, pages 947–952, 2019. 4

  70. [70]

    A note on dpo with noisy preferences & relationship to ipo, 2023

    Eric Mitchell. A note on dpo with noisy preferences & relationship to ipo, 2023. 4, 7, 1

  71. [71]

    Seva: Leveraging sketches to evaluate alignment between human and machine visual abstraction

    Kushin Mukherjee, Holly Huey, Xuanchen Lu, Yael Vinker, Rio Aguina-Kang, Ariel Shamir, and Judith Fan. Seva: Leveraging sketches to evaluate alignment between human and machine visual abstraction. NIPS, 36:67138–67155, 2023. 3, 6, 7

  72. [72]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf,

  73. [73]

    Gpt-4o system card

    OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/, 2024. 6

  74. [74]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NIPS, 35:27730–27744, 2022. 3

  75. [75]

    Smaug: Fixing failure modes of preference optimisation with dpo-positive

    Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024. 4

  76. [76]

    Iterative reasoning preference optimization

    Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733, 2024. 2, 3

  77. [77]

    Strengthening multimodal large language model with bootstrapped preference optimization

    Renjie Pi, Tianyang Han, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, and Tong Zhang. Strengthening multimodal large language model with bootstrapped preference optimization. arXiv preprint arXiv:2403.08730, 2024. 2

  78. [78]

    We-math: Does your large multimodal model achieve human-like mathematical reasoning?

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024. 5, 6

  79. [79]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NIPS, 36, 2024. 2, 3, 4, 7, 1

  80. [80]

    Learning multiple visual domains with residual adapters

    Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. NIPS, 30, 2017. 3
