Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

Ana Lucic; Angela van Sprang; Erman Acar; Laurens Samson; Sennay Ghebreab; Yuki M. Asano

arxiv: 2512.08923 · v2 · submitted 2025-12-09 · 💻 cs.AI

Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, Yuki M. Asano

Pith reviewed 2026-05-16 23:49 UTC · model grok-4.3

classification 💻 cs.AI

keywords cross-modal inconsistency · multimodal LLMs · MLLMs · benchmarks · REST · vision-language models · OCR · modality gap

The pith

Multimodal LLMs produce different answers for the same content presented as text or as an image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces REST and REST+ benchmarks to evaluate cross-modal inconsistency in MLLMs. These tests present the same semantic content as images, text, or a mix, and show that current models cannot reason consistently across them. Neither converting between modalities nor accounting for OCR problems resolves the issue. Visual factors like color and resolution, plus the number of vision tokens, influence performance. A consistency score that correlates with the text-image modality gap provides insight into why this happens.

Core claim

MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. The benchmarks contain samples with the same semantic information in three modalities, and evaluations of 15 MLLMs reveal substantial variation in inconsistency. Even with correct OCR, visual characteristics and vision token count affect results, and consistency correlates with the modality gap.

What carries the argument

REST and REST+ benchmarks that render equivalent semantic content in image, text, and mixed modalities to stress-test cross-modal reasoning consistency.

Load-bearing premise

The constructed samples truly contain identical semantic information across modalities without introducing differences in complexity or presentation.

What would settle it

A model that produces identical correct answers for every REST sample across all three modalities would demonstrate that the inconsistency is not inherent to current MLLMs.
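The settling condition above can be phrased as a per-sample check. The following is a minimal sketch, assuming each sample has one extracted answer per modality and a single gold answer. The function name `summarize_consistency`, the dictionary layout, and the three reported rates are illustrative assumptions; this is not the paper's RER score, whose formula is not given here.

```python
from typing import Dict, List

# The three input formats named in the summary above.
MODALITIES = ("text", "image", "mixed")


def summarize_consistency(
    answers: List[Dict[str, str]], gold: List[str]
) -> Dict[str, float]:
    """Hypothetical summary of cross-modal agreement (not the paper's RER).

    answers[i] maps each modality name to the model's answer for sample i.
    gold[i] is the reference answer for sample i.
    """
    if not gold or len(answers) != len(gold):
        raise ValueError("answers and gold must be non-empty and equally long")

    agree = all_correct = any_correct = 0
    for per_modality, target in zip(answers, gold):
        given = [per_modality[m] for m in MODALITIES]
        correct = [a == target for a in given]
        agree += len(set(given)) == 1   # same answer in every modality
        all_correct += all(correct)     # solved in all three modalities
        any_correct += any(correct)     # solved in at least one modality

    n = len(gold)
    return {
        "agreement_rate": agree / n,
        "all_modalities_correct": all_correct / n,
        "any_modality_correct": any_correct / n,
    }
```

Under this sketch, a model meeting the settling condition would score 1.0 on `all_modalities_correct`. The gap between `any_modality_correct` and `all_modalities_correct` is one rough way to quantify how much is lost to inconsistency.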

Figures

Figures reproduced from arXiv: 2512.08923 by Ana Lucic, Angela van Sprang, Erman Acar, Laurens Samson, Sennay Ghebreab, Yuki M. Asano.

**Figure 1.** Summary of our work. Left: Our REST benchmark measures whether MLLMs can consistently reason over identical information across modalities. We first verify text recognition (OCR) capability, then evaluate the same question in three modalities (text, image, mixed). Cross-modal inconsistency occurs when models produce different answers depending on the input format. Center: RER consistency score measures th…

**Figure 2.** Cross-modal inconsistency leaves model potential untapped. This figure shows the cumulative distribution of correctly solved questions across sets of modalities (OCR-correct subset). From left to right, the bars represent: the percentage of questions that can be solved in all three modalities, followed by including questions that can only be solved in fewer modalities, ending with the Max Modal Coverage (g…

**Figure 3.** Models generally achieve higher text accuracy despite using fewer text tokens. Current MLLMs need more vision tokens than text tokens to achieve the same accuracy, except for Qwen2.5-VL-32B, where fewer vision tokens obtain higher accuracy (OCR-correct subset).

**Figure 4.** Colored text makes models perform better. Relative improvements from either red or yellow text compared to black (OCR-correct subset).

5.2. RQ3b: Impact of font, colour on inconsistency. Surprisingly, font families show no clear differences in image accuracy, despite our initial expectations that cursive fonts would be harder to read. Most models stay within 2% absolute difference between fonts (only Phi-3…

**Figure 5.** Samples from our 3-type ImageNet categories dataset for …

**Figure 6.** Benchmark performance is correlated to the similarity of modalities. The similarity between image vs. word (a) and written-down vs. word (b) representations correlates with RER (as determined using our REST benchmark). R² denotes the variance explained by the fitted line, and the grey area shows the bootstrapped 95% confidence interval.

… and report the maximum score across all layers. We repeat this proce…

**Figure 7.** Examples of ARC, GSM8K, and MMLU questions in mixed and image modalities from our REST benchmark. In the mixed …

**Figure 8.** Prompt templates used in the REST benchmark for evaluating cross-modal consistency across MMLU, ARC, GSM-Symbolic. Each modality (text, image, mixed) receives task-specific instructions while maintaining consistent Chain-of-Thought reasoning requirements and standardized answer formatting. The OCR verification prompt (d) ensures that text recognition capabilities are assessed independently from reasoning…

**Figure 9.** Visual permutations in REST+ benchmark showing the same MMLU question rendered with varying resolutions (columns: 50, …

**Figure 10.** SOEBENCH examples with increasing complexity from 3 to 5 variables. Top row: mixed images containing only clue equations, used for the mixed modality where the final equation is presented as text. Bottom row: images including all equations with the final equation used for the image modality. Each puzzle requires finding integer values (1-9) for letter variables that satisfy all equations simultaneously. F…

**Figure 11.** Prompt templates used for SOEBENCH evaluation. Each modality receives specific instructions for solving systems of equations with letter variables. For the mixed modality, clue equations are provided as images while the final equation appears as text. For OCR, we instruct the models for a specific format, as models generate different types of correct output formats.

**Figure 12.** Cumulative distribution of correctly solved questions across modality combinations for models 1-8. Each step represents …

**Figure 13.** Cumulative distribution of correctly solved questions across modality combinations for models 9-15. Models with higher cross …

**Figure 14.** Models that perform well on MMMU also score well on REST and REST+. The zoom inset shows that models with high …
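The Figure 10 caption describes SOEBENCH puzzles as finding integer values (1-9) for letter variables that satisfy all equations simultaneously. A brute-force sketch of that task follows, assuming the equations are linear and encoded as a coefficient map plus a right-hand side. The encoding, the function name `solve_puzzle`, and the example puzzle are hypothetical illustrations, not taken from the benchmark.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical encoding: ({"A": 1, "B": 1}, 7) stands for A + B = 7.
Equation = Tuple[Dict[str, int], int]


def solve_puzzle(variables: List[str], equations: List[Equation]) -> List[Dict[str, int]]:
    """Return every assignment of values 1..9 that satisfies all equations."""
    solutions = []
    for values in product(range(1, 10), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(
            sum(coeff * assignment[name] for name, coeff in coeffs.items()) == rhs
            for coeffs, rhs in equations
        ):
            solutions.append(assignment)
    return solutions


# Invented three-variable example: A + B = 7, B + C = 5, A + C = 8.
example = solve_puzzle(
    ["A", "B", "C"],
    [({"A": 1, "B": 1}, 7), ({"B": 1, "C": 1}, 5), ({"A": 1, "C": 1}, 8)],
)
print(example)
```

With at most 5 variables the search space is at most 9^5 = 59,049 assignments, so exhaustive search is enough to produce reference answers for puzzles of this size.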

Original abstract

We introduce two new benchmarks REST and REST+ (Render-Equivalence Stress Tests) to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed) and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modal inconsistent MLLMs.
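The closing claim of the abstract relates a per-model consistency score to the modality gap, and the Figure 6 caption mentions a similarity between representations and the R² of a fitted line. A minimal sketch of those two quantities is given here, assuming the gap is approximated by cosine similarity between paired embedding vectors and the fit is a simple least-squares line. The function names are hypothetical, and the paper's actual procedure (for example, its choice of layers) is not reproduced.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equally long embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Variance in ys explained by a least-squares line fitted on xs.

    For a simple linear fit this equals the squared Pearson correlation.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either sequence is constant")
    return (sxy * sxy) / (sxx * syy)
```

In use, `xs` would hold one similarity value per model and `ys` the matching consistency score. With only 15 models, any such fit rests on few points, which is presumably why the caption reports a bootstrapped confidence interval.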

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces the REST and REST+ benchmarks to measure cross-modal inconsistency in MLLMs, but rendering artifacts may explain part of the observed gaps.

The paper shows that state-of-the-art MLLMs give different answers to the same semantic content depending on whether it arrives as text, image, or mixed input. The authors back this with two new benchmarks, REST and REST+, that try to hold content fixed while varying the modality. Evaluating 15 models, they find that the inconsistency holds even after checking OCR accuracy, and they link it to factors such as text color, resolution, and vision token count. The correlation between their consistency score and the broader modality gap is a useful observation for thinking about training fixes.

The concrete new piece is the benchmark itself: a repeatable stress test focused on equivalence rather than general capability. The evaluation scale and the attention to visual controls are the parts that feel solid and worth building on.

The softer spot is sample construction. The benchmarks aim for identical semantics across forms, yet the results themselves show that visual characteristics still shift performance even with correct OCR. That suggests the rendering steps may introduce non-semantic differences in complexity or encoding that are not fully isolated. Without tighter ablations demonstrating that those factors are orthogonal to the measured inconsistency, some of the gaps could trace to input artifacts rather than a pure failure of cross-modal reasoning.

Researchers working on MLLM alignment and reliability diagnostics would find the benchmarks practical. The paper deserves peer review because the empirical setup is substantial and the core observation is timely, though the methods will need close checking on how well the modalities are truly matched.

Referee Report

2 major / 3 minor

Summary. The paper introduces REST and REST+ (Render-Equivalence Stress Tests) benchmarks containing identical semantic content presented in image, text, and mixed modalities. It evaluates 15 MLLMs and reports substantial cross-modal inconsistency in reasoning performance, even after controlling for OCR accuracy. The study finds that visual factors (color, resolution, vision token count) influence results, that neither text-to-image nor image-to-text conversion resolves the gaps, and that a proposed consistency score correlates with the text-image modality gap.

Significance. If the sample equivalence holds, the work is significant for highlighting a practical limitation in state-of-the-art MLLMs' ability to maintain consistent reasoning across equivalent inputs in different modalities. The benchmarks provide a concrete evaluation framework that could inform training objectives and architectural choices aimed at closing modality gaps. The correlation between consistency and modality gap offers a potential mechanistic lens, though the empirical nature of the study means impact depends on the robustness of the controls.

major comments (2)

[Benchmark construction] Benchmark construction section: the central claim that REST/REST+ instances encode precisely the same semantic content and reasoning demands across modalities is load-bearing but insufficiently validated. The abstract itself states that visual characteristics (color, resolution) and vision token count affect performance even when OCR is correct, which directly suggests rendering-induced non-semantic differences that could explain observed gaps rather than a pure cross-modal reasoning failure.
[Evaluation and results] Evaluation and results sections: details on sample construction (how semantic equivalence was ensured and verified), statistical significance of performance differences, and explicit ablations isolating visual factors from modality are missing. Without these, the claim that inconsistency is inherent to MLLMs rather than an artifact of the rendering pipeline cannot be fully assessed.

minor comments (3)

[Methods] Provide the exact formula and computation procedure for the consistency score, including how it is aggregated across samples and modalities.
[Experiments] Clarify the selection criteria and size of the 15 evaluated models, and report per-model breakdowns rather than aggregate trends only.
[Results] Add error bars or confidence intervals to performance tables and figures to support claims of substantial variation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional validation and methodological details will strengthen the paper and will incorporate revisions to address the concerns. We respond to each major comment below.

Referee: [Benchmark construction] Benchmark construction section: the central claim that REST/REST+ instances encode precisely the same semantic content and reasoning demands across modalities is load-bearing but insufficiently validated. The abstract itself states that visual characteristics (color, resolution) and vision token count affect performance even when OCR is correct, which directly suggests rendering-induced non-semantic differences that could explain observed gaps rather than a pure cross-modal reasoning failure.

Authors: We agree that the semantic equivalence claim requires stronger validation. REST and REST+ are constructed by starting with text-based reasoning problems and rendering the identical text content into images under controlled conditions, so that the semantic information and required reasoning steps are identical by design. We will revise the benchmark construction section to add: (1) a step-by-step description of the generation pipeline, (2) concrete examples of matched image-text pairs, and (3) results from a human verification study on a random subset of 100 samples confirming that annotators judge the semantic content and reasoning demands to be equivalent. On the visual factors point, the manuscript already reports that inconsistencies persist even when OCR is perfect; we will add explicit discussion clarifying that while rendering parameters influence absolute performance, the cross-modal gaps remain after these controls, supporting a modality-gap interpretation beyond pure rendering artifacts.
revision: yes

Referee: [Evaluation and results] Evaluation and results sections: details on sample construction (how semantic equivalence was ensured and verified), statistical significance of performance differences, and explicit ablations isolating visual factors from modality are missing. Without these, the claim that inconsistency is inherent to MLLMs rather than an artifact of the rendering pipeline cannot be fully assessed.

Authors: We accept that these details are currently insufficient and will expand the evaluation and results sections. Revisions will include: (1) expanded description of how semantic equivalence is ensured and verified (cross-referencing the new benchmark construction details), (2) statistical significance testing (e.g., McNemar’s test for paired modality comparisons) on the reported performance differences, and (3) new ablations that hold modality fixed while systematically varying visual factors such as resolution and color to isolate their contribution from cross-modal effects. These changes will allow readers to better assess whether the inconsistencies are inherent to MLLM modality handling rather than rendering artifacts.
revision: yes
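The rebuttal names McNemar's test for paired modality comparisons without saying which variant is meant. As a rough sketch, the exact (binomial) form can be computed from per-question correctness in two modalities, using only the discordant pairs. The function name `mcnemar_exact` and the input layout are assumptions made for illustration.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    correct_a: Sequence[bool], correct_b: Sequence[bool]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired correctness flags.

    Returns (a_only, b_only, p_value), where a_only counts questions solved
    only in modality A and b_only counts questions solved only in modality B.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both modalities must cover the same questions")

    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    discordant = a_only + b_only
    if discordant == 0:
        return a_only, b_only, 1.0  # no disagreement, nothing to test

    # Under the null hypothesis each discordant pair falls on either side
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    smaller = min(a_only, b_only)
    tail = sum(comb(discordant, k) for k in range(smaller + 1)) / 2 ** discordant
    return a_only, b_only, min(1.0, 2 * tail)
```

Questions that both modalities solve, or both fail, do not enter the statistic. This matches the paired design here, since the same questions are posed in every modality.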

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

The paper introduces REST and REST+ benchmarks and reports empirical performance gaps across modalities in 15 MLLMs. No derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the abstract or described content. The consistency score is computed directly from observed accuracy differences, and claims about modality inconsistency rest on experimental results rather than reducing to inputs by construction. The noted concern about rendering artifacts affects benchmark validity but does not constitute circularity in any derivation step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the standard domain assumption that MLLMs embed vision and language in a shared space yet fail to reason consistently; no free parameters, invented entities, or ad-hoc axioms are introduced beyond benchmark construction.

axioms (1)

domain assumption: MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities.
Stated directly in the abstract as the motivating premise for the benchmarks.

pith-pipeline@v0.9.0 · 5500 in / 1138 out tokens · 77103 ms · 2026-05-16T23:49:13.889119+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce REST and REST+ ... measure whether models produce consistent outputs regardless of modality ... consistency score correlates with the modality gap between text and images

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: the cross-modal cosine similarity in embedding space

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 15 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadal- lah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone, 2024.URL https://arxiv. org/abs/2404.14219, 2: 6, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, S ´ebastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

work page
[5]

Vision-language models struggle to align entities across modalities.arXiv preprint arXiv:2503.03854, 2025

I ˜nigo Alonso, Gorka Azkune, Ander Salaberria, Jeremy Barnes, and Oier Lopez de Lacalle. Vision-language models struggle to align entities across modalities.arXiv preprint arXiv:2503.03854, 2025. 1, 8

work page arXiv 2025
[6]

Claude haiku 4.5.https : / / www

Anthropic. Claude haiku 4.5.https : / / www . anthropic.com, 2025. Large language model by An- thropic. 3, 4, 8

work page 2025
[7]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Omnixr: Evaluating omni-modality language models on reasoning across modal- ities.arXiv preprint arXiv:2410.12219, 2024

Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, et al. Omnixr: Evaluating omni-modality language models on reasoning across modal- ities.arXiv preprint arXiv:2410.12219, 2024. 1, 4, 8

work page arXiv 2024
[10]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Instructblip: Towards general- purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems, 2023. 1

work page 2023
[12]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009. 7

work page 2009
[13]

Mitigate the gap: In- vestigating approaches for improving cross-modal alignment in clip, 2024

Sedigheh Eslami and Gerard de Melo. Mitigate the gap: In- vestigating approaches for improving cross-modal alignment in clip, 2024. 1, 8

work page 2024
[14]

Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InCVPR, 2017. 8

work page 2017
[16]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020. 1, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2009
[17]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InCVPR, 2019. 8

work page 2019
[18]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 2

work page 2023
[20]

Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning.Advances in Neural Information Processing Sys- tems, 35:17612–17625, 2022

Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Ser- ena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning.Advances in Neural Information Processing Sys- tems, 35:17612–17625, 2022. 1, 8

work page 2022
[21]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

work page 2023
[22]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 1

work page 2024
[23]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2023. 1, 8

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts.ICLR, 2024.https://arxiv.org/abs/2310.02255. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Chartqa: A benchmark for question an- swering about charts with visual and logical reasoning

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question an- swering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, 2022. 1

work page 2022
[26]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021. 1

work page 2021
[27]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, On- cel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm- symbolic: Understanding the limitations of mathemati- cal reasoning in large language models.arXiv preprint arXiv:2410.05229, 2024. 1, 2, 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Mistral-small-3.1-24b-instruct.https:// huggingface.co/mistralai/Mistral- Small- 3.1-24B-Instruct-2503, 2025

Mistral AI. Mistral-small-3.1-24b-instruct.https:// huggingface.co/mistralai/Mistral- Small- 3.1-24B-Instruct-2503, 2025. Version 2503. 4

work page 2025
[29]

Gpt-5 mini.https://openai.com, 2025

OpenAI. Gpt-5 mini.https://openai.com, 2025. Compact large language model by OpenAI. 3, 4, 8

work page 2025
[30]

Kakade, and Stephanie Gil

Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham M. Kakade, and Stephanie Gil. Interpreting the linear structure of vision-language model embedding spaces. InSecond Con- ference on Language Modeling, 2025. 1, 8

work page 2025
[31]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1

work page 2021
[32]

Privacy-aware visual language models.arXiv e-prints, pages arXiv–2405, 2024

Laurens Samson, Nimrod Barazani, Sennay Ghebreab, and Yuki M Asano. Privacy-aware visual language models.arXiv e-prints, pages arXiv–2405, 2024. 1, 8

work page 2024
[33]

Large vision-language model alignment and misalignment: A survey through the lens of explainability,

Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, and Mengnan Du. Large vision-language model alignment and misalignment: A survey through the lens of explainability.arXiv preprint arXiv:2501.01346, 2025. 1, 8

work page arXiv 2025
[34]

Implicit multimodal alignment: On the generalization of frozen llms to multi- modal inputs.Advances in Neural Information Processing Systems, 37:130848–130886, 2024

Mustafa Shukor and Matthieu Cord. Implicit multimodal alignment: On the generalization of frozen llms to multi- modal inputs.Advances in Neural Information Processing Systems, 37:130848–130886, 2024. 1, 7, 8

work page 2024
[35]

Can vlms actually see and read? a survey on modality collapse in vision-language models

Mong Yuan Sim, Wei Emma Zhang, Xiang Dai, and Biaoyan Fang. Can vlms actually see and read? a survey on modality collapse in vision-language models. InFindings of the As- sociation for Computational Linguistics: ACL 2025, pages 24452–24470, 2025. 1, 8

work page 2025
[36]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InCVPR,

work page
[37]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Ta- tiana Matejovicova, Alexandre Ram ´e, Morgane Rivi `ere, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Sys- tems, 37:95266–95290, 2024

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Sys- tems, 37:95266–95290, 2024. 8

work page 2024
[40]

DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek- ocr: Contexts optical compression.arXiv preprint arXiv:2510.18234, 2025. 1, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of- experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Multimodal inconsistency reasoning (MMIR): A new benchmark for multimodal reasoning models

Qianqi Yan, Yue Fan, Hongquan Li, Shan Jiang, Yang Zhao, Xinze Guan, Ching-Chen Kuo, and Xin Eric Wang. Multimodal inconsistency reasoning (MMIR): A new benchmark for multimodal reasoning models. ACL, 2025. https://arxiv.org/abs/2502.16033.

[43]

A survey on multimodal large language models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024.

[44]

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. CVPR, 2024.

[45]

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 15134–15186, 2025.

[46]

Xiang Zhang, Senyu Li, Ning Shi, Bradley Hauer, Zijun Wu, Grzegorz Kondrak, Muhammad Abdul-Mageed, and Laks V. S. Lakshmanan. Cross-modal consistency in multimodal large language models, 2024.

[47]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Supplementary Material

Benchmark Implementation Details

We provide complete specifications for reproducing the REST benchmark experiments, including prompt templates and dataset examples. All code, data, and model outputs will be released publicly upon acceptance to facilitate future research on cross-modal consistency.

9.1. REST Images

In Figure 7, we show examples of both mix...

Despite these computational optimisations, all models receive identical Chain-of-Thought prompting instructions to ensure fair comparison. Open-source models follow vLLM's recommended configurations for optimal performance. Experiments run on single-GPU systems: NVIDIA RTX 6000 Ada (48GB VRAM) for most models, and NVIDIA H100 (80GB VRAM) for larger mode...

Do not solve the question; just transcribe the text exactly as it appears.
Do not add extra commentary, only transcribe. Please transcribe now.

(c) OCR verification

Figure 8. Prompt templates used in the REST benchmark for evaluating cross-modal consistency across MMLU, ARC, GSM-Symbolic. Each modality (text, image, mixed) receives task-specific instructions while maintaining consistent Chain-of-Thought reasoning requirements an...

Do not solve or simplify these equations; just transcribe them exactly as they appear.
Retain the same order and use the following numbering, (1), (2), (3) per equation.
List 1 equation per item, for example (1) 3a + 2b + c = 11.
Put each equation on its own line.
Use plain text as output, the operations that you can use are '*', '+', '-' and '='.
Do not add extra commentary, only transcribe equations.
Format your output like so:
(1) 2a + 3b = 10
(2) a + 3c = 30
(3) 2b + 5c = ?
Please transcribe now.

(d) OCR verification

Figure 11. Prompt templates used for SOEBENCH evaluation. Each modality receives specific instructions for solving systems of equations with letter variables. For the mixed modality,...

Extended Results

This section presents comprehensive results for the REST benchmarks, including performance metrics across all evaluation conditions and detailed breakdowns by modality.

10.1. REST Cross-Modal Consistency Analysis

Tables 7 and 8 present RER and CFR scores for the OCR-correct subset and the complete set, respectively. Results are given pe...

As mentioned in the paper, we see that fewer text tokens are needed to achieve the same level of accuracy, or more

Table 14. Image performance on REST+ for different DPI levels (complete set of questions). Image accuracy, RER consistency and OCR correct scores stratified by resolution (50, 100, 200 DPI) for all questions.

Model Img Acc. RER OCR DPI 50 100...
