pith. sign in

arxiv: 2606.08034 · v1 · pith:RYVBQ5SQnew · submitted 2026-06-06 · 💻 cs.CV · cs.AI· cs.CL

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Pith reviewed 2026-06-27 20:09 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords vision-language modelsSTEM benchmarkmultilingual evaluationmodel robustnesssymbolic problemsdynamic benchmarkvisual grounding
0
0 comments X

The pith

A benchmark of executable STEM problem templates reveals that vision-language models have large gaps between their average accuracy and worst-case accuracy across equivalent variations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Sci-Rho, a benchmark built from 4,242 expert-crafted templates in five STEM subjects and seven languages. Each template is Python code that produces multiple problem instances differing in numbers, visuals, shapes, colors, functions, and language while keeping the required reasoning the same. Testing 17 VLMs shows a clear gap: models solve many templates correctly on average but fail on at least one variation for a large fraction of templates. Smaller models lose accuracy when the language changes, while larger and proprietary models stay stable. The same pattern appears in step-level scoring, and attention maps show that the balance of focus between image and text tokens shifts across languages.

Core claim

Sci-Rho supplies 4,242 problem templates (606 per language) implemented as executable Python code that generates 42,420 total instances by varying numerical values, visual patterns, geometric shapes, color schemes, function types, and language. When 17 state-of-the-art VLMs are evaluated, worst-case accuracy (templates solved correctly on every generated variation) is substantially lower than average accuracy. Smaller models exhibit clear performance drops across languages, whereas proprietary and larger models remain robust. Step-level F1 scores display the same average-to-worst-case gap, and inspection of attention heads indicates substantial cross-lingual differences in the relative atten

What carries the argument

Executable Python code templates that generate diverse but reasoning-equivalent problem instances with attached reasoning steps and ground-truth solutions.

If this is right

  • Worst-case accuracy across template variations is a stricter test of robustness than standard average accuracy.
  • Language robustness in VLMs scales with model size, with smaller models showing clear degradation.
  • Step-level reasoning evaluation exposes the same robustness shortfalls as final-answer accuracy.
  • Attention allocation between image and text tokens varies with language, pointing to internal processing differences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes could add explicit consistency losses that penalize different answers on equivalent inputs.
  • The benchmark could be extended to new languages or subjects to test whether the size-based robustness pattern holds more broadly.
  • Models that maintain balanced attention across languages may require architectural changes beyond scale alone.

Load-bearing premise

The many variations produced from each template all require exactly the same reasoning steps even after numbers, shapes, colors, and languages are changed.

What would settle it

A follow-up experiment in which domain experts independently verify that every generated variation from a template demands identical reasoning and the measured gap between average and worst-case accuracy disappears.

Figures

Figures reproduced from arXiv: 2606.08034 by Abdullah Mubarak, Adi Yeltay, Fajri Koto, Ikhlasul Akmal Hanif, Muhammad Falensi Azmi, Vallerie Alexandra Putra.

Figure 1
Figure 1. Figure 1: An illustrative failure case of GPT-5.4 when evaluated across problem variants in our [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Dataset construction pipeline. 2 Related Works Visual Language Models Vision-Language Models (VLMs) have advanced at a remarkable pace, with proprietary models like GPT-5 [Singh et al., 2025] and Gemini 2.5 [Google DeepMind, 2025b] setting the state of the art. The open-source community has kept pace through diverse architectural innovations: Gemma 3’s optimized local-to-global attention ratios [Team et al… view at source ↗
Figure 3
Figure 3. Figure 3: A glimpse of our dataset. (a) An example from our dataset where each template consists of [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Average-case accuracy gap (∆Aavg) relative to English across languages and models. Negative values indicate performance degradation in non-English settings. To evaluate consistency across M variants of a problem, we report the Worst-case Step F1: Fwst = 1 N X N i=1 min j∈[1,M] Fi,j (4) 4.2 Main Result English-only Result [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: IAR trends across subjects and languages for [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of error types for GPT-5.4. Failure Mode Analysis To better understand how VLMs fail to answer our problems, we conducted a manual inspection of 52 templates where GPT-5.4 produced incorrect answers. We categorized the failures into three common error types: (1) figure-reading errors occur when the VLM fails to ex￾tract accurate numerical values or miscounts number of objects within the image … view at source ↗
Figure 7
Figure 7. Figure 7: More examples of variants in our dataset. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
read the original abstract

Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In this work, we introduce Sci-Rho (Science Rhobustness), a dynamic benchmark for visually-grounded STEM problems spanning five subjects and seven languages, comprising 4,242 problem templates (606 per language) crafted by domain experts, including Olympiad medalists. Each template is implemented as executable Python code that generates diverse but equivalent problem instances by varying numerical values, visual patterns, geometric shapes, color schemes, and function types, resulting in 42,420 instances in total, each paired with reasoning steps and ground-truth solutions. We evaluated 17 state-of-the-art VLMs and discovered a noticeable gap between worst-case accuracy (defined as the proportion of problem templates that a model answers correctly across every generated variation) and average accuracy. We also discovered that smaller models show noticeable performance degradation across languages, whereas proprietary and larger models remain robust. Step-level evaluation reflects this same trend, revealing a significant gap between average F1 and worst-case F1 scores. Finally, our inspection of attention heads of a VLM reveals substantial cross-lingual variation in the relative attention allocated to image tokens compared to text tokens. Our work highlights the importance of evaluation beyond static benchmarks as a metric to measure the quality of VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sci-Rho, a dynamic benchmark for visually-grounded STEM problems across five subjects and seven languages. It comprises 4,242 expert-crafted problem templates (606 per language), each implemented as executable Python code generating 10 diverse but equivalent instances by varying numerical values, visual patterns, geometric shapes, color schemes, function types, and language, for a total of 42,420 instances with reasoning steps and ground-truth solutions. Evaluation of 17 VLMs reveals a gap between worst-case accuracy (templates solved correctly on all variations) and average accuracy, with smaller models showing language degradation while larger/proprietary models remain robust; similar gaps appear in step-level F1 scores, and attention analysis shows cross-lingual variation in image vs. text token attention.

Significance. If the generated variations preserve semantic and reasoning equivalence, Sci-Rho would provide a valuable contribution as a multilingual, visually-grounded dynamic benchmark that moves beyond static evaluation to measure VLM robustness in STEM domains, with the reported performance gaps and attention findings offering concrete evidence of current model limitations.

major comments (2)
  1. [Abstract and template construction description] The central worst-case accuracy metric and cross-lingual comparisons rest on the assumption that all 10 generated instances per template are semantically, reasoning-step, and difficulty equivalent despite changes in numbers, visuals, shapes, colors, functions, and language. No section reports independent human validation, inter-rater reliability checks, or formal equivalence verification for the full set of templates and instances; this directly affects whether failures on specific variations indicate lack of robustness rather than non-equivalent test items.
  2. [Step-level evaluation section] The step-level evaluation and F1 score gaps are presented as reflecting the same robustness trends, but without details on how reasoning steps are extracted, aligned across variations, or scored for equivalence, it is unclear whether the reported average vs. worst-case F1 differences are load-bearing or could be artifacts of annotation or alignment choices.
minor comments (2)
  1. [Abstract] The abstract states '42,420 instances in total' but the per-template count (10) and template total (4,242) imply exactly that; confirm the arithmetic and any filtering in the methods section for clarity.
  2. [Evaluation metrics] Notation for worst-case accuracy (proportion of templates correct on every variation) should be formalized with an equation to avoid ambiguity in replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity on equivalence and evaluation procedures.

read point-by-point responses
  1. Referee: [Abstract and template construction description] The central worst-case accuracy metric and cross-lingual comparisons rest on the assumption that all 10 generated instances per template are semantically, reasoning-step, and difficulty equivalent despite changes in numbers, visuals, shapes, colors, functions, and language. No section reports independent human validation, inter-rater reliability checks, or formal equivalence verification for the full set of templates and instances; this directly affects whether failures on specific variations indicate lack of robustness rather than non-equivalent test items.

    Authors: We acknowledge that the manuscript does not report independent human validation or inter-rater reliability metrics for equivalence. Equivalence is ensured by construction: each template was authored by domain experts (including Olympiad medalists) as executable Python code that applies controlled parametric changes while preserving the fixed reasoning structure, semantic content, and difficulty level. We will add a dedicated subsection in the template construction description that details the design process, the specific variation mechanisms, and the internal consistency checks performed by the expert authors during template development. This will make the equivalence argument more explicit without claiming external validation that was not conducted. revision: yes

  2. Referee: [Step-level evaluation section] The step-level evaluation and F1 score gaps are presented as reflecting the same robustness trends, but without details on how reasoning steps are extracted, aligned across variations, or scored for equivalence, it is unclear whether the reported average vs. worst-case F1 differences are load-bearing or could be artifacts of annotation or alignment choices.

    Authors: Reasoning steps are generated directly by the same executable templates, so alignment across the ten variations of each template occurs by construction: every instance follows the identical sequence of symbolic steps with only the instantiated values or visual elements differing. We will expand the step-level evaluation section to explicitly describe (1) how steps are emitted from the code, (2) the template-based alignment procedure, and (3) the exact F1 computation (token-level overlap against the ground-truth step sequence). These additions will clarify that the observed average-versus-worst-case F1 gaps arise from model behavior rather than annotation or alignment artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark construction and empirical evaluation are self-contained.

full rationale

The paper describes expert-crafted templates implemented as Python generators to produce problem instances, followed by direct VLM evaluation on those instances. No equations, fitted parameters, predictions, or derivations are present that reduce to inputs by construction. Claims rest on measured accuracies rather than any self-referential chain or renamed result. Self-citations are absent from the provided text, and the central metrics (worst-case vs. average accuracy) are computed directly from model outputs on the generated set.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new benchmark constructed from expert-written templates and Python generators; it does not rely on fitted parameters, unstated mathematical axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5817 in / 1113 out tokens · 25259 ms · 2026-06-27T20:09:10.449857+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 3 canonical work pages

  1. [1]

    2024 , eprint=

    Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset , author=. 2024 , eprint=

  2. [2]

    2025 , eprint=

    Vision Language Models are Confused Tourists , author=. 2025 , eprint=

  3. [3]

    2026 , eprint=

    Vision Language Models are Biased , author=. 2026 , eprint=

  4. [4]

    2026 , eprint=

    FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning , author=. 2026 , eprint=

  5. [5]

    arXiv preprint arXiv:2405.20797 , year=

    Ovis: Structural embedding alignment for multimodal large language model , author=. arXiv preprint arXiv:2405.20797 , year=

  6. [6]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  7. [7]

    2024 , eprint=

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models , author=. 2024 , eprint=

  8. [8]

    2025 , eprint=

    Qwen2.5-VL Technical Report , author=. 2025 , eprint=

  9. [9]

    2024 , eprint=

    DeepSeek-VL: Towards Real-World Vision-Language Understanding , author=. 2024 , eprint=

  10. [10]

    Introducing LLaMA 4 , year =

  11. [11]

    Qwen3.5: Towards Native Multimodal Agents , year =

  12. [12]

    Introducing Claude 4 , year =

  13. [13]

    2025 , eprint=

    Gemini: A Family of Highly Capable Multimodal Models , author=. 2025 , eprint=

  14. [14]

    2025 , eprint=

    OpenAI GPT-5 System Card , author=. 2025 , eprint=

  15. [15]

    2024 , eprint=

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts , author=. 2024 , eprint=

  16. [17]

    2025 , eprint=

    SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning , author=. 2025 , eprint=

  17. [18]

    2025 , eprint=

    PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models , author=. 2025 , eprint=

  18. [19]

    2025 , eprint=

    PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning , author=. 2025 , eprint=

  19. [20]

    2022 , eprint=

    Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering , author=. 2022 , eprint=

  20. [21]

    2025 , eprint=

    SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems , author=. 2025 , eprint=

  21. [22]

    2024 , eprint=

    Large Language Models Are Not Robust Multiple Choice Selectors , author=. 2024 , eprint=

  22. [23]

    2026 , eprint=

    EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning , author=. 2026 , eprint=

  23. [24]

    2024 , eprint=

    Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations , author=. 2024 , eprint=

  24. [25]

    2025 , eprint=

    GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models , author=. 2025 , eprint=

  25. [26]

    2025 , eprint=

    DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models , author=. 2025 , eprint=

  26. [27]

    2025 , eprint=

    VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models , author=. 2025 , eprint=

  27. [28]

    2025 , eprint=

    AgroBench: Vision-Language Model Benchmark in Agriculture , author=. 2025 , eprint=

  28. [31]

    2024 , eprint=

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. 2024 , eprint=

  29. [32]

    2024 , eprint=

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites , author=. 2024 , eprint=

  30. [33]

    2024 , eprint=

    Improved Baselines with Visual Instruction Tuning , author=. 2024 , eprint=

  31. [34]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  32. [35]

    Gemini 2.5 Pro Model Card , year =

  33. [36]

    Gemini 2.5 Flash Model Card , year =

  34. [37]

    2025 , eprint=

    Gemma 3 Technical Report , author=. 2025 , eprint=

  35. [38]

    2025 , eprint=

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency , author=. 2025 , eprint=

  36. [39]

    2026 , eprint=

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding , author=. 2026 , eprint=

  37. [40]

    SEA-LION v4 Model Documentation , year =

  38. [41]

    2025 , eprint=

    Qwen3-VL Technical Report , author=. 2025 , eprint=

  39. [42]

    2026 , eprint=

    ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation , author=. 2026 , eprint=

  40. [43]

    2026 , eprint=

    Is this chart lying to me? Automating the detection of misleading visualizations , author=. 2026 , eprint=

  41. [44]

    2025 , eprint=

    True Multimodal In-Context Learning Needs Attention to the Visual Context , author=. 2025 , eprint=

  42. [45]

    2025 , eprint=

    Words or Vision: Do Vision-Language Models Have Blind Faith in Text? , author=. 2025 , eprint=

  43. [46]

    2025 , eprint=

    Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? , author=. 2025 , eprint=

  44. [47]

    2025 , eprint=

    MLLMs are Deeply Affected by Modality Bias , author=. 2025 , eprint=

  45. [48]

    Sea-lion v4 model documentation

    AI Singapore . Sea-lion v4 model documentation. https://docs.sea-lion.ai/models/sea-lion-v4, 2026. Accessed: 2026-04-26

  46. [49]

    Qwen3-vl technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  47. [50]

    True multimodal in-context learning needs attention to the visual context, 2025

    Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, and Jindong Gu. True multimodal in-context learning needs attention to the visual context, 2025. URL https://arxiv.org/abs/2507.15807

  48. [51]

    Molmo2: Open weights and data for vision-language models with video understanding and grounding, 2026

    Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, and Ranjay Krishna. Molmo2: Open weights and data for vision-language m...

  49. [52]

    Words or vision: Do vision-language models have blind faith in text?, 2025

    Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text?, 2025. URL https://arxiv.org/abs/2503.02199

  50. [53]

    Gemini 2.5 flash model card

    Google DeepMind . Gemini 2.5 flash model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf, 2025 a . Accessed: 2026-04-26

  51. [54]

    Gemini 2.5 pro model card

    Google DeepMind . Gemini 2.5 pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf, 2025 b . Accessed: 2026-04-26

  52. [55]

    Engtrace: A symbolic benchmark for verifiable process supervision of engineering reasoning, 2026

    Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, and Zhuohan Xie. Engtrace: A symbolic benchmark for verifiable process supervision of engineering reasoning, 2026. URL https://arxiv.org/abs/2511.01650

  53. [56]

    Sciverse: Unveiling the knowledge comprehension and visual reasoning of lmms on multi-modal scientific problems, 2025

    Ziyu Guo, Ray Zhang, Hao Chen, Jialin Gao, Dongzhi Jiang, Jiaze Wang, and Pheng-Ann Heng. Sciverse: Unveiling the knowledge comprehension and visual reasoning of lmms on multi-modal scientific problems, 2025. URL https://arxiv.org/abs/2503.10627

  54. [57]

    Vision language models are confused tourists, 2025

    Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, and Alham Fikri Aji. Vision language models are confused tourists, 2025. URL https://arxiv.org/abs/2511.17004

  55. [58]

    Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022. URL https://arxiv.org/abs/2209.09513

  56. [59]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/2310.02255

  57. [60]

    Introducing llama 4, 2025

    Meta AI . Introducing llama 4, 2025. URL https://ai.meta.com/llama/. Accessed: 2026-04-16

  58. [61]

    Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2025

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2025. URL https://arxiv.org/abs/2410.05229

  59. [62]

    Benchmarking Vision Language Models for Cultural Understanding

    Shravan Nayak, Kanishk Jain, Rabiul Awal, Siva Reddy, Sjoerd Van Steenkiste, Lisa Anne Hendricks, Karolina Stanczak, and Aishwarya Agrawal. Benchmarking vision language models for cultural understanding. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pa...

  60. [63]

    ISBN 979-8-89176-332-6

    Jean De Dieu Nyandwi, Yueqi Song, Simran Khanuja, and Graham Neubig. Grounding multilingual multimodal LLM s with cultural knowledge. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24198--24242, Suzhou, China, November...

  61. [64]

    Phybench: Holistic evaluation of physical perception and reasoning in large language models, 2025

    Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Jiaming Ji, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, J...

  62. [65]

    Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models, 2025

    Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models, 2025. URL https://arxiv.org/abs/2503.23064

  63. [66]

    Agrobench: Vision-language model benchmark in agriculture, 2025

    Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. Agrobench: Vision-language model benchmark in agriculture, 2025. URL https://arxiv.org/abs/2507.20519

  64. [67]

    Openai gpt-5 system card, 2025

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

  65. [68]

    MM - MATH : Advancing multimodal math evaluation with process evaluation and fine-grained classification

    Kai Sun, Yushi Bai, Ji Qi, Lei Hou, and Juanzi Li. MM - MATH : Advancing multimodal math evaluation with process evaluation and fine-grained classification. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1358--1375, Miami, Florida, USA, November 2024. Association ...

  66. [69]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

  67. [70]

    Vision language models are biased, 2026

    An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, and Daeyoung Kim. Vision language models are biased, 2026. URL https://arxiv.org/abs/2505.23941

  68. [71]

    Measuring multimodal mathematical reasoning with math-vision dataset, 2024

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset, 2024. URL https://arxiv.org/abs/2402.14804

  69. [72]

    Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Hou,...

  70. [73]

    Seephys: Does seeing help thinking? -- benchmarking vision-based physics reasoning, 2025

    Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, and Xiaodan Liang. Seephys: Does seeing help thinking? -- benchmarking vision-based physics reasoning, 2025. URL https://arxiv.org/abs/2505.19099

  71. [74]

    Finchain: A symbolic benchmark for verifiable chain-of-thought financial reasoning, 2026

    Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun, Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry, Chen Xu, Haonan Li, Fajri Koto, Ivan Koychev, Tanmoy Chakraborty, Yuxia Wang, Salem Lahlou, Veselin Stoyanov, Sophia Ananiadou, and Presla...

  72. [75]

    Refusal falls off a cliff: How safety alignment fails in reasoning?, 2025

    Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, YunXing, XingYu, and Jinjin Gu. Refusal falls off a cliff: How safety alignment fails in reasoning?, 2025. URL https://arxiv.org/abs/2510.06036

  73. [76]

    Physreason: A comprehensive benchmark towards physics-based reasoning, 2025

    Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, and Jun Liu. Physreason: A comprehensive benchmark towards physics-based reasoning, 2025. URL https://arxiv.org/abs/2502.12054

  74. [77]

    Large language models are not robust multiple choice selectors, 2024

    Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors, 2024. URL https://arxiv.org/abs/2309.03882

  75. [78]

    Mllms are deeply affected by modality bias, 2025

    Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, Linfeng Zhang, Danda Pani Paudel, Xuanjing Huang, Yu-Gang Jiang, Nicu Sebe, Dacheng Tao, Luc Van Gool, and Xuming Hu. Mllms are deeply affected by modality bias, 2025. URL https://arxiv.org/abs/2505.18657

  76. [79]

    Fool your (vision and) language model with embarrassingly simple permutations, 2024

    Yongshuo Zong, Tingyang Yu, Ruchika Chavhan, Bingchen Zhao, and Timothy Hospedales. Fool your (vision and) language model with embarrassingly simple permutations, 2024. URL https://arxiv.org/abs/2310.01651

  77. [80]

    Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models, 2025

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models, 2025. URL https://arxiv.org/abs/2411.00836