
arxiv: 2410.14702 · v2 · submitted 2024-10-06 · 💻 cs.AI · cs.CL

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

Pith reviewed 2026-05-23 20:01 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords multi-modal large language models · mathematical reasoning · visual comprehension · spatial reasoning · benchmark evaluation · pattern recognition · abstract reasoning · MLLM performance

The pith

Multi-modal models gain only about 4 percent when diagrams are replaced by text descriptions on a new math reasoning benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PolyMATH, a benchmark of 5,000 manually collected images spanning ten categories of cognitive challenges including pattern recognition, spatial reasoning, and relative reasoning. Comprehensive tests of fifteen MLLMs across four prompting methods produce top scores of roughly 41 percent for Claude-3.5 Sonnet, 36 percent for GPT-4o, and 27 percent for Gemini-1.5 Pro. An ablation replacing images with textual descriptions yields only a four percent average lift, which the authors interpret as evidence that models fail to extract spatial information from diagrams and therefore commit logical errors in extended reasoning. OpenAI o1 models reach performance levels comparable to the human baseline, underscoring the benchmark's difficulty.

Core claim

The central claim is that MLLMs do not truly comprehend visual diagrams and the spatial information they contain. This is shown by the low overall scores on PolyMATH and by the ablation result that textual descriptions produce only marginal gains over the actual images, leaving models prone to logical errors on tasks requiring drawn-out high-level reasoning.

What carries the argument

The PolyMATH benchmark itself, built from 5,000 high-quality images across ten distinct categories of textual and visual cognitive challenges.

If this is right

  • Current MLLMs remain limited in handling spatial relations within diagrams even when prompted with Chain-of-Thought or Step-Back strategies.
  • Models are prone to logical errors once tasks require integrating visual details over multiple steps.
  • The benchmark can serve as a diagnostic tool to measure progress in visual abstraction beyond current training regimes.
  • That OpenAI o1 models only match the human baseline, rather than exceed it, suggests that scaling alone has not closed the gap on these visual reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training corpora may contain too few examples of complex diagram-based spatial relations, limiting what models can internalize from image-text pairs.
  • Targeted pretraining on diagram parsing or synthetic spatial puzzles could be tested as a direct remedy for the observed weaknesses.
  • Extending the benchmark with dynamic or interactive diagrams would reveal whether the current static-image limitation is fundamental.

Load-bearing premise

The 5,000 manually collected images form a fair, unbiased, and appropriately difficult sample of the visual and cognitive challenges that matter for mathematical reasoning.

What would settle it

A new model achieving above 70 percent accuracy on the full PolyMATH set or showing more than 15 percent improvement when given textual descriptions instead of the original images would undermine the claim of weak visual diagram comprehension.
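
The two thresholds above can be written down as a small check. The following is a minimal sketch, not anything taken from the paper: the `ItemResult` record, the function names, and the reading of "15 percent improvement" as a gap of 15 percentage points between description accuracy and image accuracy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    """Per-question outcome for one model (hypothetical record layout)."""
    correct_with_image: bool        # answered correctly from the original diagram
    correct_with_description: bool  # answered correctly from the text description


def accuracy(flags):
    """Share of True values, expressed as a percentage."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one result")
    return 100.0 * sum(flags) / len(flags)


def settle_check(results, acc_threshold=70.0, lift_threshold=15.0):
    """Apply the two thresholds named in the text.

    Assumption: the lift is measured in percentage points
    (description accuracy minus image accuracy).
    """
    img = accuracy(r.correct_with_image for r in results)
    txt = accuracy(r.correct_with_description for r in results)
    lift = txt - img
    return {
        "image_accuracy": img,
        "description_accuracy": txt,
        "lift": lift,
        "undermines_claim": img > acc_threshold or lift > lift_threshold,
    }


# Toy data only: 10 questions, 4 solved from the image, 5 from the description.
toy = [ItemResult(i < 4, i < 5) for i in range(10)]
print(settle_check(toy))
```

On the toy data the lift is 10 points and the image accuracy is 40 percent, so neither threshold is crossed. A real check would run over the benchmark's own per-question results.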

Figures

Figures reproduced from arXiv: 2410.14702 by Chitta Baral, Himanshu Gupta, Kevin Scaria, Mihir Parmar, Shreyas Verma, Swaroop Mishra, Ujjwala Anantheswaran.

Figure 1: Examples of the reasoning patterns employed by MLLMs when faced with questions
Figure 2: An overview of PolyMATH's distribution and difficulty. (a) The per-category split of the 5,000 questions in the dataset, along with the split of with-diagram (WD) and without-diagram (WoD) questions for each category. (b) The per-category performance of various MLLMs.
Figure 3: Examples of with-diagram and without-diagram questions. In addition to the question image, PolyMATH includes the metadata shown above. Questions without diagrams are not present in test-img, while both kinds of questions are present in testmini.
Figure 4: Frequency of Logical Flaw (LF) and Spatial Misunderstanding (SM) errors across different …
Figure 5: Questions belonging to the figure_completion (FC) category
Figure 6: Questions belonging to the logical_reasoning (LR) category
Figure 7: Questions belonging to the mathematical_reasoning (MR) category
Figure 8: Questions belonging to the numerical_reasoning (NR) category
Figure 9: Questions belonging to the odd_one_out (OD) category
Figure 10: Questions belonging to the pattern_recognition (PR) category
Figure 11: Questions belonging to the perspective_shift (PS) category
Figure 12: Questions belonging to the relative_reasoning (RR) category
Figure 13: Questions belonging to the sequence_completion (SC) category
Figure 14: Questions belonging to the spatial_reasoning (SR) category
Figure 15: Erroneous model reasoning patterns observed on an FC question
Figure 16: Erroneous model reasoning patterns observed on an LR question
Figure 17: Erroneous model reasoning patterns observed on an MR question
Figure 18: Erroneous model reasoning patterns observed on an NR question
Figure 19: Erroneous model reasoning patterns observed on an OD question
Figure 20: Erroneous model reasoning patterns observed on a PR question
Figure 21: Erroneous model reasoning patterns observed on a PS question
Figure 22: Erroneous model reasoning patterns observed on an RR question
Figure 23: Erroneous model reasoning patterns observed on an SC question
Figure 24: Erroneous model reasoning patterns observed on an SR question
Original abstract

Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities in various domains, but their visual comprehension and abstract reasoning skills remain under-evaluated. To this end, we present PolyMATH, a challenging benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. PolyMATH comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning. We conducted a comprehensive, and quantitative evaluation of 15 MLLMs using four diverse prompting strategies, including Chain-of-Thought and Step-Back. The best scores achieved on PolyMATH are ~41%, ~36%, and ~27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively - highlighting the logical and visual complexity of these questions. A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning. This is further strengthened by our ablation study estimating MLLM performance when given textual descriptions in place of diagrams. As evidenced by ~4% improvement over textual descriptions as opposed to actual images, we discover that models do not truly comprehend visual diagrams and the spatial information therein, and are thus prone to logical errors. Finally, we evaluate the OpenAI o1 models and find that their performance only matches the human baseline, highlighting the difficulty of the benchmark. The results on PolyMATH highlight the room for improvement in multi-modal reasoning and provide unique insights to guide the development of future MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces PolyMATH, a benchmark consisting of 5,000 manually collected images across 10 categories of cognitive challenges (pattern recognition, spatial reasoning, relative reasoning, etc.) to evaluate MLLMs on multi-modal mathematical reasoning. It reports evaluations of 15 MLLMs under four prompting strategies, with top scores of ~41% (Claude-3.5 Sonnet), ~36% (GPT-4o), and ~27% (Gemini-1.5 Pro); an ablation replacing images with textual descriptions yields only ~4% improvement, supporting the claim that models fail to comprehend visual diagrams and spatial information; o1 models match but do not exceed a human baseline.

Significance. If the 5,000-image sample is representative and the ablation comparison is well-controlled, the results would establish a valuable, high-difficulty benchmark exposing clear gaps in current MLLMs' spatial and multi-step visual reasoning, with the error analysis and human/o1 comparisons providing concrete guidance for future work. The scale and category coverage are strengths.

major comments (3)
  1. [Benchmark construction] (abstract and data-collection section) The images are described only as 'manually collected' across 10 categories with no sourcing protocol, inclusion/exclusion criteria, diversity metrics, or verification steps against selection bias. This is load-bearing for the central claim, because both the headline ~41% ceiling and the ~4% image-vs-text gap could arise from non-representative sampling of unusually difficult cases rather than a fundamental deficit in diagram comprehension.
  2. [Ablation study] (results section) The ~4% performance lift when substituting textual descriptions for images is used to conclude that 'models do not truly comprehend visual diagrams,' yet the manuscript supplies no information on how the textual descriptions were generated, their fidelity to the original diagrams, or any validation that they preserve all spatial relations. Without these controls the ablation cannot isolate visual comprehension.
  3. [Experimental setup] (evaluation protocol) No inter-annotator agreement, difficulty calibration procedure, or statistical tests (e.g., confidence intervals or significance of the 4% gap) are reported for the quantitative scores or human baseline. These omissions undermine the reliability of the performance numbers that underpin the main conclusions.
minor comments (1)
  1. [Abstract] The abstract lists 'four diverse prompting strategies' but only names Chain-of-Thought and Step-Back; the remaining two should be enumerated for reproducibility.
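
The inter-annotator agreement asked for in major comment 3 is often summarized with Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. The sketch below is a plain-Python illustration on made-up labels; the function name and the toy data are assumptions, and nothing here reflects how the paper's human baseline was actually annotated.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both annotators use one identical label
        # throughout; returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels only (e.g. the answer option each annotator marked as correct).
annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75) against a chance level of 0.5, which gives a kappa of 0.5.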

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional methodological details would strengthen the paper. We address each major comment below and will incorporate the suggested clarifications and controls in a revised manuscript.

Point-by-point responses
  1. Referee: [Benchmark construction] The images are described only as 'manually collected' across 10 categories with no sourcing protocol, inclusion/exclusion criteria, diversity metrics, or verification steps against selection bias. This is load-bearing for the central claim, because both the headline ~41% ceiling and the ~4% image-vs-text gap could arise from non-representative sampling of unusually difficult cases rather than a fundamental deficit in diagram comprehension.

    Authors: We agree that the current description of data collection is insufficiently detailed. In the revision we will add a dedicated subsection describing the sourcing (publicly available cognitive puzzle repositories and textbooks), explicit inclusion criteria (clear diagrams, unambiguous ground-truth answers, coverage of the 10 categories), diversity metrics (balanced distribution across categories and difficulty levels as measured by pilot human accuracy), and verification (two independent annotators reviewed each item for bias and clarity, with disagreements resolved by discussion). These additions will directly support the representativeness of the 5,000-image sample.
    Revision: yes

  2. Referee: [Ablation study] The ~4% performance lift when substituting textual descriptions for images is used to conclude that 'models do not truly comprehend visual diagrams,' yet the manuscript supplies no information on how the textual descriptions were generated, their fidelity to the original diagrams, or any validation that they preserve all spatial relations. Without these controls the ablation cannot isolate visual comprehension.

    Authors: We acknowledge the omission of generation and validation details for the textual descriptions. The revision will specify that descriptions were produced by human experts following a template that required explicit enumeration of all spatial relations, object positions, and visual elements while omitting the solution; a separate validation set of 200 items was rated by two additional annotators for completeness (mean fidelity score 4.7/5), with inter-annotator agreement reported. These controls will be added to the ablation section to better isolate the contribution of visual input.
    Revision: yes

  3. Referee: [Experimental setup] No inter-annotator agreement, difficulty calibration procedure, or statistical tests (e.g., confidence intervals or significance of the 4% gap) are reported for the quantitative scores or human baseline. These omissions undermine the reliability of the performance numbers that underpin the main conclusions.

    Authors: We agree these statistical and reliability details are missing. The revised experimental setup section will report (i) inter-annotator agreement (Cohen’s kappa) for the human baseline, (ii) the difficulty calibration procedure (pilot testing on 100 items with 10 participants to stratify by accuracy bands), and (iii) bootstrap confidence intervals plus paired t-tests for all model comparisons, including the 4% image-vs-text gap. These additions will quantify the reliability of the reported numbers.
    Revision: yes
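
A bootstrap confidence interval of the kind proposed in response 3 could look roughly like the following. This is a hedged sketch of a generic paired percentile bootstrap, not the authors' procedure: the function name, the default of 2,000 resamples, the 95% level, and the toy data are all assumptions. Questions are resampled as pairs so that each resample keeps the link between a model's image answer and its description answer on the same item.

```python
import random


def paired_bootstrap_ci(image_correct, text_correct, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy gap (text minus image), in points."""
    if len(image_correct) != len(text_correct) or not image_correct:
        raise ValueError("need two equal-length, non-empty result lists")
    n = len(image_correct)
    # Per-question difference: +1 if only the text run was right,
    # -1 if only the image run was right, 0 otherwise.
    diffs = [int(t) - int(i) for i, t in zip(image_correct, text_correct)]
    observed = 100.0 * sum(diffs) / n
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gaps = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample questions with replacement
        gaps.append(100.0 * sum(sample) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Toy data only: 100 questions, 40 solved from images, 44 from descriptions.
image_runs = [i < 40 for i in range(100)]
text_runs = [i < 44 for i in range(100)]
gap, lower, upper = paired_bootstrap_ci(image_runs, text_runs)
print(f"gap = {gap:.1f} points, 95% CI = [{lower:.1f}, {upper:.1f}]")
```

If the interval for a given model excludes zero, the gap is unlikely to be resampling noise alone. With a gap as small as four points, whether that holds would depend heavily on how many questions the ablation actually covers.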

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation on collected dataset

Full rationale

The paper reports performance measurements and an ablation on a newly collected set of 5,000 images. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the derivation of the central claims. The ~4% text-vs-image gap and overall scores are direct empirical observations on the provided data, not quantities forced by construction or renamed from prior fits. The manual collection step is an input to the evaluation rather than a self-referential loop. This is a standard benchmark paper with no load-bearing reductions to its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are introduced; the work rests on the domain assumption that the manually curated images validly test the targeted reasoning skills.

axioms (1)
  • Domain assumption: Manually collected images across 10 categories accurately capture distinct cognitive challenges without selection bias.
    Invoked in the description of benchmark construction in the abstract.

pith-pipeline@v0.9.0 · 5840 in / 1177 out tokens · 40830 ms · 2026-05-23T20:01:27.942914+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

    cs.CL · 2026-03 · unverdicted · novelty 6.0

    Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.

  2. ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

    cs.CV · 2025-10 · unverdicted · novelty 6.0

    ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 2 Pith papers · 26 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Mathqa: Towards interpretable math word problem solving with operation-based formalisms

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pp. 2357– 2367,

  3. [3]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390,

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966,

  5. [5]

    Visit-bench: A benchmark for vision-language instruction following inspired by real-world use.arXiv preprint arXiv:2308.06595, 2023

    Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schimdt. VisIT-Bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595,

  6. [6]

    Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images

    Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images. arXiv preprint arXiv:2303.07274,

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  8. [8]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712,

  9. [9]

    MapQA: A dataset for question answering on choropleth maps

    Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. MapQA: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545,

  10. [10]

    Videollm: Modeling video sequence with large language models

    11 Preprint Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023a. Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question...

  11. [11]

    CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning

    Adam Dahlgren Lindström and Savitha Sam Abraham. CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning. In 16th International Workshop on Neural-Symbolic Learning and Reasoning, NeSy 2022, Windsor, UK, september 28-30, 2022., volume

  12. [12]

    Large language model for science: A study on P vs

    Qingxiu Dong, Li Dong, Ke Xu, Guangyan Zhou, Yaru Hao, Zhifang Sui, and Furu Wei. Large language model for science: A study on P vs. NP. arXiv preprint arXiv:2309.05689,

  13. [13]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420,

  14. [14]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023a. Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhe...

  15. [15]

    Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following

    Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615,

  16. [16]

    Imagebind-llm: Multi-modality instruction tun- ing

    Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905,

  17. [17]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322,

  18. [18]

    Abstract visual reasoning with tangram shapes

    Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen V ong, Robert D Hawkins, and Yoav Artzi. Abstract visual reasoning with tangram shapes. arXiv preprint arXiv:2211.16492,

  19. [19]

    13 Preprint Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie- Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon An...

  20. [20]

    FigureQA: An Annotated Figure Dataset for Visual Reasoning

    Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300,

  21. [21]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14 , pp. 235–251. Springer,

  22. [22]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything.arXiv preprint arXiv:2304.02643,

  23. [23]

    Mimic-it: Multi-modal in-context instruction tuning,

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a. Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Bench- marking multimodal llms with generative comprehension. ArXiv, abs/2307...

  24. [24]

    Microsoft COCO: Common objects in context

    14 Preprint Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer,

  25. [25]

    SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575,

  26. [26]

    MatCha: Enhancing visual language pretrain- ing with math reasoning and chart derendering

    Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. MatCha: Enhancing visual language pretrain- ing with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662,

  27. [27]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning,...

  28. [28]

    AgentBench: Evaluating LLMs as Agents

    URL https:// llava-vl.github.io/blog/2024-01-30-llava-next/ . Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023c. Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wan...

  29. [29]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. ArXiv, abs/2310.02255, 2023a. Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of de...

  30. [30]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    15 Preprint Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the Association for Computational Linguistics: ACL 2022, pp. 2263–2279,

  31. [31]

    UniChart: A universal vision-language pretrained model for chart comprehension and reasoning

    Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761,

  32. [32]

    LILA: A unified benchmark for mathematical reasoning

    Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. LILA: A unified benchmark for mathematical reasoning. In The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP),

  33. [33]

    Capabilities of GPT-4 on Medical Challenge Problems

    Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375,

  34. [34]

    Solving General Arithmetic Word Problems

    URL https://api.semanticscholar.org/CorpusID: 231591445. Subhro Roy and Dan Roth. Solving general arithmetic word problems. ArXiv, abs/1608.01413,

  35. [35]

    Solving geometry problems: Combining text and diagram interpretation

    16 Preprint Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 conference on empirical methods in natural language processing, pp. 1466–1476,

  36. [36]

    Tiny lvlm-ehub: Early multimodal experiments with bard

    Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, et al. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729,

  37. [37]

    PandaGPT: One Model To Instruction-Follow Them All

    Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355,

  38. [38]

    SciEval: A multi-level large language model evaluation benchmark for scientific research

    Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. SciEval: A multi-level large language model evaluation benchmark for scientific research. arXiv preprint arXiv:2308.13149,

  39. [39]

    Large language model (LLM) as a system of multiple expert agents: An approach to solve the abstraction and reasoning corpus (ARC) challenge

    John Chong Min Tan and Mehul Motani. Large language model (LLM) as a system of multiple expert agents: An approach to solve the abstraction and reasoning corpus (ARC) challenge. arXiv preprint arXiv:2310.05146,

  40. [40]

    Galactica: A Large Language Model for Science

    Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085,

  41. [41]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  42. [42]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...

  43. [43]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in LLMs for enhanced mathematical reasoning. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=z8TW0ttBPp. Peng Wang, Shuai Bai, Sin...

  44. [44]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits ...

  45. [45]

    LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023a. Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering large language mod...

  46. [46]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023a. Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mp...

  47. [47]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490,

  48. [48]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  49. [49]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a. Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Qiao Yu. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init a...

  50. [50]

    Lost in translation: When GPT-4V(ision) can’t see eye to eye with text

    Xiang Zhang, Senyu Li, Zijun Wu, and Ning Shi. Lost in translation: When GPT-4V(ision) can’t see eye to eye with text. A vision-language-consistency analysis of VLLMs and beyond. arXiv preprint arXiv:2310.12520, 2023f. Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xi...

  51. [51]

    Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification

    Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921,

  52. [52]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023a. Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimod...

  53. [53]

    word problems with limited scope. Subsequent efforts, including MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021a), and others (Zhou et al., 2023; Yue et al., 2023b; Wang et al., 2024a; Gao et al., 2023a; Luo et al., 2023), expanded the range and quality of textual mathematical problems, establishing robust benchmar...

  54. [54]

    While these benchmarks assess text-only mathematical reasoning, the rapid progress of MLLMs necessitates high-quality benchmarks for evaluating visual mathematical problem-solving

    provide limited coverage of rigorous scientific domains crucial for general-purpose AI assistants. While these benchmarks assess text-only mathematical reasoning, the rapid progress of MLLMs necessitates high-quality benchmarks for evaluating visual mathematical problem-solving. Prior attempts like GeoQA (Chen et al., 2021a), while MathVista (...

  55. [55]

    and large vision models (Radford et al., 2021; Kirillov et al., 2023; Zhang et al., 2023d;c;e), have become increasingly prominent. They extend LLMs to diverse tasks and modalities, including 2D images (Li et al., 2022; Dai et al., 2023; Alayrac et al., 2022; Li et al., 2023a), 3D point clouds (Guo et al., 2023; Xu et al., 2023b; Hong et al., 2024), audio...

  56. [56]

    However, their closed-source nature hinders broader application and development of MLLMs

    exhibit exceptional visual reasoning capabilities, setting new benchmarks in multi-modal performance. However, their closed-source nature hinders broader application and development of MLLMs. Concurrently, open-source MLLMs like LLaMA-Adapter (Zhang et al., 2024; Gao et al., 2023b), LLaVA (Liu et al., 2023b; 2024; 2023a), MiniGPT-4 (Zhu et al., 2023a; ...

  57. [57]

    for image encoding and LLaMA (Touvron et al., 2023a) for multi-modal instruction tuning, advancing MLLMs’ visual understanding and generalization. Despite comprehensive benchmarks (Fu et al., 2023a; Liu et al., 2023d; Li et al., 2023b; Xu et al., 2023a) for general visual instruction-following scenarios, the specific potential of MLLMs for visual mathemat...

  58. [58]

    evaluate LMMs’ general visual question answering abilities on open-ended image queries. Additionally, works have assessed LMMs’ specific skills beyond natural scenes, such as abstract shapes (Antol et al., 2015; Lu et al., 2021b; Ji et al., 2022), geometry diagrams (Seo et al., 2015; Lu et al., 2021a; Chen et al., 2022a; Cao & Xiao, 2022), charts (Methani...

  59. [59]

    leverage paired (Schuhmann et al., 2022; Sharma et al., 2018; Lin et al.,

  60. [60]

    Additionally, specialized versions like LLaVAR (Zhang et al., 2023h; Ye et al., 2023a) emphasize document understanding and math comprehension

    and interleaved (Zhu et al., 2023b) image-text data. Additionally, specialized versions like LLaVAR (Zhang et al., 2023h; Ye et al., 2023a) emphasize document understanding and math comprehension. Recent works like VisIT-Bench (Bitton et al., 2023), LVLM-eHub (Yu et al., 2023), MMBench (Liu et al., 2023d; Xu et al., 2023a; Shao et al.,

  61. [61]

    assess these models’ instruction-following and reasoning capabilities. Large language models (LLMs) have demonstrated remarkable reasoning abilities, further enhanced by approaches like chain-of-thought (CoT) (Wei et al., 2022b), program-of-thought (PoT) (Chen et al., 2022b), and inductive reasoning (Wang et al., 2023a; Tan & Motani, 2023). The feasibilit...