Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

Chitta Baral; Himanshu Gupta; Kevin Scaria; Mihir Parmar; Shreyas Verma; Swaroop Mishra; Ujjwala Anantheswaran

arxiv: 2410.14702 · v2 · submitted 2024-10-06 · 💻 cs.AI · cs.CL


Pith reviewed 2026-05-23 20:01 UTC · model grok-4.3


keywords: multi-modal large language models, mathematical reasoning, visual comprehension, spatial reasoning, benchmark evaluation, pattern recognition, abstract reasoning, MLLM performance


The pith

Multi-modal models gain only about 4 percent when diagrams are replaced by text descriptions on a new math reasoning benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PolyMATH, a benchmark of 5,000 manually collected images spanning ten categories of cognitive challenges including pattern recognition, spatial reasoning, and relative reasoning. Comprehensive tests of fifteen MLLMs across four prompting methods produce top scores of roughly 41 percent for Claude-3.5 Sonnet, 36 percent for GPT-4o, and 27 percent for Gemini-1.5 Pro. An ablation replacing images with textual descriptions yields only a four percent average lift, which the authors interpret as evidence that models fail to extract spatial information from diagrams and therefore commit logical errors in extended reasoning. OpenAI o1 models reach performance levels comparable to the human baseline, underscoring the benchmark's difficulty.

Core claim

The central claim is that MLLMs do not truly comprehend visual diagrams and the spatial information they contain. This is shown by the low overall scores on PolyMATH and by the ablation result that textual descriptions produce only marginal gains over the actual images, leaving models prone to logical errors on tasks requiring drawn-out high-level reasoning.
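
The ablation logic can be made concrete with a small sketch. It assumes per-question correctness flags for the same questions, answered once from the image and once from a textual description. The toy data and the names `accuracy` and `modality_gap` are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Return accuracy in percent for per-question correctness flags."""
    if not correct:
        raise ValueError("need at least one question")
    return 100.0 * sum(correct) / len(correct)


def modality_gap(image_correct: Sequence[bool], text_correct: Sequence[bool]) -> float:
    """Accuracy with text descriptions minus accuracy with images, in percentage points."""
    if len(image_correct) != len(text_correct):
        raise ValueError("both conditions must cover the same questions")
    return accuracy(text_correct) - accuracy(image_correct)


# Hypothetical toy data: 10 questions, answered under both conditions.
image_correct = [True, False, False, True, False, False, False, True, False, False]
text_correct = [True, False, True, True, False, False, False, True, False, False]

gap = modality_gap(image_correct, text_correct)
print(f"text-minus-image gap: {gap:.1f} percentage points")
```

A small gap on its own does not say why the two conditions score alike; that interpretation depends on how faithful the textual descriptions are to the diagrams.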

What carries the argument

The PolyMATH benchmark itself, built from 5,000 high-quality images across ten distinct categories of textual and visual cognitive challenges.

If this is right

Current MLLMs remain limited in handling spatial relations within diagrams even when prompted with Chain-of-Thought or Step-Back strategies.
Models are prone to logical errors once tasks require integrating visual details over multiple steps.
The benchmark can serve as a diagnostic tool to measure progress in visual abstraction beyond current training regimes.
That OpenAI o1 models only match the human baseline, rather than exceed it, suggests that scaling alone has not closed the gap on these reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training corpora may contain too few examples of complex diagram-based spatial relations, limiting what models can internalize from image-text pairs.
Targeted pretraining on diagram parsing or synthetic spatial puzzles could be tested as a direct remedy for the observed weaknesses.
Extending the benchmark with dynamic or interactive diagrams would reveal whether the current static-image limitation is fundamental.

Load-bearing premise

The 5,000 manually collected images form a fair, unbiased, and appropriately difficult sample of the visual and cognitive challenges that matter for mathematical reasoning.

What would settle it

A new model achieving above 70 percent accuracy on the full PolyMATH set or showing more than 15 percent improvement when given textual descriptions instead of the original images would undermine the claim of weak visual diagram comprehension.
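
The two thresholds above can be written as a simple check. This is a minimal sketch: the function name `undermines_claim` is hypothetical, and reading "15 percent improvement" as 15 percentage points is an assumption.

```python
def undermines_claim(
    full_set_accuracy: float,
    text_minus_image_gain: float,
    accuracy_threshold: float = 70.0,
    gain_threshold: float = 15.0,
) -> bool:
    """Return True if either falsification condition stated above is met.

    full_set_accuracy: accuracy in percent on the full benchmark.
    text_minus_image_gain: gain in percentage points when textual
        descriptions replace the original images (assumed interpretation).
    """
    return (
        full_set_accuracy > accuracy_threshold
        or text_minus_image_gain > gain_threshold
    )


# Values reported in the abstract for the best model: about 41% and about 4%.
print(undermines_claim(41.0, 4.0))
```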

Figures

Figures reproduced from arXiv: 2410.14702 by Chitta Baral, Himanshu Gupta, Kevin Scaria, Mihir Parmar, Shreyas Verma, Swaroop Mishra, Ujjwala Anantheswaran.

**Figure 2.** An overview of PolyMATH's distribution and difficulty: (a) the per-category split of the 5,000 questions in the dataset, along with the split between with-diagram (WD) and without-diagram (WoD) questions for each category; (b) the per-category performance of various MLLMs.

**Figure 3.** Examples of with-diagram and without-diagram questions. In addition to the question image, PolyMATH includes the metadata shown above. Without-diagram questions are not present in test-img, while both kinds of questions are present in testmini.

**Figure 4.** Frequency of Logical Flaw (LF) and Spatial Misunderstanding (SM) errors across different …

**Figure 5.** Questions belonging to the figure_completion (FC) category

**Figure 6.** Questions belonging to the logical_reasoning (LR) category

**Figure 7.** Questions belonging to the mathematical_reasoning (MR) category

**Figure 8.** Questions belonging to the numerical_reasoning (NR) category

**Figure 9.** Questions belonging to the odd_one_out (OD) category

**Figure 10.** Questions belonging to the pattern_recognition (PR) category

**Figure 11.** Questions belonging to the perspective_shift (PS) category

**Figure 12.** Questions belonging to the relative_reasoning (RR) category

**Figure 13.** Questions belonging to the sequence_completion (SC) category

**Figure 14.** Questions belonging to the spatial_reasoning (SR) category

**Figure 15.** Erroneous model reasoning patterns observed on an FC question

**Figure 16.** Erroneous model reasoning patterns observed on an LR question

**Figure 17.** Erroneous model reasoning patterns observed on an MR question

**Figure 18.** Erroneous model reasoning patterns observed on an NR question

**Figure 19.** Erroneous model reasoning patterns observed on an OD question

**Figure 20.** Erroneous model reasoning patterns observed on a PR question

**Figure 21.** Erroneous model reasoning patterns observed on a PS question

**Figure 22.** Erroneous model reasoning patterns observed on an RR question

**Figure 23.** Erroneous model reasoning patterns observed on an SC question

**Figure 24.** Erroneous model reasoning patterns observed on an SR question

Original abstract

Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities in various domains, but their visual comprehension and abstract reasoning skills remain under-evaluated. To this end, we present PolyMATH, a challenging benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. PolyMATH comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning. We conducted a comprehensive, and quantitative evaluation of 15 MLLMs using four diverse prompting strategies, including Chain-of-Thought and Step-Back. The best scores achieved on PolyMATH are ~41%, ~36%, and ~27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively - highlighting the logical and visual complexity of these questions. A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning. This is further strengthened by our ablation study estimating MLLM performance when given textual descriptions in place of diagrams. As evidenced by ~4% improvement over textual descriptions as opposed to actual images, we discover that models do not truly comprehend visual diagrams and the spatial information therein, and are thus prone to logical errors. Finally, we evaluate the OpenAI o1 models and find that their performance only matches the human baseline, highlighting the difficulty of the benchmark. The results on PolyMATH highlight the room for improvement in multi-modal reasoning and provide unique insights to guide the development of future MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PolyMATH is a new 5000-image visual math benchmark that shows low MLLM scores and a small image-vs-text gap, but the manual collection lacks enough documentation to firmly support the comprehension claims.


PolyMATH puts together 5,000 manually collected images in 10 categories covering pattern recognition, spatial reasoning, and similar tasks, then runs 15 MLLMs through four prompting setups. The top result is roughly 41% for Claude-3.5 Sonnet, with GPT-4o at 36% and Gemini-1.5 Pro at 27%; o1 models reach only the human baseline. The ablation replacing images with text descriptions produces just a 4% lift, which the authors read as evidence that models fail to extract spatial information from diagrams. Error analysis points to trouble with spatial relations and extended reasoning chains.

This is new work because it supplies a larger, category-structured multi-modal math set and pairs it with the image-to-text ablation. The evaluation across multiple models and prompts is straightforward and useful for tracking progress. The dataset scale and the concrete error categories give it some practical value for people building or testing MLLMs.

The main weakness is the dataset construction. The abstract says the images were manually collected but gives no sourcing protocol, inclusion rules, diversity checks, or verification steps. Without those, the low absolute scores and the tiny ablation delta could reflect selection of unusually difficult cases rather than a general failure to read diagrams. The absence of inter-annotator agreement numbers or difficulty calibration also leaves the quantitative claims thinner than they need to be. The stress-test concern about bias therefore stands on the information provided.

This paper is aimed at groups working on multi-modal evaluation and model improvement. It is worth sending to peer review so the construction details can be examined and the dataset can be stress-tested by others before it is widely adopted.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces PolyMATH, a benchmark consisting of 5,000 manually collected images across 10 categories of cognitive challenges (pattern recognition, spatial reasoning, relative reasoning, etc.) to evaluate MLLMs on multi-modal mathematical reasoning. It reports evaluations of 15 MLLMs under four prompting strategies, with top scores of ~41% (Claude-3.5 Sonnet), ~36% (GPT-4o), and ~27% (Gemini-1.5 Pro); an ablation replacing images with textual descriptions yields only ~4% improvement, supporting the claim that models fail to comprehend visual diagrams and spatial information; o1 models match but do not exceed a human baseline.

Significance. If the 5,000-image sample is representative and the ablation comparison is well-controlled, the results would establish a valuable, high-difficulty benchmark exposing clear gaps in current MLLMs' spatial and multi-step visual reasoning, with the error analysis and human/o1 comparisons providing concrete guidance for future work. The scale and category coverage are strengths.

major comments (3)

[Benchmark construction] Benchmark construction (abstract and § on data collection): The images are described only as 'manually collected' across 10 categories with no sourcing protocol, inclusion/exclusion criteria, diversity metrics, or verification steps against selection bias. This is load-bearing for the central claim, because both the headline ~41% ceiling and the ~4% image-vs-text gap could arise from non-representative sampling of unusually difficult cases rather than a fundamental deficit in diagram comprehension.
[Ablation study] Ablation study (results section): The ~4% performance lift when substituting textual descriptions for images is used to conclude that 'models do not truly comprehend visual diagrams,' yet the manuscript supplies no information on how the textual descriptions were generated, their fidelity to the original diagrams, or any validation that they preserve all spatial relations. Without these controls the ablation cannot isolate visual comprehension.
[Experimental setup] Evaluation protocol (experimental setup): No inter-annotator agreement, difficulty calibration procedure, or statistical tests (e.g., confidence intervals or significance of the 4% gap) are reported for the quantitative scores or human baseline. These omissions undermine the reliability of the performance numbers that underpin the main conclusions.
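
To illustrate the kind of interval the third comment asks for, here is a minimal sketch of a paired percentile bootstrap over per-question outcomes. The percentile method is one common choice and is an assumption here, not something the manuscript specifies; the function name, the toy data, and the default settings are likewise hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    image_correct: Sequence[bool],
    text_correct: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the text-minus-image accuracy gap.

    Questions are resampled with replacement, keeping each question's two
    outcomes paired. The result is in percentage points.
    """
    if len(image_correct) != len(text_correct) or not image_correct:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(image_correct)
    # Per-question difference: +1 if only text is correct, -1 if only image is.
    diffs = [int(t) - int(i) for i, t in zip(image_correct, text_correct)]
    gaps = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(100.0 * sum(sample) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical toy data: 10 paired outcomes.
image_correct = [True, False, False, True, False, False, False, True, False, False]
text_correct = [True, False, True, True, False, False, False, True, False, False]
low, high = paired_bootstrap_ci(image_correct, text_correct)
print(f"95% CI for the gap: [{low:.1f}, {high:.1f}] percentage points")
```

If such an interval includes zero, the reported gap could not be distinguished from no difference at the chosen level.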

minor comments (1)

[Abstract] The abstract lists 'four diverse prompting strategies' but only names Chain-of-Thought and Step-Back; the remaining two should be enumerated for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional methodological details would strengthen the paper. We address each major comment below and will incorporate the suggested clarifications and controls in a revised manuscript.


Referee: [Benchmark construction] The images are described only as 'manually collected' across 10 categories with no sourcing protocol, inclusion/exclusion criteria, diversity metrics, or verification steps against selection bias. This is load-bearing for the central claim, because both the headline ~41% ceiling and the ~4% image-vs-text gap could arise from non-representative sampling of unusually difficult cases rather than a fundamental deficit in diagram comprehension.

Authors: We agree that the current description of data collection is insufficiently detailed. In the revision we will add a dedicated subsection describing the sourcing (publicly available cognitive puzzle repositories and textbooks), explicit inclusion criteria (clear diagrams, unambiguous ground-truth answers, coverage of the 10 categories), diversity metrics (balanced distribution across categories and difficulty levels as measured by pilot human accuracy), and verification (two independent annotators reviewed each item for bias and clarity, with disagreements resolved by discussion). These additions will directly support the representativeness of the 5,000-image sample. revision: yes
Referee: [Ablation study] The ~4% performance lift when substituting textual descriptions for images is used to conclude that 'models do not truly comprehend visual diagrams,' yet the manuscript supplies no information on how the textual descriptions were generated, their fidelity to the original diagrams, or any validation that they preserve all spatial relations. Without these controls the ablation cannot isolate visual comprehension.

Authors: We acknowledge the omission of generation and validation details for the textual descriptions. The revision will specify that descriptions were produced by human experts following a template that required explicit enumeration of all spatial relations, object positions, and visual elements while omitting the solution; a separate validation set of 200 items was rated by two additional annotators for completeness (mean fidelity score 4.7/5), with inter-annotator agreement reported. These controls will be added to the ablation section to better isolate the contribution of visual input. revision: yes
Referee: [Experimental setup] No inter-annotator agreement, difficulty calibration procedure, or statistical tests (e.g., confidence intervals or significance of the 4% gap) are reported for the quantitative scores or human baseline. These omissions undermine the reliability of the performance numbers that underpin the main conclusions.

Authors: We agree these statistical and reliability details are missing. The revised experimental setup section will report (i) inter-annotator agreement (Cohen’s kappa) for the human baseline, (ii) the difficulty calibration procedure (pilot testing on 100 items with 10 participants to stratify by accuracy bands), and (iii) bootstrap confidence intervals plus paired t-tests for all model comparisons, including the 4% image-vs-text gap. These additions will quantify the reliability of the reported numbers. revision: yes
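
For reference, the agreement statistic named in the last response can be computed as follows. This is a sketch of the standard two-rater Cohen's kappa formula, (p_o - p_e) / (1 - p_e); the labels and the function name are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined
        # here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels from two annotators on four items.
rater_a = ["A", "A", "B", "B"]
rater_b = ["A", "B", "B", "B"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```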

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation on collected dataset


The paper reports performance measurements and an ablation on a newly collected set of 5,000 images. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the derivation of the central claims. The ~4% text-vs-image gap and overall scores are direct empirical observations on the provided data, not quantities forced by construction or renamed from prior fits. The manual collection step is an input to the evaluation rather than a self-referential loop. This is a standard benchmark paper with no load-bearing reductions to its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are introduced; the work rests on the domain assumption that the manually curated images validly test the targeted reasoning skills.

axioms (1)

Domain assumption: Manually collected images across 10 categories accurately capture distinct cognitive challenges without selection bias.
Invoked in the description of benchmark construction in the abstract.

pith-pipeline@v0.9.0 · 5840 in / 1177 out tokens · 40830 ms · 2026-05-23T20:01:27.942914+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning
cs.CL · 2026-03 · unverdicted · novelty 6.0

Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
cs.CV · 2025-10 · unverdicted · novelty 6.0

ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 2 Pith papers · 26 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Mathqa: Towards interpretable math word problem solving with operation-based formalisms

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pp. 2357– 2367,

work page 2019
[3]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Visit-bench: A benchmark for vision-language instruction following inspired by real-world use.arXiv preprint arXiv:2308.06595, 2023

Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schimdt. VisIT-Bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595,

work page arXiv
[6]

Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images

Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images. arXiv preprint arXiv:2303.07274,

work page arXiv
[7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

work page 1901
[8]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

MapQA: A dataset for question answering on choropleth maps

Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. MapQA: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545,

work page arXiv
[10]

Videollm: Modeling video sequence with large language models

11 Preprint Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023a. Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question...

work page arXiv 2022
[11]

CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning

Adam Dahlgren Lindström and Savitha Sam Abraham. CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning. In 16th International Workshop on Neural-Symbolic Learning and Reasoning, NeSy 2022, Windsor, UK, september 28-30, 2022., volume

work page 2022
[12]

Large language model for science: A study on P vs

Qingxiu Dong, Li Dong, Ke Xu, Guangyan Zhou, Yaru Hao, Zhifang Sui, and Furu Wei. Large language model for science: A study on P vs. NP. arXiv preprint arXiv:2309.05689,

work page arXiv
[13]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023a. Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhe...

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615,

work page arXiv
[16]

Imagebind-llm: Multi-modality instruction tun- ing

Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905,

work page arXiv
[17]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322,

work page arXiv
[18]

Abstract visual reasoning with tangram shapes

Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen V ong, Robert D Hawkins, and Yoav Artzi. Abstract visual reasoning with tangram shapes. arXiv preprint arXiv:2211.16492,

work page arXiv
[19]

13 Preprint Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie- Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon An...

work page internal anchor Pith review Pith/arXiv arXiv
[20]

FigureQA: An Annotated Figure Dataset for Visual Reasoning

Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14 , pp. 235–251. Springer,

work page 2016
[22]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything.arXiv preprint arXiv:2304.02643,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Mimic-it: Multi-modal in-context instruction tuning,

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a. Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Bench- marking multimodal llms with generative comprehension. ArXiv, abs/2307...

work page arXiv
[24]

Microsoft COCO: Common objects in context

14 Preprint Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer,

work page 2014
[25]

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

MatCha: Enhancing visual language pretrain- ing with math reasoning and chart derendering

Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. MatCha: Enhancing visual language pretrain- ing with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662,

work page arXiv
[27]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning,...

work page internal anchor Pith review Pith/arXiv arXiv
[28]

AgentBench: Evaluating LLMs as Agents

URL https:// llava-vl.github.io/blog/2024-01-30-llava-next/ . Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023c. Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wan...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. ArXiv, abs/2310.02255, 2023a. Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of de...

work page internal anchor Pith review Pith/arXiv arXiv
[30]

ChartQA: A benchmark for question answering about charts with visual and logical reasoning

Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 2263–2279,
[31]

UniChart: A universal vision-language pretrained model for chart comprehension and reasoning

Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761,

[32]

LILA: A unified benchmark for mathematical reasoning

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. LILA: A unified benchmark for mathematical reasoning. In The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP),

[33]

Capabilities of GPT-4 on Medical Challenge Problems

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375,

[34]

Solving General Arithmetic Word Problems

URL https://api.semanticscholar.org/CorpusID:231591445. Subhro Roy and Dan Roth. Solving general arithmetic word problems. ArXiv, abs/1608.01413,
[35]

Solving geometry problems: Combining text and diagram interpretation

Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 conference on empirical methods in natural language processing, pp. 1466–1476,
[36]

Tiny lvlm-ehub: Early multimodal experiments with bard

Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, et al. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729,

[37]

PandaGPT: One Model To Instruction-Follow Them All

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355,

[38]

SciEval: A multi-level large language model evaluation benchmark for scientific research

Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. SciEval: A multi-level large language model evaluation benchmark for scientific research. arXiv preprint arXiv:2308.13149,

[39]

Large language model (llm) as a system of multiple expert agents: An approach to solve the abstraction and reasoning corpus (arc) challenge

John Chong Min Tan and Mehul Motani. Large language model (llm) as a system of multiple expert agents: An approach to solve the abstraction and reasoning corpus (arc) challenge. arXiv preprint arXiv:2310.05146,

[40]

Galactica: A Large Language Model for Science

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085,

[41]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

[42]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...

[43]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in LLMs for enhanced mathematical reasoning. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=z8TW0ttBPp. Peng Wang, Shuai Bai, Sin...

[44]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits ...

[45]

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023a. Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language mod...

[46]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPlug-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023a. Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mp...

[47]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490,

[48]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

[49]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a. Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Qiao Yu. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init a...
[50]

Lost in translation: When GPT-4V(ision) can’t see eye to eye with text

URL https://openreview.net/forum?id=d4UiXAHN2W. Xiang Zhang, Senyu Li, Zijun Wu, and Ning Shi. Lost in translation: When gpt-4v(ision) can’t see eye to eye with text. a vision-language-consistency analysis of vllms and beyond. arXiv preprint arXiv:2310.12520, 2023f. Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xi...
[51]

Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification

Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921,

[52]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023a. Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimod...
[53]

word problems with limited scope. Subsequent efforts, including MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021a), and others (Zhou et al., 2023; Yue et al., 2023b; Wang et al., 2024a; Gao et al., 2023a; Luo et al., 2023), expanded the range and quality of textual mathematical problems, establishing robust benchmar...

[54]

provide limited coverage of rigorous scientific domains crucial for general-purpose AI assistants. While these benchmarks assess text-only mathematical reasoning, the rapid progress of MLLMs necessitates high-quality benchmarks for evaluating visual mathematical problem-solving. Prior attempts like GeoQA (Chen et al., 2021a), while MathVista (...
[55]

and large vision models (Radford et al., 2021; Kirillov et al., 2023; Zhang et al., 2023d;c;e), have become increasingly prominent. They extend LLMs to diverse tasks and modalities, including 2D images (Li et al., 2022; Dai et al., 2023; Alayrac et al., 2022; Li et al., 2023a), 3D point clouds (Guo et al., 2023; Xu et al., 2023b; Hong et al., 2024), audio...

[56]

exhibit exceptional visual reasoning capabilities, setting new benchmarks in multi-modal performance. However, their closed-source nature hinders broader application and development of MLLMs. Concurrently, open-source MLLMs like LLaMA-Adapter (Zhang et al., 2024; Gao et al., 2023b), LLaVA (Liu et al., 2023b; 2024; 2023a), MiniGPT-4 (Zhu et al., 2023a; ...
[57]

for image encoding and LLaMA (Touvron et al., 2023a) for multi-modal instruction tuning, advancing MLLMs’ visual understanding and generalization. Despite comprehensive benchmarks (Fu et al., 2023a; Liu et al., 2023d; Li et al., 2023b; Xu et al., 2023a) for general visual instruction-following scenarios, the specific potential of MLLMs for visual mathemat...

[58]

evaluate LMMs’ general visual question answering abilities on open-ended image queries. Additionally, works have assessed LMMs’ specific skills beyond natural scenes, such as abstract shapes (Antol et al., 2015; Lu et al., 2021b; Ji et al., 2022), geometry diagrams (Seo et al., 2015; Lu et al., 2021a; Chen et al., 2022a; Cao & Xiao, 2022), charts (Methani...

[59]

leverage paired (Schuhmann et al., 2022; Sharma et al., 2018; Lin et al.,

[60]

and interleaved (Zhu et al., 2023b) image-text data. Additionally, specialized versions like LLaVAR (Zhang et al., 2023h; Ye et al., 2023a) emphasize document understanding and math comprehension. Recent works like Visit-Bench (Bitton et al., 2023), LVLM-eHub (Yu et al., 2023), MMBench (Liu et al., 2023d; Xu et al., 2023a; Shao et al.,
[61]

assess these models’ instruction-following and reasoning capabilities. Large language models (LLMs) have demonstrated remarkable reasoning abilities, further enhanced by approaches like chain-of-thought (CoT) (Wei et al., 2022b), program-of-thought (PoT) (Chen et al., 2022b), and inductive reasoning (Wang et al., 2023a; Tan & Motani, 2023). The feasibilit...


[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,


[2]

Mathqa: Towards interpretable math word problem solving with operation-based formalisms

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pp. 2357–2367,

[3]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390,


[4]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966,


[5]

VisIT-Bench: A benchmark for vision-language instruction following inspired by real-world use

Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. VisIT-Bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595,

[6]

Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images

Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images. arXiv preprint arXiv:2303.07274,


[7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,


[8]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712,


[9]

MapQA: A dataset for question answering on choropleth maps

Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. MapQA: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545,


[10]

Videollm: Modeling video sequence with large language models

Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023a. Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question...

[11]

CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning

Adam Dahlgren Lindström and Savitha Sam Abraham. CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning. In 16th International Workshop on Neural-Symbolic Learning and Reasoning, NeSy 2022, Windsor, UK, September 28-30, 2022, volume

[12]

Large language model for science: A study on P vs. NP

Qingxiu Dong, Li Dong, Ke Xu, Guangyan Zhou, Yaru Hao, Zhifang Sui, and Furu Wei. Large language model for science: A study on P vs. NP. arXiv preprint arXiv:2309.05689,

[13]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420,


[14]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023a. Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhe...


[15]

Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615,


[16]

Imagebind-llm: Multi-modality instruction tuning

Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905,


[17]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322,


[18]

Abstract visual reasoning with tangram shapes

Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen Vong, Robert D Hawkins, and Yoav Artzi. Abstract visual reasoning with tangram shapes. arXiv preprint arXiv:2211.16492,

[19]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon An...

[20]

FigureQA: An Annotated Figure Dataset for Visual Reasoning

Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300,


[21]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pp. 235–251. Springer,

[22]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643,

[23]

Mimic-it: Multi-modal in-context instruction tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a. Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv, abs/2307...

[24]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer,

[25]

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575,


[26] [26]

MatCha: Enhancing visual language pretrain- ing with math reasoning and chart derendering

Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. MatCha: Enhancing visual language pretrain- ing with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662,

work page arXiv

[27] [27]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning,...

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

AgentBench: Evaluating LLMs as Agents

URL https:// llava-vl.github.io/blog/2024-01-30-llava-next/ . Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023c. Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wan...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. ArXiv, abs/2310.02255, 2023a. Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of de...

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

ChartQA: A benchmark for question answering about charts with visual and logical reasoning

15 Preprint Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the Association for Computational Linguistics: ACL 2022, pp. 2263–2279,

work page 2022

[31] [31]

UniChart: A universal vision-language pretrained model for chart comprehension and reasoning

Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761,

work page arXiv

[32] [32]

LILA: A unified benchmark for mathematical reasoning

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. LILA: A unified benchmark for mathematical reasoning. In The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP),

work page 2022

[33] [33]

Capabilities of GPT-4 on Medical Challenge Problems

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375,

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Solving General Arithmetic Word Problems

URL https://api.semanticscholar.org/CorpusID: 231591445. Subhro Roy and Dan Roth. Solving general arithmetic word problems. ArXiv, abs/1608.01413,

work page internal anchor Pith review Pith/arXiv arXiv

[35] [35]

Solving geometry problems: Combining text and diagram interpretation

16 Preprint Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 conference on empirical methods in natural language processing, pp. 1466–1476,

work page 2015

[36] [36]

Tiny lvlm-ehub: Early multimodal experiments with bard

Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, et al. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729,

work page arXiv

[37] [37]

PandaGPT: One Model To Instruction-Follow Them All

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355,

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

SciEval: A multi-level large language model evaluation benchmark for scientific research

Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. SciEval: A multi-level large language model evaluation benchmark for scientific research. arXiv preprint arXiv:2308.13149,

work page arXiv

[39] [39]

Large language model (llm) as a system of multiple expert agents: An approach to solve the abstraction and reasoning corpus (arc) challenge

John Chong Min Tan and Mehul Motani. Large language model (llm) as a system of multiple expert agents: An approach to solve the abstraction and reasoning corpus (arc) challenge. arXiv preprint arXiv:2310.05146,

work page arXiv

[40] [40]

Galactica: A Large Language Model for Science

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085,

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in LLMs for enhanced mathematical reasoning. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=z8TW0ttBPp. Peng Wang, Shuai Bai, Sin...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[44] [44]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits ...

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023a. Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language mod...

work page arXiv

[46] [46]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPlug-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023a. Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mp...

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490,

work page internal anchor Pith review Pith/arXiv arXiv

[48] [48]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

work page internal anchor Pith review Pith/arXiv arXiv

[49] [49]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

18 Preprint Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a. Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Qiao Yu. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init a...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [50]

Xiang Zhang, Senyu Li, Zijun Wu, and Ning Shi

URL https://openreview.net/forum?id=d4UiXAHN2W. Xiang Zhang, Senyu Li, Zijun Wu, and Ning Shi. Lost in translation: When gpt-4v (ision) can’t see eye to eye with text. a vision-language-consistency analysis of vllms and beyond. arXiv preprint arXiv:2310.12520, 2023f. Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xi...

work page arXiv

[51] [51]

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-verification

Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921,

[52]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023a. Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimod...

[53]

word problems with limited scope. Subsequent efforts, including MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021a), and others (Zhou et al., 2023; Yue et al., 2023b; Wang et al., 2024a; Gao et al., 2023a; Luo et al., 2023), expanded the range and quality of textual mathematical problems, establishing robust benchmar...

[54]

While these benchmarks assess text-only mathematical reasoning, the rapid progress of MLLMs necessitates high-quality benchmarks for evaluating visual mathematical problem-solving

provide limited coverage of rigorous scientific domains crucial for general-purpose AI assistants. While these benchmarks assess text-only mathematical reasoning, the rapid progress of MLLMs necessitates high-quality benchmarks for evaluating visual mathematical problem-solving. Prior attempts like GeoQA (Chen et al., 2021a), while MathVista (...

[55]

and large vision models (Radford et al., 2021; Kirillov et al., 2023; Zhang et al., 2023d;c;e), have become increasingly prominent. They extend LLMs to diverse tasks and modalities, including 2D images (Li et al., 2022; Dai et al., 2023; Alayrac et al., 2022; Li et al., 2023a), 3D point clouds (Guo et al., 2023; Xu et al., 2023b; Hong et al., 2024), audio...

[56]

However, their closed-source nature hinders broader application and development of MLLMs

exhibit exceptional visual reasoning capabilities, setting new benchmarks in multi-modal performance. However, their closed-source nature hinders broader application and development of MLLMs. Concurrently, open-source MLLMs like LLaMA-Adapter (Zhang et al., 2024; Gao et al., 2023b), LLaVA (Liu et al., 2023b; 2024; 2023a), MiniGPT-4 (Zhu et al., 2023a; ...

[57]

for image encoding and LLaMA (Touvron et al., 2023a) for multi-modal instruction tuning, advancing MLLMs’ visual understanding and generalization. Despite comprehensive benchmarks (Fu et al., 2023a; Liu et al., 2023d; Li et al., 2023b; Xu et al., 2023a) for general visual instruction-following scenarios, the specific potential of MLLMs for visual mathemat...

[58]

evaluate LMMs’ general visual question answering abilities on open-ended image queries. Additionally, works have assessed LMMs’ specific skills beyond natural scenes, such as abstract shapes (Antol et al., 2015; Lu et al., 2021b; Ji et al., 2022), geometry diagrams (Seo et al., 2015; Lu et al., 2021a; Chen et al., 2022a; Cao & Xiao, 2022), charts (Methani...

[59]

leverage paired (Schuhmann et al., 2022; Sharma et al., 2018; Lin et al.,

[60]

Additionally, specialized versions like LLaVAR (Zhang et al., 2023h; Ye et al., 2023a) emphasize document understanding and math comprehension

and interleaved (Zhu et al., 2023b) image-text data. Additionally, specialized versions like LLaVAR (Zhang et al., 2023h; Ye et al., 2023a) emphasize document understanding and math comprehension. Recent works like Visit-Bench (Bitton et al., 2023), LVLM-eHub (Yu et al., 2023), MMBench (Liu et al., 2023d; Xu et al., 2023a; Shao et al.,

[61]

assess these models’ instruction-following and reasoning capabilities. Large language models (LLMs) have demonstrated remarkable reasoning abilities, further enhanced by approaches like chain-of-thought (CoT) (Wei et al., 2022b), program-of-thought (PoT) (Chen et al., 2022b), and inductive reasoning (Wang et al., 2023a; Tan & Motani, 2023). The feasibilit...
