arxiv: 2410.04509 · v3 · submitted 2024-10-06 · 💻 cs.CL

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Pith reviewed 2026-05-23 20:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal error detection · ErrorRadar · mathematical reasoning · MLLMs · benchmark · error step identification · error categorization · K-12 math problems

The pith

Multimodal large language models lag human experts by about 10 percent on detecting errors in K-12 math problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces multimodal error detection as a new task for assessing how well MLLMs can spot and categorize mistakes in mathematical reasoning that includes diagrams or other visuals. It presents ErrorRadar, a benchmark of 2500 real student problems with annotations for error steps and categories. Experiments show that even top models like GPT-4o fall short of human performance, highlighting gaps in complex reasoning capabilities. This matters because error detection could help improve AI tutoring systems and model training. The benchmark provides a standardized way to measure progress beyond just solving problems correctly.

Core claim

ErrorRadar formulates multimodal error detection with two sub-tasks—error step identification and error categorization—and provides 2500 annotated K-12 problems to benchmark MLLMs, revealing that the best model trails human evaluators by around 10%.

What carries the argument

ErrorRadar benchmark consisting of 2500 multimodal math problems from real student interactions, annotated for error step identification and error categorization.

Load-bearing premise

The 2500 collected problems with their annotations accurately represent the range of complex mathematical reasoning errors encountered in multimodal settings.

What would settle it

A new MLLM achieving error detection accuracy within 5% of human evaluators on both sub-tasks of the ErrorRadar benchmark would challenge the claim that significant challenges remain.

Figures

Figures reproduced from arXiv: 2410.04509 by Aoxiao Zhong, Boyan Li, Hang Li, Hui Xiong, Jiahao Huo, Jiamin Su, Kun Wang, Philip S. Yu, Qingsong Wen, Shen Wang, Tianlong Xu, Xiong Gao, Xuming Hu, Yibo Yan, Yi-Fan Zhang, Zhendong Chu.

Figure 1
Comparison of research scope between previous work and our proposed ERRORRADAR benchmark on mathematical reasoning tasks.
  Benchmarks | Venue | Modality | Student Ans. | Error Det.
  TheoremQA (Chen et al., 2023a) | EMNLP | T | - | -
  MathBench (Liu et al., 2024b) | ACL | T | - | -
  MR-GSM8K (Zeng et al., 2024) | arXiv | T | - | -
  SciEval (Sun et al., 2024) | AAAI | T | - | -
  EIC (Li et al., 2024b) | arXiv | T | - | ✓
  CMMaTH (Li et al., 2024c) | arXiv | T, I | -…
Figure 2
Example of our well-annotated multimodal mathematical reasoning dataset ERRORRADAR, and performance comparison on error categorization and error step localization tasks among representative MLLMs. It is evident that even simple math problems can be mishandled by the currently superior MLLMs in one or both tasks, highlighting the challenging nature of our proposed multimodal error detection setting. ❶ We t…
Figure 3
Roadmap of ERRORRADAR dataset collection, annotation, and consistent update
Figure 4
Dataset distribution of ERRORRADAR with respect to problem type and error category. …where $I(\cdot)$ is the indicator function, which returns 1 if the predicted step matches the ground truth, and 0 otherwise. Similarly, the accuracy for error categorization is: $\mathrm{Acc}_{cate} = \frac{1}{N} \sum_{i=1}^{N} I(C_{error,i} = G_{error,i})$. 2.2 Data Source & Annotation. Following the roadmap shown in
Figure 5
The distribution of CAL and non-CAL of closed-source and open-source MLLMs with top-3 CAL ACC. Finding #2: Weak open-source MLLMs tend to predict CAL category, leading to unusually high performance
Figure 6
The error category distribution of misjudged VIS cases of GPT-4o
Figure 9
The accuracy of STEP and CATE of two representative MLLM series: LLaVA-NEXT and InternVL2. We denote Tiny, Small, Middle, Large as the 2B, 8B, 26B, 76B for InternVL2 and None, 7B, 13B, 72B for LLaVA-NEXT, respectively. Finding #1: The performance of MLLMs on STEP task increases with the scale of parameters. We observe a phenomenon similar to the scaling law (Kaplan et al., 2020) in our experiments. As sho…
Figure 10
Category of bad cases where GPT-4o predicts visual perception errors incorrectly. … by evaluating their problem-solving levels, but they overlook tasks based on the student's perspective, such as error detection, and thus fail to comprehensively evaluate the more complex role of current MLLMs. Therefore, we propose the ERRORRADAR benchmark, which is entirely based on real student response data to evaluate …
Figure 11
Multimodal mathematical example one (type: counting) from ERRORRADAR dataset.
Figure 12
Multimodal mathematical example two (type: plane geometry) from ERRORRADAR dataset.
Figure 13
Multimodal mathematical example three (type: plane geometry) from ERRORRADAR dataset.
Figure 14
Multimodal mathematical example four (type: counting) from ERRORRADAR dataset.
Figure 15
Multimodal mathematical example five (type: plane geometry) from ERRORRADAR dataset.
Figure 16
Prompt for error step identification task.
Figure 17
Prompt for error categorization task.
Figure 18
Distribution of CAL and non-CAL category predictions of all MLLMs we evaluate.
Original abstract

As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to revolutionize artificial intelligence is particularly promising, especially in addressing mathematical reasoning tasks. Current mathematical benchmarks predominantly focus on evaluating MLLMs' problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate the new task: multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in such a task. ErrorRadar evaluates two sub-tasks: error step identification and error categorization, providing a comprehensive framework for evaluating MLLMs' complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with rigorous annotation and rich metadata such as problem type and error category. Through extensive experiments, we evaluated both open-source and closed-source representative MLLMs, benchmarking their performance against educational expert evaluators. Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
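The two sub-task scores can be read as plain per-item accuracy, following the indicator-function definition quoted in the Figure 4 excerpt. The sketch below is a minimal illustration under that reading, not the authors' evaluation code. The function name `accuracy`, the toy labels, and the `OTHER` category are assumptions made for the example; only `CAL` and `VIS` appear as category names in the excerpts above. The last line shows why a model that always answers `CAL` can still score well on a CAL-heavy label set, which is one way to read Finding #2.

```python
from typing import Sequence


def accuracy(predictions: Sequence, ground_truth: Sequence) -> float:
    """Fraction of items whose prediction equals the annotation (mean of the indicator)."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if len(ground_truth) == 0:
        raise ValueError("need at least one item")
    matches = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    return matches / len(ground_truth)


# Hypothetical toy data, not taken from ErrorRadar.
gold_steps = [2, 1, 3, 2]            # annotated index of the first wrong step
pred_steps = [2, 1, 1, 2]            # model output for the same four problems
gold_categories = ["CAL", "VIS", "CAL", "OTHER"]   # "OTHER" is a placeholder label
pred_categories = ["CAL", "CAL", "CAL", "OTHER"]

acc_step = accuracy(pred_steps, gold_steps)
acc_cate = accuracy(pred_categories, gold_categories)

# A degenerate predictor that always answers "CAL" scores the CAL share of the data.
always_cal = accuracy(["CAL"] * len(gold_categories), gold_categories)

print(f"Acc_step={acc_step:.2f}  Acc_cate={acc_cate:.2f}  always-CAL={always_cal:.2f}")
```

Reporting the always-one-category baseline next to each model's categorization accuracy makes it easier to tell genuine error understanding from label bias.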

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formulates the new task of multimodal error detection for mathematical reasoning and introduces ErrorRadar, the first benchmark for it. ErrorRadar contains 2,500 K-12 multimodal math problems collected from real-world student interactions in one educational organization; it defines two subtasks (error step identification and error categorization) and evaluates representative open- and closed-source MLLMs against human experts, reporting that GPT-4o achieves the highest scores but remains approximately 10% behind human performance.

Significance. If the benchmark construction and evaluation protocols are shown to be reliable and representative, the work would usefully shift evaluation focus from problem solving to error detection and supply a real-world-derived testbed with metadata. The explicit formulation of the new task and the collection of authentic student errors constitute clear strengths; the reported performance gap, once statistically grounded, would provide a concrete target for future MLLM development.

major comments (3)
  1. [Benchmark construction] Benchmark construction section: the manuscript asserts that the 2,500 problems were obtained via 'rigorous annotation' from a single educational organization yet supplies no inter-annotator agreement figures, annotation guidelines, selection criteria, or validation against external error distributions. This directly affects the central claim that ErrorRadar constitutes a valid and representative test of complex multimodal mathematical reasoning errors.
  2. [Evaluation and results] Evaluation and results section: the headline result that GPT-4o 'is still around 10% behind human evaluation' is presented without definitions of the exact metrics for each sub-task, without the procedure used to obtain human scores, and without statistical significance tests or confidence intervals. These omissions leave the performance comparison, which is load-bearing for the paper's conclusions, unverifiable.
  3. [Data description] Data description: no table or subsection reports the distribution of problem types, error categories, or curricular coverage, nor any comparison to established math-error taxonomies; without such information the representativeness argument cannot be assessed.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief statement of the precise metric definitions and the human-evaluation protocol.
  2. [Related work] Related-work section should cite prior single-modality error-detection benchmarks to clarify the incremental contribution of the multimodal setting.
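The confidence interval requested in major comment 2 could be obtained in several ways; one common option is a paired percentile bootstrap over problems. The sketch below assumes per-problem 0/1 correctness vectors for the human evaluators and for one model on the same items. The function name `bootstrap_gap_ci`, its parameters, and the synthetic data are all hypothetical and chosen only to illustrate the procedure; none of the numbers come from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_gap_ci(
    human_correct: List[int],
    model_correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Paired percentile bootstrap for (human accuracy - model accuracy).

    Returns (observed_gap, lower_bound, upper_bound).
    """
    if len(human_correct) != len(model_correct) or not human_correct:
        raise ValueError("need two non-empty 0/1 lists of equal length")
    n = len(human_correct)
    rng = random.Random(seed)
    observed = (sum(human_correct) - sum(model_correct)) / n

    gaps = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping human/model pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(human_correct[i] - model_correct[i] for i in idx) / n)

    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper


# Synthetic example: 200 problems, human right on 180, model right on 160.
human = [1] * 180 + [0] * 20
model = [1] * 150 + [0] * 30 + [1] * 10 + [0] * 10

gap, lo, hi = bootstrap_gap_ci(human, model)
print(f"gap={gap:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

If the interval excludes zero, the gap is unlikely to be a resampling artifact on that item set. It says nothing about whether the items themselves are representative, which is the separate concern raised in major comment 1.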

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and will make the necessary revisions to enhance the clarity and rigor of the paper.

  1. Referee: [Benchmark construction] Benchmark construction section: the manuscript asserts that the 2,500 problems were obtained via 'rigorous annotation' from a single educational organization yet supplies no inter-annotator agreement figures, annotation guidelines, selection criteria, or validation against external error distributions. This directly affects the central claim that ErrorRadar constitutes a valid and representative test of complex multimodal mathematical reasoning errors.

    Authors: We agree that additional details on the annotation process are necessary to substantiate the rigor of our benchmark. In the revised manuscript, we will expand the Benchmark Construction section to include inter-annotator agreement figures (such as Cohen's kappa), the annotation guidelines, selection criteria, and any validation steps against external error distributions. These will be provided in the main text or an appendix. revision: yes

  2. Referee: [Evaluation and results] Evaluation and results section: the headline result that GPT-4o 'is still around 10% behind human evaluation' is presented without definitions of the exact metrics for each sub-task, without the procedure used to obtain human scores, and without statistical significance tests or confidence intervals. These omissions leave the performance comparison, which is load-bearing for the paper's conclusions, unverifiable.

    Authors: We recognize the importance of clearly defining metrics and providing statistical analysis for the performance comparison. The revised paper will define the exact metrics for error step identification and error categorization, detail the human evaluation procedure (including the number of experts and their expertise), and include statistical significance tests along with confidence intervals to support the reported performance gap. revision: yes

  3. Referee: [Data description] Data description: no table or subsection reports the distribution of problem types, error categories, or curricular coverage, nor any comparison to established math-error taxonomies; without such information the representativeness argument cannot be assessed.

    Authors: We will add a dedicated subsection and tables in the Data Description section to report the distributions of problem types, error categories, and curricular coverage. Furthermore, we will include a comparison of our error categories to established math-error taxonomies from prior literature to better demonstrate the representativeness of the ErrorRadar benchmark. revision: yes
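The first response names Cohen's kappa as a possible inter-annotator agreement figure. As a rough illustration of what that statistic measures, the sketch below computes kappa for two annotators who labeled the same items. The function name `cohens_kappa` and the toy labels are hypothetical; the paper's actual annotation protocol and number of annotators are not described in this review.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both annotators.
    p_observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical error-category labels from two annotators on six problems.
annotator_a = ["CAL", "CAL", "VIS", "VIS", "CAL", "VIS"]
annotator_b = ["CAL", "CAL", "VIS", "CAL", "CAL", "VIS"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa={kappa:.3f}")
```

Here the annotators agree on five of six items (about 0.83 raw agreement), but kappa is lower because half of that agreement would be expected by chance given their label frequencies.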

Circularity Check

0 steps flagged

No circularity: empirical benchmark introduction with no derivations or self-referential reductions

The paper formulates a new task and presents ErrorRadar as a benchmark of 2,500 problems collected from student interactions, with evaluation of MLLMs. No equations, derivations, fitted parameters, or load-bearing self-citations appear in the abstract or described structure. The work is self-contained empirical evaluation against human experts; the representativeness claim is an external validity issue, not a circular reduction of any claimed derivation to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the assumption that the collected and annotated problems validly represent real student errors; no free parameters, new physical entities, or mathematical axioms are introduced.

axioms (1)
  • domain assumption: The 2,500 problems collected from real-world student interactions with rigorous annotation accurately capture complex mathematical reasoning errors in multimodal settings.
    This premise is required for the benchmark to serve as a meaningful test of MLLM capabilities.

pith-pipeline@v0.9.0 · 5790 in / 1302 out tokens · 28863 ms · 2026-05-23T20:08:17.961478+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

    cs.CL 2025-02 unverdicted novelty 2.0

    Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 1 Pith paper · 22 internal anchors

  1. [1]

    Complexity in declarative process models: Metrics and multi-modal assessment of cognitive load

    Amine Abbad-Andaloussi, Andrea Burattin, Tijs Slaats, Ekkart Kindler, and Barbara Weber. Complexity in declarative process models: Metrics and multi-modal assessment of cognitive load. Expert Systems with Applications, 233:120924, 2023

  2. [2]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

  3. [3]

    Scaling laws for generative mixed-modal language models

    Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pp. 265–279. PMLR, 2023

  4. [4]

    Large language models for mathematical reasoning: Progresses and challenges

    Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024

  5. [5]

    Claude 3

    Anthropic. Claude 3, 2024a. URL https://www.anthropic.com/news/claude-3-haiku

  6. [6]

    Claude 3.5

    Anthropic. Claude 3.5, 2024b. URL https://www.anthropic.com/news/claude-3-5-sonnet

  7. [7]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

  8. [8]

    Turning large language models into cognitive models

    Marcel Binz and Eric Schulz. Turning large language models into cognitive models. arXiv preprint arXiv:2306.03917, 2023

  9. [9]

    Theoremqa: A theorem-driven question answering dataset

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7889–7901, 2023a

  10. [10]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023b

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  12. [12]

    A survey on multimodal large language models for autonomous driving

    Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 958–979, 2024

  13. [13]

    Advancing mathematics by guiding human intuition with AI

    Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev, Richard Tanburn, Peter Battaglia, Charles Blundell, András Juhász, et al. Advancing mathematics by guiding human intuition with AI. Nature, 600(7887):70–74, 2021

  14. [14]

    Visual representations in the human brain are aligned with large language models

    Adrien Doerig, Tim C Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, and Ian Charest. Visual representations in the human brain are aligned with large language models. arXiv preprint arXiv:2209.11737, 2022

  15. [15]

    Muffin or chihuahua? challenging multimodal large language models with multipanel vqa

    Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Yang Zhao, Xinze Guan, and Xin Wang. Muffin or chihuahua? challenging multimodal large language models with multipanel vqa. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6845–6863, 2024

  16. [16]

    Trends in integration of knowledge and large language models: A survey and taxonomy of methods, benchmarks, and applications

    Zhangyin Feng, Weitao Ma, Weijiang Yu, Lei Huang, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. Trends in integration of knowledge and large language models: A survey and taxonomy of methods, benchmarks, and applications. arXiv preprint arXiv:2311.05876, 2023

  17. [17]

    Isobench: Benchmarking multimodal foundation models on isomorphic representations

    Deqing Fu, Ghazal Khalighinejad, Ollie Liu, Bhuwan Dhingra, Dani Yogatama, Robin Jia, and Willie Neiswanger. Isobench: Benchmarking multimodal foundation models on isomorphic representations. arXiv preprint arXiv:2404.01266, 2024

  18. [18]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024

  19. [19]

    Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction

    Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, and Yuxuan Liang. Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction. arXiv preprint arXiv:2403.16831, 2024

  20. [20]

    Pefomed: Parameter efficient fine-tuning on multimodal large language models for medical visual question answering

    Jinlong He, Pengfei Li, Gang Liu, Zixu Zhao, and Shenjun Zhong. Pefomed: Parameter efficient fine-tuning on multimodal large language models for medical visual question answering. arXiv preprint arXiv:2401.02797, 2024a

  21. [21]

    Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning

    Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, and Hua Huang. Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning. arXiv preprint arXiv:2401.14011, 2024b

  22. [22]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  23. [23]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024

  24. [24]

    Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model

    Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, and Xuming Hu. Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model. arXiv preprint arXiv:2406.11193, 2024

  25. [25]

    Describe-then-reason: Improving multimodal mathematical reasoning through visual comprehension training

    Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, and Meng Jiang. Describe-then-reason: Improving multimodal mathematical reasoning through visual comprehension training. arXiv preprint arXiv:2404.14604, 2024

  26. [26]

    New generation deep learning for video object detection: A survey

    Licheng Jiao, Ruohan Zhang, Fang Liu, Shuyuan Yang, Biao Hou, Lingling Li, and Xu Tang. New generation deep learning for video object detection: A survey. IEEE Transactions on Neural Networks and Learning Systems, 33(8):3195–3215, 2021

  27. [27]

    Learning instance-level representation for large-scale multi-modal pretraining in e-commerce

    Yang Jin, Yongzhi Li, Zehuan Yuan, and Yadong Mu. Learning instance-level representation for large-scale multi-modal pretraining in e-commerce. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11060–11069, 2023

  28. [28]

    Large language models struggle to learn long-tail knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, pp. 15696–15707. PMLR, 2023

  29. [29]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  30. [30]

    Cognitive load theory: An applied reintroduction for special and general educators

    Michael J Kennedy and John Elwood Romig. Cognitive load theory: An applied reintroduction for special and general educators. TEACHING Exceptional Children, 56(6):440–451, 2024

  31. [31]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022

  32. [32]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

  33. [33]

    Bringing generative ai to adaptive learning in education

    Hang Li, Tianlong Xu, Chaoli Zhang, Eason Chen, Jing Liang, Xing Fan, Haoyang Li, Jiliang Tang, and Qingsong Wen. Bringing generative ai to adaptive learning in education. arXiv preprint arXiv:2402.14601, 2024a

  34. [34]

    Evaluating mathematical reasoning of large language models: A focus on error identification and correction

    Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluating mathematical reasoning of large language models: A focus on error identification and correction. arXiv preprint arXiv:2406.00755, 2024b

  35. [35]

    Cmmath: A chinese multi-modal math skill evaluation benchmark for foundation models

    Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Zhi-Long Ji, Jin-Feng Bai, Zhen-Ru Pan, Fan-Hu Zeng, Jian Xu, Jia-Xin Zhang, and Cheng-Lin Liu. Cmmath: A chinese multi-modal math skill evaluation benchmark for foundation models. arXiv preprint arXiv:2407.12023, 2024c

  36. [36]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

  37. [37]

    Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark

    Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark. arXiv preprint arXiv:2405.12209, 2024b

  38. [38]

    Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data

    Xiao Liu, Zirui Wu, Xueqing Wu, Pan Lu, Kai-Wei Chang, and Yansong Feng. Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data. arXiv preprint arXiv:2402.17644, 2024c

  39. [39]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024a

  40. [40]

    A survey of deep learning for mathematical reasoning

    Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep learning for mathematical reasoning. arXiv preprint arXiv:2212.10535, 2022

  41. [41]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2024b

  42. [42]

    Chameleon: Plug-and-play compositional reasoning with large language models

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024c

  43. [43]

    Large Language Models: A Survey

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024

  44. [44]

    Scaling data-constrained language models

    Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024

  45. [45]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  46. [46]

    GPT-4V(ision) system card

    OpenAI. GPT-4V(ision) system card, 2024a. URL https://openai.com/index/gpt-4o-system-card/

  47. [47]

    GPT-4o mini: advancing cost-efficient intelligence

    OpenAI. GPT-4o mini: advancing cost-efficient intelligence, 2024b. URL https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

  48. [48]

    Cognitive load theory and instructional design: Recent developments

    Fred Paas, Alexander Renkl, and John Sweller. Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 2010

  49. [49]

    Gemini goes to med school: exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations

    Ankit Pal and Malaikannan Sankarasubbu. Gemini goes to med school: exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations. arXiv preprint arXiv:2402.07023, 2024

  50. [50]

    Multimath: Bridging visual and mathematical reasoning for large language models

    Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, and Zhi Tang. Multimath: Bridging visual and mathematical reasoning for large language models. arXiv preprint arXiv:2409.00147, 2024

  51. [51]

    We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

  52. [52]

    Elementary math learning through piaget's cognitive development stages

    Annabelle Rabillas, Osias Kit Kilag, Neil Cañete, Maria Trazona, Mery Lou Calope, and Jacqueline Kilag. Elementary math learning through piaget's cognitive development stages. Excellencia: International Multi-disciplinary Journal of Education (2994-9521), 1(4): 128–142, 2023

  53. [53]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  54. [54]

    Detecting Pretraining Data from Large Language Models

    Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023

  55. [55]

    Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

    Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294, 2024

  56. [56]

    How to bridge the gap between modalities: A comprehensive survey on multimodal large language model

    Shezheng Song, Xiaopeng Li, and Shasha Li. How to bridge the gap between modalities: A comprehensive survey on multimodal large language model. arXiv preprint arXiv:2311.07594, 2023

  57. [57]

    Scieval: A multi-level large language model evaluation benchmark for scientific research

    Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. Scieval: A multi-level large language model evaluation benchmark for scientific research. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 19053–19061, 2024

  58. [58]

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023

  59. [59]

    Memorization without overfitting: Analyzing the training dynamics of large language models

    Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. Advances in Neural Information Processing Systems, 35: 38274–38290, 2022

  60. [60]

    Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804, 2024a

  61. [61]

    Large language models for education: A survey and outlook

    Shen Wang, Tianlong Xu, Hang Li, Chaoli Zhang, Joleen Liang, Jiliang Tang, Philip S Yu, and Qingsong Wen. Large language models for education: A survey and outlook. arXiv preprint arXiv:2403.18105, 2024b

  62. [62]

    CogVLM: Visual Expert for Pretrained Language Models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023a

  63. [63]

    Large-scale multi-modal pre-trained models: A comprehensive survey

    Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, and Wen Gao. Large-scale multi-modal pre-trained models: A comprehensive survey. Machine Intelligence Research, 20(4): 447–482, 2023b

  64. [64]

    SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

    Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2024c

  65. [65]

    Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning

    Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024d

  66. [66]

    Are deep neural networks adequate behavioral models of human visual perception?

    Felix A Wichmann and Robert Geirhos. Are deep neural networks adequate behavioral models of human visual perception? Annual Review of Vision Science, 9(1): 501–524, 2023

  67. [67]

    A comprehensive survey of large language models and multimodal large language models in medicine

    Hanguang Xiao, Feizhong Zhou, Xingyue Liu, Tianqi Liu, Zhipeng Li, Xin Liu, and Xiaoxuan Huang. A comprehensive survey of large language models and multimodal large language models in medicine. arXiv preprint arXiv:2405.08603, 2024

  68. [68]

    Mind: Multimodal shopping intention distillation from large vision-language models for e-commerce purchase understanding

    Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Long Chen, and Yangqiu Song. Mind: Multimodal shopping intention distillation from large vision-language models for e-commerce purchase understanding. arXiv preprint arXiv:2406.10701, 2024a

  69. [69]

    Superclue-math6: Graded multi-step math reasoning benchmark for llms in chinese

    Liang Xu, Hang Xue, Lei Zhu, and Kangkang Zhao. Superclue-math6: Graded multi-step math reasoning benchmark for llms in chinese. arXiv preprint arXiv:2401.11819, 2024b

  70. [70]

    Raise a child in large language model: Towards effective and generalizable fine-tuning

    Runxin Xu, Fuli Luo, Zhiyuan Zhang, Chuanqi Tan, Baobao Chang, Songfang Huang, and Fei Huang. Raise a child in large language model: Towards effective and generalizable fine-tuning. arXiv preprint arXiv:2109.05687, 2021

  71. [71]

    Emerging synergies between large language models and machine learning in ecommerce recommendations

    Xiaonan Xu, Zheng Xu, Zhipeng Ling, Zhengyu Jin, and ShuQian Du. Emerging synergies between large language models and machine learning in ecommerce recommendations. arXiv preprint arXiv:2403.02760, 2024c

  72. [72]

    Georeasoner: Reasoning on geospatially grounded context for natural language understanding

    Yibo Yan and Joey Lee. Georeasoner: Reasoning on geospatially grounded context for natural language understanding. arXiv preprint arXiv:2408.11366, 2024

  73. [73]

    Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web

    Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, and Yuxuan Liang. Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web. In Proceedings of the ACM on Web Conference 2024, pp. 4006–4017, 2024

  74. [74]

    Exploring diverse in-context configurations for image captioning

    Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, and Xin Geng. Exploring diverse in-context configurations for image captioning. Advances in Neural Information Processing Systems, 36, 2024

  75. [75]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024

  76. [76]

    Yi: Open Foundation Models by 01.AI

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024

  77. [77]

    Large language model as attributed training data generator: A tale of diversity and bias

    Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander J Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. Large language model as attributed training data generator: A tale of diversity and bias. Advances in Neural Information Processing Systems, 36, 2024

  78. [78]

    Mr-ben: A comprehensive meta-reasoning benchmark for large language models

    Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, et al. Mr-ben: A comprehensive meta-reasoning benchmark for large language models. arXiv preprint arXiv:2406.13975, 2024

  79. [79]

    MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024

  80. [80]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023
