ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
Pith reviewed 2026-05-23 20:08 UTC · model grok-4.3
The pith
Even the best multimodal large language model tested, GPT-4o, trails human experts by about 10 percent on detecting errors in K-12 math problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ErrorRadar formulates multimodal error detection as two sub-tasks, error step identification and error categorization, and provides 2,500 annotated K-12 problems for benchmarking MLLMs; the best model trails human evaluators by around 10%.
What carries the argument
The ErrorRadar benchmark: 2,500 multimodal math problems drawn from real student interactions and annotated for error step identification and error categorization.
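To make the two sub-tasks concrete, the sketch below scores a model on a toy set of annotated problems. The record layout (`error_step`, `error_category`), the category names, and the use of plain accuracy are all assumptions for illustration; this page does not state the paper's schema or its exact metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Hypothetical label for one problem; field names are assumptions."""
    error_step: int      # index of the step judged to contain the error
    error_category: str  # name of the error type (illustrative values only)


def subtask_accuracy(gold, pred):
    """Return (step accuracy, category accuracy) as fractions in [0, 1]."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    n = len(gold)
    step_hits = sum(g.error_step == p.error_step for g, p in zip(gold, pred))
    cat_hits = sum(g.error_category == p.error_category for g, p in zip(gold, pred))
    return step_hits / n, cat_hits / n


# Toy data, not taken from the benchmark.
gold = [Label(2, "calculation"), Label(1, "visual perception"),
        Label(3, "reasoning"), Label(2, "calculation")]
pred = [Label(2, "calculation"), Label(1, "reasoning"),
        Label(4, "reasoning"), Label(2, "calculation")]

step_acc, cat_acc = subtask_accuracy(gold, pred)
print(f"step identification: {step_acc:.2f}, categorization: {cat_acc:.2f}")
```

Scoring the two sub-tasks separately keeps a model that finds the right step but mislabels the error type distinguishable from one that fails at both.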
Load-bearing premise
The 2500 collected problems with their annotations accurately represent the range of complex mathematical reasoning errors encountered in multimodal settings.
What would settle it
A new MLLM achieving error detection accuracy within 5% of human evaluators on both sub-tasks of the ErrorRadar benchmark would undercut the claim that significant challenges remain.
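The settling condition can be written as a small check. This is a sketch under two assumptions: that "within 5%" means an absolute gap of at most 0.05 in accuracy, and that each sub-task has a single scalar score. The function name and all numbers are hypothetical.

```python
def within_margin(model_acc, human_acc, margin=0.05):
    """True if the model is at most `margin` below the human score on every sub-task."""
    return all(human_acc[task] - model_acc[task] <= margin for task in human_acc)


# Made-up scores for illustration only.
human = {"step": 0.90, "category": 0.88}
close_model = {"step": 0.86, "category": 0.85}
far_model = {"step": 0.86, "category": 0.80}

print(within_margin(close_model, human))
print(within_margin(far_model, human))
```

Requiring the margin on every sub-task mirrors the "both sub-tasks" wording: a model that closes the gap on step identification alone would not meet the condition.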
Original abstract
As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to revolutionize artificial intelligence is particularly promising, especially in addressing mathematical reasoning tasks. Current mathematical benchmarks predominantly focus on evaluating MLLMs' problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate the new task: multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in such a task. ErrorRadar evaluates two sub-tasks: error step identification and error categorization, providing a comprehensive framework for evaluating MLLMs' complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with rigorous annotation and rich metadata such as problem type and error category. Through extensive experiments, we evaluated both open-source and closed-source representative MLLMs, benchmarking their performance against educational expert evaluators. Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates the new task of multimodal error detection for mathematical reasoning and introduces ErrorRadar, the first benchmark for it. ErrorRadar contains 2,500 K-12 multimodal math problems collected from real-world student interactions in one educational organization; it defines two subtasks (error step identification and error categorization) and evaluates representative open- and closed-source MLLMs against human experts, reporting that GPT-4o achieves the highest scores but remains approximately 10% behind human performance.
Significance. If the benchmark construction and evaluation protocols are shown to be reliable and representative, the work would usefully shift evaluation focus from problem solving to error detection and supply a real-world-derived testbed with metadata. The explicit formulation of the new task and the collection of authentic student errors constitute clear strengths; the reported performance gap, once statistically grounded, would provide a concrete target for future MLLM development.
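One way to statistically ground a human-model gap of this kind is a paired percentile bootstrap over per-problem correctness. The sketch below is an illustration, not the paper's procedure: the function name, the 0/1 correctness vectors, and the resampling settings are all assumptions, and the data are synthetic.

```python
import random


def bootstrap_gap_ci(human_correct, model_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (human accuracy - model accuracy) on paired items.

    Both inputs are lists of 0/1 flags, aligned so index i refers to the same problem.
    Returns (point estimate, lower bound, upper bound).
    """
    if len(human_correct) != len(model_correct) or not human_correct:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(human_correct)
    diffs = [h - m for h, m in zip(human_correct, model_correct)]
    point = sum(diffs) / n
    gaps = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample problems with replacement
        gaps.append(sum(sample) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Synthetic example: humans right on 90 of 100 problems, model right on 80.
human_flags = [1] * 90 + [0] * 10
model_flags = [1] * 80 + [0] * 20
point, lower, upper = bootstrap_gap_ci(human_flags, model_flags)
print(f"gap = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Resampling the per-problem differences, rather than the two score lists independently, respects the fact that humans and models are evaluated on the same problems.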
major comments (3)
- [Benchmark construction] Benchmark construction section: the manuscript asserts that the 2,500 problems were obtained via 'rigorous annotation' from a single educational organization yet supplies no inter-annotator agreement figures, annotation guidelines, selection criteria, or validation against external error distributions. This directly affects the central claim that ErrorRadar constitutes a valid and representative test of complex multimodal mathematical reasoning errors.
- [Evaluation and results] Evaluation and results section: the headline result that GPT-4o 'is still around 10% behind human evaluation' is presented without definitions of the exact metrics for each sub-task, without the procedure used to obtain human scores, and without statistical significance tests or confidence intervals. These omissions leave the performance comparison, which is load-bearing for the paper's conclusions, unverifiable.
- [Data description] Data description: no table or subsection reports the distribution of problem types, error categories, or curricular coverage, nor any comparison to established math-error taxonomies; without such information the representativeness argument cannot be assessed.
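The distribution report requested in the Data description comment amounts to a count-and-share table per metadata field. A minimal sketch follows; the field values (`geometry`, `algebra`, `statistics`) are invented placeholders, since this page does not list the benchmark's actual problem types or error categories.

```python
from collections import Counter


def share_table(labels):
    """Return {label: (count, share)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Placeholder metadata, not drawn from ErrorRadar.
problem_types = ["geometry", "algebra", "geometry", "statistics", "geometry", "algebra"]

table = share_table(problem_types)
for label, (count, share) in table.items():
    print(f"{label:<12}{count:>4}{share:>8.1%}")
```

The same function can be applied to any categorical field, so one table per field (problem type, error category, grade level) would cover the referee's request.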
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief statement of the precise metric definitions and the human-evaluation protocol.
- [Related work] Related-work section should cite prior single-modality error-detection benchmarks to clarify the incremental contribution of the multimodal setting.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and will make the necessary revisions to enhance the clarity and rigor of the paper.
Point-by-point responses
- Referee: [Benchmark construction] Benchmark construction section: the manuscript asserts that the 2,500 problems were obtained via 'rigorous annotation' from a single educational organization yet supplies no inter-annotator agreement figures, annotation guidelines, selection criteria, or validation against external error distributions. This directly affects the central claim that ErrorRadar constitutes a valid and representative test of complex multimodal mathematical reasoning errors.
  Authors: We agree that additional details on the annotation process are necessary to substantiate the rigor of our benchmark. In the revised manuscript, we will expand the Benchmark Construction section to include inter-annotator agreement figures (such as Cohen's kappa), the annotation guidelines, selection criteria, and any validation steps against external error distributions. These will be provided in the main text or an appendix.
  Revision: yes
- Referee: [Evaluation and results] Evaluation and results section: the headline result that GPT-4o 'is still around 10% behind human evaluation' is presented without definitions of the exact metrics for each sub-task, without the procedure used to obtain human scores, and without statistical significance tests or confidence intervals. These omissions leave the performance comparison, which is load-bearing for the paper's conclusions, unverifiable.
  Authors: We recognize the importance of clearly defining metrics and providing statistical analysis for the performance comparison. The revised paper will define the exact metrics for error step identification and error categorization, detail the human evaluation procedure (including the number of experts and their expertise), and include statistical significance tests along with confidence intervals to support the reported performance gap.
  Revision: yes
- Referee: [Data description] Data description: no table or subsection reports the distribution of problem types, error categories, or curricular coverage, nor any comparison to established math-error taxonomies; without such information the representativeness argument cannot be assessed.
  Authors: We will add a dedicated subsection and tables in the Data Description section to report the distributions of problem types, error categories, and curricular coverage. Furthermore, we will include a comparison of our error categories to established math-error taxonomies from prior literature to better demonstrate the representativeness of the ErrorRadar benchmark.
  Revision: yes
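The inter-annotator agreement figure named in the first response, Cohen's kappa, compares observed agreement p_o with the agreement p_e expected by chance from each annotator's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators; the labels are toy values, and nothing here reflects the paper's actual annotation data or tooling.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for four problems (hypothetical category names).
annotator_1 = ["calculation", "calculation", "reasoning", "reasoning"]
annotator_2 = ["calculation", "calculation", "reasoning", "calculation"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the annotators agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, giving kappa = 0.5; raw agreement alone would overstate reliability.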
Circularity Check
No circularity: empirical benchmark introduction with no derivations or self-referential reductions
Full rationale
The paper formulates a new task and presents ErrorRadar as a benchmark of 2,500 problems collected from student interactions, with evaluation of MLLMs. No equations, derivations, fitted parameters, or load-bearing self-citations appear in the abstract or described structure. The work is self-contained empirical evaluation against human experts; the representativeness claim is an external validity issue, not a circular reduction of any claimed derivation to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 2,500 problems collected from real-world student interactions with rigorous annotation accurately capture complex mathematical reasoning errors in multimodal settings.
Forward citations
Cited by 1 Pith paper
- Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning. Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.