ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Aoxiao Zhong; Boyan Li; Hang Li; Hui Xiong; Jiahao Huo; Jiamin Su; Kun Wang; Philip S. Yu; Qingsong Wen; Shen Wang

arxiv: 2410.04509 · v3 · submitted 2024-10-06 · 💻 cs.CL

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Yibo Yan , Shen Wang , Jiahao Huo , Hang Li , Boyan Li , Jiamin Su , Xiong Gao , Yi-Fan Zhang

show 8 more authors

Tianlong Xu Zhendong Chu Aoxiao Zhong Kun Wang Hui Xiong Philip S. Yu Xuming Hu Qingsong Wen

This is my paper

Pith reviewed 2026-05-23 20:08 UTC · model grok-4.3

classification 💻 cs.CL

keywords multimodal error detectionErrorRadarmathematical reasoningMLLMsbenchmarkerror step identificationerror categorizationK-12 math problems

0 comments

The pith

Multimodal large language models lag human experts by about 10 percent on detecting errors in K-12 math problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces multimodal error detection as a new task for assessing how well MLLMs can spot and categorize mistakes in mathematical reasoning that includes diagrams or other visuals. It presents ErrorRadar, a benchmark of 2500 real student problems with annotations for error steps and categories. Experiments show that even top models like GPT-4o fall short of human performance, highlighting gaps in complex reasoning capabilities. This matters because error detection could help improve AI tutoring systems and model training. The benchmark provides a standardized way to measure progress beyond just solving problems correctly.

Core claim

ErrorRadar formulates multimodal error detection with two sub-tasks—error step identification and error categorization—and provides 2500 annotated K-12 problems to benchmark MLLMs, revealing that the best model trails human evaluators by around 10%.

What carries the argument

ErrorRadar benchmark consisting of 2500 multimodal math problems from real student interactions, annotated for error step identification and error categorization.

Load-bearing premise

The 2500 collected problems with their annotations accurately represent the range of complex mathematical reasoning errors encountered in multimodal settings.

What would settle it

A new MLLM achieving error detection accuracy within 5% of human evaluators on both sub-tasks of the ErrorRadar benchmark would challenge the claim that significant challenges remain.

Figures

Figures reproduced from arXiv: 2410.04509 by Aoxiao Zhong, Boyan Li, Hang Li, Hui Xiong, Jiahao Huo, Jiamin Su, Kun Wang, Philip S. Yu, Qingsong Wen, Shen Wang, Tianlong Xu, Xiong Gao, Xuming Hu, Yibo Yan, Yi-Fan Zhang, Zhendong Chu.

**Figure 1.** Figure 1: Comparison of research scope between previous work and our proposed ERRORRADAR benchmark on mathematical reasoning tasks. Benchmarks Venue Modality Student Ans. Error Det. TheoremQA (Chen et al., 2023a) EMNLP T - - MathBench (Liu et al., 2024b) ACL T - - MR-GSM8K (Zeng et al., 2024) arXiv T - - SciEval (Sun et al., 2024) AAAI T - - EIC (Li et al., 2024b) arXiv T - ✓ CMMaTH (Li et al., 2024c) arXiv T, I -… view at source ↗

**Figure 2.** Figure 2: Example of our well-annotated multimodal mathematical reasoning dataset ERRORRADAR, and performance comparison on error categorization and error step localization tasks among representative MLLMs. It is evident that even simple math problems can be mishandled by the currently superior MLLMs in one or both tasks, highlighting the challenging nature of our proposed multimodal error detection setting. ❶ We t… view at source ↗

**Figure 3.** Figure 3: Roadmap of ERRORRADAR dataset collection, annotation, and consistent update [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Dataset distribution of ERRORRADAR with respect to problem type and error category. where I(·) is the indicator function, which returns 1 if the predicted step matches the ground truth, and 0 otherwise. Similarly, the accuracy for error categorization is: Acccate = 1 N X N i=1 I(Cerror,i = Gerror,i). 2.2 DATA SOURCE & ANNOTATION Following the roadmap shown in [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: The distribution of CAL and non-CAL of closed-source and opensource MLLMs with top-3 CAL ACC. Finding #2: Weak open-source MLLMs tend to predict CAL category, leading to unusually high performance [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: The error category distribution of misjudged VIS cases of GPT-4o [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 9.** Figure 9: The accuracy of STEP and CATE of two representative MLLM series: LLaVA-NEXT and InternVL2. We denote Tiny, Small, Middle, Large as the 2B, 8B, 26B, 76B for InternVL2 and None, 7B, 13B, 72B for LLaVA-NEXT, respectively. Finding #1: The performance of MLLMs on STEP task increases with the scale of parameters. We observe a phenomenon similar to the scaling law (Kaplan et al., 2020) in our experiments. As sho… view at source ↗

**Figure 10.** Figure 10: Category of bad cases where GPT-4o predicts visual perception errors incorrectly. by evaluating their problem-solving levels, but they overlook tasks based on the student’s perspective, such as error detection, and thus fail to comprehensively evaluate the more complex role of current MLLMs. Therefore, we propose the ERRORRADAR benchmark, which is entirely based on real student response data to evaluate … view at source ↗

**Figure 11.** Figure 11: Multimodal mathematical example one (type: counting) from ERRORRADAR dataset. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Multimodal mathematical example two (type: plane geometry) from ERRORRADAR dataset. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Multimodal mathematical example three (type: plane geometry) from ERRORRADAR dataset. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

**Figure 14.** Figure 14: Multimodal mathematical example four (type: counting) from ERRORRADAR dataset. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 15.** Figure 15: Multimodal mathematical example five (type: plane geometry) from ERRORRADAR dataset. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_15.png] view at source ↗

**Figure 16.** Figure 16: Prompt for error step identification task. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗

**Figure 17.** Figure 17: Prompt for error categorization task. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗

**Figure 18.** Figure 18: Distribution of CAL and non-CAL category predictions of all MLLMs we evaluate. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_18.png] view at source ↗

read the original abstract

As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to revolutionize artificial intelligence is particularly promising, especially in addressing mathematical reasoning tasks. Current mathematical benchmarks predominantly focus on evaluating MLLMs' problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate the new task: multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in such a task. ErrorRadar evaluates two sub-tasks: error step identification and error categorization, providing a comprehensive framework for evaluating MLLMs' complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with rigorous annotation and rich metadata such as problem type and error category. Through extensive experiments, we evaluated both open-source and closed-source representative MLLMs, benchmarking their performance against educational expert evaluators. Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ErrorRadar introduces the first benchmark for multimodal math error detection, but the abstract gives almost no methodological detail so the 10% gap claim cannot be checked.

read the letter

The paper's core move is to define a new task—multimodal error detection with two sub-tasks, step identification and error categorization—and release ErrorRadar, a 2500-problem benchmark drawn from real student work. That is actually new; prior math benchmarks for MLLMs have stayed focused on solving problems, not diagnosing mistakes. The collection from live educational interactions plus the metadata on problem type and error category is a reasonable starting point for an education-oriented evaluation set. The headline result that even GPT-4o trails human experts by roughly 10% is the kind of number that could matter for tutoring applications if it holds up.

Referee Report

3 major / 2 minor

Summary. The paper formulates the new task of multimodal error detection for mathematical reasoning and introduces ErrorRadar, the first benchmark for it. ErrorRadar contains 2,500 K-12 multimodal math problems collected from real-world student interactions in one educational organization; it defines two subtasks (error step identification and error categorization) and evaluates representative open- and closed-source MLLMs against human experts, reporting that GPT-4o achieves the highest scores but remains approximately 10% behind human performance.

Significance. If the benchmark construction and evaluation protocols are shown to be reliable and representative, the work would usefully shift evaluation focus from problem solving to error detection and supply a real-world-derived testbed with metadata. The explicit formulation of the new task and the collection of authentic student errors constitute clear strengths; the reported performance gap, once statistically grounded, would provide a concrete target for future MLLM development.

major comments (3)

[Benchmark construction] Benchmark construction section: the manuscript asserts that the 2,500 problems were obtained via 'rigorous annotation' from a single educational organization yet supplies no inter-annotator agreement figures, annotation guidelines, selection criteria, or validation against external error distributions. This directly affects the central claim that ErrorRadar constitutes a valid and representative test of complex multimodal mathematical reasoning errors.
[Evaluation and results] Evaluation and results section: the headline result that GPT-4o 'is still around 10% behind human evaluation' is presented without definitions of the exact metrics for each sub-task, without the procedure used to obtain human scores, and without statistical significance tests or confidence intervals. These omissions render the performance comparison unverifiable and load-bearing for the paper's conclusions.
[Data description] Data description: no table or subsection reports the distribution of problem types, error categories, or curricular coverage, nor any comparison to established math-error taxonomies; without such information the representativeness argument cannot be assessed.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief statement of the precise metric definitions and the human-evaluation protocol.
[Related work] Related-work section should cite prior single-modality error-detection benchmarks to clarify the incremental contribution of the multimodal setting.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and will make the necessary revisions to enhance the clarity and rigor of the paper.

read point-by-point responses

Referee: [Benchmark construction] Benchmark construction section: the manuscript asserts that the 2,500 problems were obtained via 'rigorous annotation' from a single educational organization yet supplies no inter-annotator agreement figures, annotation guidelines, selection criteria, or validation against external error distributions. This directly affects the central claim that ErrorRadar constitutes a valid and representative test of complex multimodal mathematical reasoning errors.

Authors: We agree that additional details on the annotation process are necessary to substantiate the rigor of our benchmark. In the revised manuscript, we will expand the Benchmark Construction section to include inter-annotator agreement figures (such as Cohen's kappa), the annotation guidelines, selection criteria, and any validation steps against external error distributions. These will be provided in the main text or an appendix. revision: yes
Referee: [Evaluation and results] Evaluation and results section: the headline result that GPT-4o 'is still around 10% behind human evaluation' is presented without definitions of the exact metrics for each sub-task, without the procedure used to obtain human scores, and without statistical significance tests or confidence intervals. These omissions render the performance comparison unverifiable and load-bearing for the paper's conclusions.

Authors: We recognize the importance of clearly defining metrics and providing statistical analysis for the performance comparison. The revised paper will define the exact metrics for error step identification and error categorization, detail the human evaluation procedure (including the number of experts and their expertise), and include statistical significance tests along with confidence intervals to support the reported performance gap. revision: yes
Referee: [Data description] Data description: no table or subsection reports the distribution of problem types, error categories, or curricular coverage, nor any comparison to established math-error taxonomies; without such information the representativeness argument cannot be assessed.

Authors: We will add a dedicated subsection and tables in the Data Description section to report the distributions of problem types, error categories, and curricular coverage. Furthermore, we will include a comparison of our error categories to established math-error taxonomies from prior literature to better demonstrate the representativeness of the ErrorRadar benchmark. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark introduction with no derivations or self-referential reductions

full rationale

The paper formulates a new task and presents ErrorRadar as a benchmark of 2,500 problems collected from student interactions, with evaluation of MLLMs. No equations, derivations, fitted parameters, or load-bearing self-citations appear in the abstract or described structure. The work is self-contained empirical evaluation against human experts; the representativeness claim is an external validity issue, not a circular reduction of any claimed derivation to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the assumption that the collected and annotated problems validly represent real student errors; no free parameters, new physical entities, or mathematical axioms are introduced.

axioms (1)

domain assumption The 2,500 problems collected from real-world student interactions with rigorous annotation accurately capture complex mathematical reasoning errors in multimodal settings.
This premise is required for the benchmark to serve as a meaningful test of MLLM capabilities.

pith-pipeline@v0.9.0 · 5790 in / 1302 out tokens · 28863 ms · 2026-05-23T20:08:17.961478+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
cs.CL 2025-02 unverdicted novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 1 Pith paper · 22 internal anchors

[1]

Complexity in declarative process models: Metrics and multi-modal assessment of cognitive load

Amine Abbad-Andaloussi, Andrea Burattin, Tijs Slaats, Ekkart Kindler, and Barbara Weber. Complexity in declarative process models: Metrics and multi-modal assessment of cognitive load. Expert Systems with Applications, 233: 0 120924, 2023

work page 2023
[2]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Scaling laws for generative mixed-modal language models

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pp.\ 265--279. PMLR, 2023

work page 2023
[4]

Large language models for mathematical reasoning: Progresses and challenges

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024

work page arXiv 2024
[5]

Claude 3, 2024 a

Anthropic. Claude 3, 2024 a . URL https://www.anthropic.com/news/claude-3-haiku

work page 2024
[6]

Claude 3.5, 2024 b

Anthropic. Claude 3.5, 2024 b . URL https://www.anthropic.com/news/claude-3-5-sonnet

work page 2024
[7]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Turning large language models into cognitive models

Marcel Binz and Eric Schulz. Turning large language models into cognitive models. arXiv preprint arXiv:2306.03917, 2023

work page arXiv 2023
[9]

Theoremqa: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 7889--7901, 2023 a

work page 2023
[10]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

A survey on multimodal large language models for autonomous driving

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.\ 958--979, 2024

work page 2024
[13]

Advancing mathematics by guiding human intuition with ai

Alex Davies, Petar Veli c kovi \'c , Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Toma s ev, Richard Tanburn, Peter Battaglia, Charles Blundell, Andr \'a s Juh \'a sz, et al. Advancing mathematics by guiding human intuition with ai. Nature, 600 0 (7887): 0 70--74, 2021

work page 2021
[14]

Visual representations in the human brain are aligned with large language models

Adrien Doerig, Tim C Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, and Ian Charest. Visual representations in the human brain are aligned with large language models. arXiv preprint arXiv:2209.11737, 2022

work page arXiv 2022
[15]

Muffin or chihuahua? challenging multimodal large language models with multipanel vqa

Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Yang Zhao, Xinze Guan, and Xin Wang. Muffin or chihuahua? challenging multimodal large language models with multipanel vqa. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 6845--6863, 2024

work page 2024
[16]

Trends in integration of knowledge and large language models: A survey and taxonomy of methods, benchmarks, and applications

Zhangyin Feng, Weitao Ma, Weijiang Yu, Lei Huang, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. Trends in integration of knowledge and large language models: A survey and taxonomy of methods, benchmarks, and applications. arXiv preprint arXiv:2311.05876, 2023

work page arXiv 2023
[17]

Isobench: Benchmarking multimodal foundation models on isomorphic representations

Deqing Fu, Ghazal Khalighinejad, Ollie Liu, Bhuwan Dhingra, Dani Yogatama, Robin Jia, and Willie Neiswanger. Isobench: Benchmarking multimodal foundation models on isomorphic representations. arXiv preprint arXiv:2404.01266, 2024

work page arXiv 2024
[18]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction

Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, and Yuxuan Liang. Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction. arXiv preprint arXiv:2403.16831, 2024

work page arXiv 2024
[20]

Pefomed: Parameter efficient fine-tuning on multimodal large language models for medical visual question answering

Jinlong He, Pengfei Li, Gang Liu, Zixu Zhao, and Shenjun Zhong. Pefomed: Parameter efficient fine-tuning on multimodal large language models for medical visual question answering. arXiv preprint arXiv:2401.02797, 2024 a

work page arXiv 2024
[21]

Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning

Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, and Hua Huang. Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning. arXiv preprint arXiv:2401.14011, 2024 b

work page arXiv 2024
[22]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024

work page arXiv 2024
[24]

Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model

Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, and Xuming Hu. Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model. arXiv preprint arXiv:2406.11193, 2024

work page arXiv 2024
[25]

Describe-then-reason: Improving multimodal mathematical reasoning through visual comprehension training

Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, and Meng Jiang. Describe-then-reason: Improving multimodal mathematical reasoning through visual comprehension training. arXiv preprint arXiv:2404.14604, 2024

work page arXiv 2024
[26]

New generation deep learning for video object detection: A survey

Licheng Jiao, Ruohan Zhang, Fang Liu, Shuyuan Yang, Biao Hou, Lingling Li, and Xu Tang. New generation deep learning for video object detection: A survey. IEEE Transactions on Neural Networks and Learning Systems, 33 0 (8): 0 3195--3215, 2021

work page 2021
[27]

Learning instance-level representation for large-scale multi-modal pretraining in e-commerce

Yang Jin, Yongzhi Li, Zehuan Yuan, and Yadong Mu. Learning instance-level representation for large-scale multi-modal pretraining in e-commerce. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 11060--11069, 2023

work page 2023
[28]

Large language models struggle to learn long-tail knowledge

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, pp.\ 15696--15707. PMLR, 2023

work page 2023
[29]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[30]

Cognitive load theory: An applied reintroduction for special and general educators

Michael J Kennedy and John Elwood Romig. Cognitive load theory: An applied reintroduction for special and general educators. TEACHING Exceptional Children, 56 0 (6): 0 440--451, 2024

work page 2024
[31]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35: 0 22199--22213, 2022

work page 2022
[32]

Solving quantitative reasoning problems with language models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 0 3843--3857, 2022

work page 2022
[33]

Bringing generative ai to adaptive learning in education

Hang Li, Tianlong Xu, Chaoli Zhang, Eason Chen, Jing Liang, Xing Fan, Haoyang Li, Jiliang Tang, and Qingsong Wen. Bringing generative ai to adaptive learning in education. arXiv preprint arXiv:2402.14601, 2024 a

work page arXiv 2024
[34]

Evaluating mathematical reasoning of large language models: A focus on error identification and correction

Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluating mathematical reasoning of large language models: A focus on error identification and correction. arXiv preprint arXiv:2406.00755, 2024 b

work page arXiv 2024
[35]

Cmmath: A chinese multi-modal math skill evaluation benchmark for foundation models

Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Zhi-Long Ji, Jin-Feng Bai, Zhen-Ru Pan, Fan-Hu Zeng, Jian Xu, Jia-Xin Zhang, and Cheng-Lin Liu. Cmmath: A chinese multi-modal math skill evaluation benchmark for foundation models. arXiv preprint arXiv:2407.12023, 2024 c

work page arXiv 2024
[36]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024 a

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024 a . URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

work page 2024
[37]

Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark

Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark. arXiv preprint arXiv:2405.12209, 2024 b

work page arXiv 2024
[38]

Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data

Xiao Liu, Zirui Wu, Xueqing Wu, Pan Lu, Kai-Wei Chang, and Yansong Feng. Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data. arXiv preprint arXiv:2402.17644, 2024 c

work page arXiv 2024
[39]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

A survey of deep learning for mathematical reasoning

Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep learning for mathematical reasoning. arXiv preprint arXiv:2212.10535, 2022

work page arXiv 2022
[41]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Chameleon: Plug-and-play compositional reasoning with large language models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024 c

work page 2024
[43]

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Scaling data-constrained language models

Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[45]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

GPT-4V(ision) system card, 2024 a

OpenAI. GPT-4V(ision) system card, 2024 a . URL https://openai.com/index/gpt-4o-system-card/

work page 2024
[47]

Gpt-4o mini: advancing cost-efficient intelligence, 2024 b

OpenAI. Gpt-4o mini: advancing cost-efficient intelligence, 2024 b . URL https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

work page 2024
[48]

Cognitive load theory and instructional design: Recent developments

Fred Paas, Alexander Renkl, and John Sweller. Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 2010

work page 2010
[49]

Gemini goes to med school: exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations

Ankit Pal and Malaikannan Sankarasubbu. Gemini goes to med school: exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations. arXiv preprint arXiv:2402.07023, 2024

work page arXiv 2024
[50]

Multimath: Bridging visual and mathematical reasoning for large language models

Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, and Zhi Tang. Multimath: Bridging visual and mathematical reasoning for large language models. arXiv preprint arXiv:2409.00147, 2024

work page arXiv 2024
[51]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Elementary math learning through piaget's cognitive development stages

Annabelle Rabillas, Osias Kit Kilag, Neil Ca \ n ete, Maria Trazona, Mery Lou Calope, and Jacqueline Kilag. Elementary math learning through piaget's cognitive development stages. Excellencia: International Multi-disciplinary Journal of Education (2994-9521), 1 0 (4): 0 128--142, 2023

work page 2023
[53]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Detecting Pretraining Data from Large Language Models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294, 2024

work page arXiv 2024
[56]

How to bridge the gap between modalities: A comprehensive survey on multimodal large language model

Shezheng Song, Xiaopeng Li, and Shasha Li. How to bridge the gap between modalities: A comprehensive survey on multimodal large language model. arXiv preprint arXiv:2311.07594, 2023

work page arXiv 2023
[57]

Scieval: A multi-level large language model evaluation benchmark for scientific research

Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. Scieval: A multi-level large language model evaluation benchmark for scientific research. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 19053--19061, 2024

work page 2024
[58]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Memorization without overfitting: Analyzing the training dynamics of large language models

Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. Advances in Neural Information Processing Systems, 35: 0 38274--38290, 2022

work page 2022
[60]

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

Large language models for education: A survey and outlook

Shen Wang, Tianlong Xu, Hang Li, Chaoli Zhang, Joleen Liang, Jiliang Tang, Philip S Yu, and Qingsong Wen. Large language models for education: A survey and outlook. arXiv preprint arXiv:2403.18105, 2024 b

work page arXiv 2024
[62]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023
[63]

Large-scale multi-modal pre-trained models: A comprehensive survey

Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, and Wen Gao. Large-scale multi-modal pre-trained models: A comprehensive survey. Machine Intelligence Research, 20 0 (4): 0 447--482, 2023 b

work page 2023
[64]

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2024 c

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024 d

work page arXiv 2024
[66]

Are deep neural networks adequate behavioral models of human visual perception? Annual Review of Vision Science, 9 0 (1): 0 501--524, 2023

Felix A Wichmann and Robert Geirhos. Are deep neural networks adequate behavioral models of human visual perception? Annual Review of Vision Science, 9 0 (1): 0 501--524, 2023

work page 2023
[67]

A comprehensive survey of large language models and multimodal large language models in medicine

Hanguang Xiao, Feizhong Zhou, Xingyue Liu, Tianqi Liu, Zhipeng Li, Xin Liu, and Xiaoxuan Huang. A comprehensive survey of large language models and multimodal large language models in medicine. arXiv preprint arXiv:2405.08603, 2024

work page arXiv 2024
[68]

Mind: Multimodal shopping intention distillation from large vision-language models for e-commerce purchase understanding

Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Long Chen, and Yangqiu Song. Mind: Multimodal shopping intention distillation from large vision-language models for e-commerce purchase understanding. arXiv preprint arXiv:2406.10701, 2024 a

work page arXiv 2024
[69]

Superclue-math6: Graded multi-step math reasoning benchmark for llms in chinese

Liang Xu, Hang Xue, Lei Zhu, and Kangkang Zhao. Superclue-math6: Graded multi-step math reasoning benchmark for llms in chinese. arXiv preprint arXiv:2401.11819, 2024 b

work page arXiv 2024
[70]

Raise a child in large language model: Towards effective and generalizable fine-tuning

Runxin Xu, Fuli Luo, Zhiyuan Zhang, Chuanqi Tan, Baobao Chang, Songfang Huang, and Fei Huang. Raise a child in large language model: Towards effective and generalizable fine-tuning. arXiv preprint arXiv:2109.05687, 2021

work page arXiv 2021
[71]

Emerging synergies between large language models and machine learning in ecommerce recommendations

Xiaonan Xu, Zheng Xu, Zhipeng Ling, Zhengyu Jin, and ShuQian Du. Emerging synergies between large language models and machine learning in ecommerce recommendations. arXiv preprint arXiv:2403.02760, 2024 c

work page arXiv 2024
[72]

Georeasoner: Reasoning on geospatially grounded context for natural language understanding

Yibo Yan and Joey Lee. Georeasoner: Reasoning on geospatially grounded context for natural language understanding. arXiv preprint arXiv:2408.11366, 2024

work page arXiv 2024
[73]

Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web

Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, and Yuxuan Liang. Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web. In Proceedings of the ACM on Web Conference 2024, pp.\ 4006--4017, 2024

work page 2024
[74]

Exploring diverse in-context configurations for image captioning

Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, and Xin Geng. Exploring diverse in-context configurations for image captioning. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[75]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[76]

Yi: Open Foundation Models by 01.AI

Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[77]

Large language model as attributed training data generator: A tale of diversity and bias

Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander J Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. Large language model as attributed training data generator: A tale of diversity and bias. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[78]

Mr-ben: A comprehensive meta-reasoning benchmark for large language models

Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, et al. Mr-ben: A comprehensive meta-reasoning benchmark for large language models. arXiv preprint arXiv:2406.13975, 2024

work page arXiv 2024
[79]

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[80]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

Showing first 80 references.

[1] [1]

Complexity in declarative process models: Metrics and multi-modal assessment of cognitive load

Amine Abbad-Andaloussi, Andrea Burattin, Tijs Slaats, Ekkart Kindler, and Barbara Weber. Complexity in declarative process models: Metrics and multi-modal assessment of cognitive load. Expert Systems with Applications, 233: 0 120924, 2023

work page 2023

[2] [2]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Scaling laws for generative mixed-modal language models

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pp.\ 265--279. PMLR, 2023

work page 2023

[4] [4]

Large language models for mathematical reasoning: Progresses and challenges

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024

work page arXiv 2024

[5] [5]

Claude 3, 2024 a

Anthropic. Claude 3, 2024 a . URL https://www.anthropic.com/news/claude-3-haiku

work page 2024

[6] [6]

Claude 3.5, 2024 b

Anthropic. Claude 3.5, 2024 b . URL https://www.anthropic.com/news/claude-3-5-sonnet

work page 2024

[7] [7]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Turning large language models into cognitive models

Marcel Binz and Eric Schulz. Turning large language models into cognitive models. arXiv preprint arXiv:2306.03917, 2023

work page arXiv 2023

[9] [9]

Theoremqa: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 7889--7901, 2023 a

work page 2023

[10] [10]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[12] [12]

A survey on multimodal large language models for autonomous driving

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.\ 958--979, 2024

work page 2024

[13] [13]

Advancing mathematics by guiding human intuition with ai

Alex Davies, Petar Veli c kovi \'c , Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Toma s ev, Richard Tanburn, Peter Battaglia, Charles Blundell, Andr \'a s Juh \'a sz, et al. Advancing mathematics by guiding human intuition with ai. Nature, 600 0 (7887): 0 70--74, 2021

work page 2021

[14] [14]

Visual representations in the human brain are aligned with large language models

Adrien Doerig, Tim C Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, and Ian Charest. Visual representations in the human brain are aligned with large language models. arXiv preprint arXiv:2209.11737, 2022

work page arXiv 2022

[15] [15]

Muffin or chihuahua? challenging multimodal large language models with multipanel vqa

Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Yang Zhao, Xinze Guan, and Xin Wang. Muffin or chihuahua? challenging multimodal large language models with multipanel vqa. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 6845--6863, 2024

work page 2024

[16] [16]

Trends in integration of knowledge and large language models: A survey and taxonomy of methods, benchmarks, and applications

Zhangyin Feng, Weitao Ma, Weijiang Yu, Lei Huang, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. Trends in integration of knowledge and large language models: A survey and taxonomy of methods, benchmarks, and applications. arXiv preprint arXiv:2311.05876, 2023

work page arXiv 2023

[17] [17]

Isobench: Benchmarking multimodal foundation models on isomorphic representations

Deqing Fu, Ghazal Khalighinejad, Ollie Liu, Bhuwan Dhingra, Dani Yogatama, Robin Jia, and Willie Neiswanger. Isobench: Benchmarking multimodal foundation models on isomorphic representations. arXiv preprint arXiv:2404.01266, 2024

work page arXiv 2024

[18] [18]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction

Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, and Yuxuan Liang. Urbanvlp: A multi-granularity vision-language pre-trained foundation model for urban indicator prediction. arXiv preprint arXiv:2403.16831, 2024

work page arXiv 2024

[20] [20]

Pefomed: Parameter efficient fine-tuning on multimodal large language models for medical visual question answering

Jinlong He, Pengfei Li, Gang Liu, Zixu Zhao, and Shenjun Zhong. Pefomed: Parameter efficient fine-tuning on multimodal large language models for medical visual question answering. arXiv preprint arXiv:2401.02797, 2024 a

work page arXiv 2024

[21] [21]

Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning

Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, and Hua Huang. Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning. arXiv preprint arXiv:2401.14011, 2024 b

work page arXiv 2024

[22] [22]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[23] [23]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024

work page arXiv 2024

[24] [24]

Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model

Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, and Xuming Hu. Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model. arXiv preprint arXiv:2406.11193, 2024

work page arXiv 2024

[25] [25]

Describe-then-reason: Improving multimodal mathematical reasoning through visual comprehension training

Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, and Meng Jiang. Describe-then-reason: Improving multimodal mathematical reasoning through visual comprehension training. arXiv preprint arXiv:2404.14604, 2024

work page arXiv 2024

[26] [26]

New generation deep learning for video object detection: A survey

Licheng Jiao, Ruohan Zhang, Fang Liu, Shuyuan Yang, Biao Hou, Lingling Li, and Xu Tang. New generation deep learning for video object detection: A survey. IEEE Transactions on Neural Networks and Learning Systems, 33 0 (8): 0 3195--3215, 2021

work page 2021

[27] [27]

Learning instance-level representation for large-scale multi-modal pretraining in e-commerce

Yang Jin, Yongzhi Li, Zehuan Yuan, and Yadong Mu. Learning instance-level representation for large-scale multi-modal pretraining in e-commerce. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 11060--11069, 2023

work page 2023

[28] [28]

Large language models struggle to learn long-tail knowledge

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, pp.\ 15696--15707. PMLR, 2023

work page 2023

[29] [29]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[30] [30]

Cognitive load theory: An applied reintroduction for special and general educators

Michael J Kennedy and John Elwood Romig. Cognitive load theory: An applied reintroduction for special and general educators. TEACHING Exceptional Children, 56 0 (6): 0 440--451, 2024

work page 2024

[31] [31]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35: 0 22199--22213, 2022

work page 2022

[32] [32]

Solving quantitative reasoning problems with language models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 0 3843--3857, 2022

work page 2022

[33] [33]

Bringing generative ai to adaptive learning in education

Hang Li, Tianlong Xu, Chaoli Zhang, Eason Chen, Jing Liang, Xing Fan, Haoyang Li, Jiliang Tang, and Qingsong Wen. Bringing generative ai to adaptive learning in education. arXiv preprint arXiv:2402.14601, 2024 a

work page arXiv 2024

[34] [34]

Evaluating mathematical reasoning of large language models: A focus on error identification and correction

Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. Evaluating mathematical reasoning of large language models: A focus on error identification and correction. arXiv preprint arXiv:2406.00755, 2024 b

work page arXiv 2024

[35] [35]

Cmmath: A chinese multi-modal math skill evaluation benchmark for foundation models

Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Zhi-Long Ji, Jin-Feng Bai, Zhen-Ru Pan, Fan-Hu Zeng, Jian Xu, Jia-Xin Zhang, and Cheng-Lin Liu. Cmmath: A chinese multi-modal math skill evaluation benchmark for foundation models. arXiv preprint arXiv:2407.12023, 2024 c

work page arXiv 2024

[36] [36]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024 a

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024 a . URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

work page 2024

[37] [37]

Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark

Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark. arXiv preprint arXiv:2405.12209, 2024 b

work page arXiv 2024

[38] [38]

Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data

Xiao Liu, Zirui Wu, Xueqing Wu, Pan Lu, Kai-Wei Chang, and Yansong Feng. Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data. arXiv preprint arXiv:2402.17644, 2024 c

work page arXiv 2024

[39] [39]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

A survey of deep learning for mathematical reasoning

Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep learning for mathematical reasoning. arXiv preprint arXiv:2212.10535, 2022

work page arXiv 2022

[41] [41]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Chameleon: Plug-and-play compositional reasoning with large language models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024 c

work page 2024

[43] [43]

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

Scaling data-constrained language models

Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[45] [45]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [46]

GPT-4V(ision) system card, 2024 a

OpenAI. GPT-4V(ision) system card, 2024 a . URL https://openai.com/index/gpt-4o-system-card/

work page 2024

[47] [47]

Gpt-4o mini: advancing cost-efficient intelligence, 2024 b

OpenAI. Gpt-4o mini: advancing cost-efficient intelligence, 2024 b . URL https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

work page 2024

[48] [48]

Cognitive load theory and instructional design: Recent developments

Fred Paas, Alexander Renkl, and John Sweller. Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 2010

work page 2010

[49] [49]

Gemini goes to med school: exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations

Ankit Pal and Malaikannan Sankarasubbu. Gemini goes to med school: exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations. arXiv preprint arXiv:2402.07023, 2024

work page arXiv 2024

[50] [50]

Multimath: Bridging visual and mathematical reasoning for large language models

Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, and Zhi Tang. Multimath: Bridging visual and mathematical reasoning for large language models. arXiv preprint arXiv:2409.00147, 2024

work page arXiv 2024

[51] [51]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

Elementary math learning through piaget's cognitive development stages

Annabelle Rabillas, Osias Kit Kilag, Neil Ca \ n ete, Maria Trazona, Mery Lou Calope, and Jacqueline Kilag. Elementary math learning through piaget's cognitive development stages. Excellencia: International Multi-disciplinary Journal of Education (2994-9521), 1 0 (4): 0 128--142, 2023

work page 2023

[53] [53]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

Detecting Pretraining Data from Large Language Models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[55] [55]

Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294, 2024

work page arXiv 2024

[56] [56]

How to bridge the gap between modalities: A comprehensive survey on multimodal large language model

Shezheng Song, Xiaopeng Li, and Shasha Li. How to bridge the gap between modalities: A comprehensive survey on multimodal large language model. arXiv preprint arXiv:2311.07594, 2023

work page arXiv 2023

[57] [57]

Scieval: A multi-level large language model evaluation benchmark for scientific research

Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. Scieval: A multi-level large language model evaluation benchmark for scientific research. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 19053--19061, 2024

work page 2024

[58] [58]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[59] [59]

Memorization without overfitting: Analyzing the training dynamics of large language models

Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. Advances in Neural Information Processing Systems, 35: 0 38274--38290, 2022

work page 2022

[60] [60]

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024

[61] [61]

Large language models for education: A survey and outlook

Shen Wang, Tianlong Xu, Hang Li, Chaoli Zhang, Joleen Liang, Jiliang Tang, Philip S Yu, and Qingsong Wen. Large language models for education: A survey and outlook. arXiv preprint arXiv:2403.18105, 2024 b

work page arXiv 2024

[62] [62]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023

[63] [63]

Large-scale multi-modal pre-trained models: A comprehensive survey

Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, and Wen Gao. Large-scale multi-modal pre-trained models: A comprehensive survey. Machine Intelligence Research, 20 0 (4): 0 447--482, 2023 b

work page 2023

[64] [64]

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2024 c

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning

Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024 d

work page arXiv 2024

[66] [66]

Are deep neural networks adequate behavioral models of human visual perception? Annual Review of Vision Science, 9 0 (1): 0 501--524, 2023

Felix A Wichmann and Robert Geirhos. Are deep neural networks adequate behavioral models of human visual perception? Annual Review of Vision Science, 9 0 (1): 0 501--524, 2023

work page 2023

[67] [67]

A comprehensive survey of large language models and multimodal large language models in medicine

Hanguang Xiao, Feizhong Zhou, Xingyue Liu, Tianqi Liu, Zhipeng Li, Xin Liu, and Xiaoxuan Huang. A comprehensive survey of large language models and multimodal large language models in medicine. arXiv preprint arXiv:2405.08603, 2024

work page arXiv 2024

[68] [68]

Mind: Multimodal shopping intention distillation from large vision-language models for e-commerce purchase understanding

Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Long Chen, and Yangqiu Song. Mind: Multimodal shopping intention distillation from large vision-language models for e-commerce purchase understanding. arXiv preprint arXiv:2406.10701, 2024 a

work page arXiv 2024

[69] [69]

Superclue-math6: Graded multi-step math reasoning benchmark for llms in chinese

Liang Xu, Hang Xue, Lei Zhu, and Kangkang Zhao. Superclue-math6: Graded multi-step math reasoning benchmark for llms in chinese. arXiv preprint arXiv:2401.11819, 2024 b

work page arXiv 2024

[70] [70]

Raise a child in large language model: Towards effective and generalizable fine-tuning

Runxin Xu, Fuli Luo, Zhiyuan Zhang, Chuanqi Tan, Baobao Chang, Songfang Huang, and Fei Huang. Raise a child in large language model: Towards effective and generalizable fine-tuning. arXiv preprint arXiv:2109.05687, 2021

work page arXiv 2021

[71] [71]

Emerging synergies between large language models and machine learning in ecommerce recommendations

Xiaonan Xu, Zheng Xu, Zhipeng Ling, Zhengyu Jin, and ShuQian Du. Emerging synergies between large language models and machine learning in ecommerce recommendations. arXiv preprint arXiv:2403.02760, 2024 c

work page arXiv 2024

[72] [72]

Georeasoner: Reasoning on geospatially grounded context for natural language understanding

Yibo Yan and Joey Lee. Georeasoner: Reasoning on geospatially grounded context for natural language understanding. arXiv preprint arXiv:2408.11366, 2024

work page arXiv 2024

[73] [73]

Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web

Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, and Yuxuan Liang. Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web. In Proceedings of the ACM on Web Conference 2024, pp.\ 4006--4017, 2024

work page 2024

[74] [74]

Exploring diverse in-context configurations for image captioning

Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, and Xin Geng. Exploring diverse in-context configurations for image captioning. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[75] [75]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[76] [76]

Yi: Open Foundation Models by 01.AI

Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[77] [77]

Large language model as attributed training data generator: A tale of diversity and bias

Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander J Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. Large language model as attributed training data generator: A tale of diversity and bias. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[78] [78]

Mr-ben: A comprehensive meta-reasoning benchmark for large language models

Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, et al. Mr-ben: A comprehensive meta-reasoning benchmark for large language models. arXiv preprint arXiv:2406.13975, 2024

work page arXiv 2024

[79] [79]

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[80] [80]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023