Recognition: 2 theorem links
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Pith reviewed 2026-05-17 01:22 UTC · model grok-4.3
The pith
MathVerse shows multi-modal LLMs often solve visual math problems using text rather than diagrams.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MathVerse collects 2,612 high-quality math problems with diagrams from public sources and transforms each into six versions with varying multi-modal information content for a total of 15K samples. This design enables an equitable evaluation of whether MLLMs truly understand visual diagrams when solving math problems. The benchmark further includes a Chain-of-Thought evaluation strategy that uses GPT-4(V) to extract key reasoning steps and provide detailed error analysis on intermediate outputs.
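As a concrete reading of that protocol, here is a minimal sketch of step-level CoT scoring in Python. The `judge` callable is a hypothetical stand-in for GPT-4(V), and the blend between the multi-step score and final-answer correctness is an illustrative assumption, since this summary does not specify the paper's aggregation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepJudgment:
    step: str         # one extracted reasoning step
    correct: bool     # judge's verdict on this step
    error_type: str   # e.g. "visual perception" or "reasoning"; categories illustrative

def cot_score(response: str,
              judge: Callable[[str], List[StepJudgment]],
              final_answer_correct: bool,
              answer_weight: float = 0.5) -> float:
    """Score a response by its intermediate steps rather than a bare True/False.

    `judge` stands in for GPT-4(V): it adaptively splits `response` into key
    reasoning steps and marks each correct or incorrect. The multi-step score
    is the fraction of correct steps; the final score blends it with answer
    correctness using an assumed 0.5 weight.
    """
    judgments = judge(response)
    if not judgments:
        return float(final_answer_correct)
    multi_step = sum(j.correct for j in judgments) / len(judgments)
    return (1 - answer_weight) * multi_step + answer_weight * float(final_answer_correct)
```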
What carries the argument
The multi-version problem transformation that systematically varies the amount of textual information provided alongside the diagram to measure reliance on visual input.
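For orientation, a minimal sketch of the sample structure this design implies. The six version names are the ones the paper defines; the field layout and names are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Version(Enum):
    # Gradient from text-carried to diagram-carried information.
    TEXT_DOMINANT = "text dominant"
    TEXT_LITE = "text lite"
    TEXT_ONLY = "text only"            # diagram removed entirely
    VISION_INTENSIVE = "vision intensive"
    VISION_DOMINANT = "vision dominant"
    VISION_ONLY = "vision only"        # question rendered into the image

@dataclass
class Sample:
    problem_id: str                    # shared across the six versions of one problem
    version: Version
    question_text: str                 # shrinks as information migrates into the diagram
    diagram_path: Optional[str]        # None for the text-only version
    answer: str
```

Each source problem thus contributes six `Sample` rows that are meant to be mathematically interchangeable, which is exactly the premise examined below.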
Load-bearing premise
The six versions of each problem preserve the original mathematical intent and difficulty while only changing the distribution of information between text and diagram.
What would settle it
If MLLM accuracy stayed consistent even on the versions with the least textual information and the greatest dependence on the diagram, the benchmark would fail to show that models bypass the visuals; a sharp accuracy drop across that text-to-vision gradient is what would confirm the core claim.
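One way to operationalize this settling condition, as a hedged sketch: compute per-version accuracy and test whether stripping text away actually costs the model anything. The version names and the 0.05 gap threshold are illustrative:

```python
from collections import defaultdict

def version_accuracies(results):
    """results: iterable of (version_name, is_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for version, correct in results:
        totals[version] += 1
        hits[version] += int(correct)
    return {v: hits[v] / totals[v] for v in totals}

def shows_text_reliance(acc, text_version="text dominant",
                        vision_version="vision only", min_gap=0.05):
    """True if accuracy drops materially once the text is gone.

    A flat profile (gap below the illustrative `min_gap`) would mean the
    benchmark cannot demonstrate that models lean on text over diagrams.
    """
    return acc[text_version] - acc[vision_version] >= min_gap
```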
read the original abstract
The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing visual math benchmarks embed excessive textual information in questions, enabling MLLMs to deduce answers without genuinely interpreting diagrams. To enable equitable evaluation, the authors collect 2,612 high-quality multi-subject math problems with diagrams from public sources and have human annotators transform each into six versions that vary the amount of multi-modal information (textual vs. visual), yielding 15K test samples total. They further propose a Chain-of-Thought evaluation that uses GPT-4(V) to extract key reasoning steps and score them with error analysis rather than binary correctness.
Significance. If the version transformations are shown to preserve mathematical semantics and difficulty, MathVerse would provide a valuable, fine-grained benchmark for isolating and measuring visual diagram understanding in MLLMs. The multi-version design and step-level GPT-4 scoring could become a standard protocol for diagnosing whether models truly integrate visual and textual cues in mathematical reasoning, directly informing future architecture and training improvements.
major comments (2)
- [§3.2] Human Annotation and Version Transformation: The manuscript describes how each problem is transformed into six versions with differing textual/visual content, but it reports no quantitative validation, such as inter-annotator agreement scores, expert equivalence ratings, or solve-rate consistency on a held-out set, to confirm that core problem semantics, difficulty, and mathematical equivalence are preserved. Without these checks, performance gaps across versions could reflect annotation artifacts rather than genuine differences in visual understanding; such checks are sketched in code after the minor comments below.
- [§4.3] CoT Evaluation with GPT-4(V): The adaptive step-extraction and scoring procedure is presented as enabling fine-grained assessment, yet the paper provides insufficient detail on prompt templates, exact scoring rubrics, or any human-GPT agreement study. This weakens the claim that the method reliably reveals intermediate reasoning quality, since GPT-4(V) errors could systematically bias the reported insights.
minor comments (2)
- [§3.1] The selection criteria and filtering steps used to arrive at the final 2,612 problems from public sources are only briefly summarized; a table or paragraph detailing subject distribution, diagram complexity, and exclusion reasons would improve reproducibility.
- In the example figures illustrating the six versions, the visual differences between versions could be highlighted with explicit callouts or color coding to make the information-content gradient immediately clear to readers.
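For concreteness, here is a minimal sketch of the validation statistics requested in the first major comment, assuming per-item binary equivalence labels from two annotators and expert solve rates per version; the helper names are hypothetical and `statistics.correlation` needs Python 3.10+:

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on binary
    'this version preserves the original problem' judgments."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

def solve_rate_consistency(original_rates, transformed_rates):
    """Pearson correlation of expert solve rates on the original problems vs.
    one transformed version; a high value suggests difficulty was preserved."""
    return correlation(original_rates, transformed_rates)
```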
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights key areas for strengthening the validation and transparency of MathVerse. We address each major comment below and will incorporate the necessary additions in the revised manuscript.
read point-by-point responses
-
Referee: [§3.2] Human Annotation and Version Transformation: The manuscript describes how each problem is transformed into six versions with differing textual/visual content, but it reports no quantitative validation, such as inter-annotator agreement scores, expert equivalence ratings, or solve-rate consistency on a held-out set, to confirm that core problem semantics, difficulty, and mathematical equivalence are preserved. Without these checks, performance gaps across versions could reflect annotation artifacts rather than genuine differences in visual understanding.
Authors: We agree that quantitative validation would strengthen confidence in the version transformations. The manuscript details the annotation guidelines and process used by human annotators to derive the six versions through controlled, incremental removal of textual or visual information while aiming to preserve mathematical semantics and difficulty. However, we did not report inter-annotator agreement or equivalence metrics. In the revision, we will add inter-annotator agreement scores on a multi-annotated subset, expert equivalence ratings, and solve-rate consistency analysis on a held-out set to empirically confirm preservation of core problem properties. revision: yes
-
Referee: [§4.3] CoT Evaluation with GPT-4(V): The adaptive step-extraction and scoring procedure is presented as enabling fine-grained assessment, yet the paper provides insufficient detail on prompt templates, exact scoring rubrics, or any human-GPT agreement study. This weakens the claim that the method reliably reveals intermediate reasoning quality, since GPT-4(V) errors could systematically bias the reported insights.
Authors: We concur that greater detail on the evaluation procedure is warranted to support its reliability. The manuscript describes the adaptive use of GPT-4(V) for step extraction and error-annotated scoring as an alternative to binary correctness. In the revised version, we will include the full prompt templates, the precise scoring rubrics with error categories, and a human-GPT agreement study on a sampled set of responses to quantify alignment and address potential biases in the automated assessment. revision: yes
Circularity Check
No circularity: benchmark construction is externally grounded
full rationale
The paper collects 2,612 math problems from publicly available sources and applies explicit human annotation to produce six modality variants, yielding 15K samples. No equations, fitted parameters, or model predictions are defined; the evaluation strategy (CoT extraction via GPT-4(V)) operates on the tested models' outputs rather than on quantities derived from the paper's own results. The central claim, that performance differences across versions isolate visual understanding, rests on the annotation process itself, which is an independent construction step with no self-referential reduction or load-bearing self-citations. The derivation chain is therefore grounded in external benchmarks and sources rather than in the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Human annotators can produce six versions of each problem that differ only in the amount of visual versus textual information while preserving mathematical equivalence.
- domain assumption GPT-4(V) can accurately extract and score individual reasoning steps from MLLM outputs for error analysis.
Forward citations
Cited by 18 Pith papers
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
Structured Role-Aware Policy Optimization for Multimodal Reasoning
SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.
-
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
-
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
-
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
VLMs bypass visual comparison by recovering semantic labels for nameable entities and hallucinate on unnamable ones, as shown by performance gaps and Logit Lens analysis.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
Reference graph
Works this paper leans on
- [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
- [2] Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., Hajishirzi, H.: MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319 (2019)
- [3] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)
- [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems. pp. 1877–1901 (2020)
- [5] Cao, J., Xiao, J.: An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In: Proceedings of the 29th International Conference on Computational Linguistics. pp. 1511–1520 (2022)
- [6] Chen, G., Zheng, Y.D., Wang, J., Xu, J., Huang, Y., Pan, J., Wang, Y., Wang, Y., Qiao, Y., Lu, T., et al.: VideoLLM: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292 (2023)
- [8] Chen, J., Li, T., Qin, J., Lu, P., Lin, L., Chen, C., Liang, X.: UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. arXiv preprint arXiv:2212.02746 (2022)
- [10] Chen, J., Tang, J., Qin, J., Liang, X., Liu, L., Xing, E.P., Lin, L.: GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv preprint arXiv:2105.14517 (2021)
- [11] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
- [12] Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793 (2023)
- [13] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/ (March 2023)
- [14] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
- [15] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning (2023)
- [16] Dong, X., Zhang, P., Zang, Y., Cao, Y., Wang, B., Ouyang, L., Wei, X., Zhang, S., Duan, H., Cao, M., et al.: InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420 (2024)
- [17] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R.: MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
- [18] Fu, C., Zhang, R., Lin, H., Wang, Z., Gao, T., Luo, Y., Huang, Y., Zhang, Z., Qiu, L., Ye, G., et al.: A challenger to GPT-4V? Early explorations of Gemini in visual expertise. arXiv preprint arXiv:2312.12436 (2023)
- [19] Gao, J., Pi, R., Zhang, J., Ye, J., Zhong, W., Wang, Y., Hong, L., Han, J., Xu, H., Li, Z., et al.: G-LLaVA: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370 (2023)
- [20] Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., Li, H., Qiao, Y.: LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)
- [21] Gao, P., Zhang, R., Liu, C., Qiu, L., Huang, S., Lin, W., Zhao, S., Geng, S., Lin, Z., Jin, P., et al.: SPHINX-X: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935 (2024)
- [22] Gemini Team, Google: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
- [23] Guo, Z., Zhang, R., Zhu, X., Tang, Y., Ma, X., Han, J., Chen, K., Gao, P., Li, X., Li, H., et al.: Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023)
- [24] Han, J., Zhang, R., Shao, W., Gao, P., Xu, P., Xiao, H., Zhang, K., Liu, C., Wen, S., Guo, Z., et al.: ImageBind-LLM: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905 (2023)
- [25] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. In: Proceedings of the International Conference on Learning Representations (ICLR) (2021)
- [26] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the MATH dataset. NeurIPS (2021)
- [27] Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., Gan, C.: 3D-LLM: Injecting the 3D world into large language models. Advances in Neural Information Processing Systems 36 (2024)
- [28] Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., de Las Casas, D., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint (2024)
- [29] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
- [30] Li, B., Zhang, K., Zhang, H., Guo, D., Zhang, R., Li, F., Zhang, Y., Liu, Z., Li, C.: LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild. https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/ (2024)
- [31] Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., Liu, Z.: MIMIC-IT: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425 (2023)
- [32] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y., Liu, Z., Li, C.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)
- [33] Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125 (2023)
- [34] Li, F., Zhang, R., Zhang, H., Zhang, Y., Li, B., Li, W., Ma, Z., Li, C.: LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895 (2024)
- [35] Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
- [36] Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023)
- [37] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2023)
- [38] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/ (January 2024)
- [39] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
- [40] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)
- [41] Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)
- [42] Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S., Liang, X., Zhu, S.C.: Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165 (2021)
- [43] Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S., Liang, X., Zhu, S.C.: Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In: Annual Meeting of the Association for Computational Linguistics (2021)
- [44] Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., Zhang, D.: WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583 (2023)
- [45]
- [46] OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [47] OpenAI: GPT-4V(ision) system card (2023). https://openai.com/research/gpt-4v-system-card
- [48] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Gray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (2022)
- [49] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)
- [50] Roy, S., Roth, D.: Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413 (2016)
- [51] Seo, M., Hajishirzi, H., Farhadi, A., Etzioni, O., Malcolm, C.: Solving geometry problems: Combining text and diagram interpretation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 1466–1476 (2015)
- [52] Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: PandaGPT: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355 (2023)
- [53] Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: JourneyDB: A benchmark for generative image understanding. Advances in Neural Information Processing Systems 36 (2024)
- [54] InternLM Team: InternLM: A multilingual language model with progressively enhanced capabilities (2023)
- [55] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
- [56] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
- [57] Wang, K., Ren, H., Zhou, A., Lu, Z., Luo, S., Shi, W., Zhang, R., Song, L., Zhan, M., Li, H.: MathCoder: Seamless code integration in LLMs for enhanced mathematical reasoning. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=z8TW0ttBPp
- [58] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
- [59] Xu, P., Shao, W., Zhang, K., Gao, P., Liu, S., Lei, M., Meng, F., Huang, S., Qiao, Y., Luo, P.: LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265 (2023)
- [60] Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: PointLLM: Empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911 (2023)
- [61] Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., Jiang, C., Li, C., Xu, Y., Chen, H., Tian, J., Qian, Q., Zhang, J., Huang, F.: mPLUG-Owl: Modularization empowers large language models with multimodality (2023)
- [62] Ye, Q., Xu, H., Ye, J., Yan, M., Hu, A., Liu, H., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration (2023)
- [63] Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502 (2023)
- [64] Yue, X., Qu, X., Zhang, G., Fu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653 (2023)
- [65] Zhang, H., Li, X., Bing, L.: Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)
- [66] Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., Qiao, Y.: LLaMA-Adapter: Efficient fine-tuning of large language models with zero-initialized attention. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=d4UiXAHN2W
- [67] Zhang, R., Hu, X., Li, B., Huang, S., Deng, H., Li, H., Qiao, Y., Gao, P.: Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. CVPR (2023)
- [68] Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Dong, H., Gao, P., Li, H.: Personalize segment anything model with one shot. ICLR (2024)
- [69] Zhang, R., Wang, L., Qiao, Y., Gao, P., Li, H.: Learning 3D representations from 2D pre-trained models via image-to-point masked autoencoders. CVPR (2023)
- [70] Zhang, R., Wei, X., Jiang, D., Zhang, Y., Guo, Z., Tong, C., Liu, J., Zhou, A., Wei, B., Zhang, S., et al.: MAVIS: Mathematical visual instruction tuning. arXiv preprint arXiv:2407.08739 (2024)
- [71] Zhou, A., Wang, K., Lu, Z., Shi, W., Luo, S., Qin, Z., Lu, S., Jia, A., Song, L., Zhan, M., et al.: Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921 (2023)
- [72] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
discussion (0)