Recognition: 2 theorem links
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Pith reviewed 2026-05-17 01:22 UTC · model grok-4.3
The pith
MathVerse shows multi-modal LLMs often solve visual math problems using text rather than diagrams.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MathVerse collects 2,612 high-quality math problems with diagrams from public sources and transforms each into six versions with varying multi-modal information content for a total of 15K samples. This design enables an equitable evaluation of whether MLLMs truly understand visual diagrams when solving math problems. The benchmark further includes a Chain-of-Thought evaluation strategy that uses GPT-4(V) to extract key reasoning steps and provide detailed error analysis on intermediate outputs.
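As a concrete reading of that protocol, here is a minimal sketch of step-level CoT scoring in Python. The `judge` callable is a hypothetical stand-in for GPT-4(V), and the blend between the multi-step score and final-answer correctness is an illustrative assumption, since this summary does not specify the paper's aggregation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepJudgment:
    step: str         # one extracted reasoning step
    correct: bool     # judge's verdict on this step
    error_type: str   # e.g. "visual perception" or "reasoning"; categories illustrative

def cot_score(response: str,
              judge: Callable[[str], List[StepJudgment]],
              final_answer_correct: bool,
              answer_weight: float = 0.5) -> float:
    """Score a response by its intermediate steps rather than a bare True/False.

    `judge` stands in for GPT-4(V): it adaptively splits `response` into key
    reasoning steps and marks each correct or incorrect. The multi-step score
    is the fraction of correct steps; the final score blends it with answer
    correctness using an assumed 0.5 weight.
    """
    judgments = judge(response)
    if not judgments:
        return float(final_answer_correct)
    multi_step = sum(j.correct for j in judgments) / len(judgments)
    return (1 - answer_weight) * multi_step + answer_weight * float(final_answer_correct)
```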
What carries the argument
The multi-version problem transformation that systematically varies the amount of textual information provided alongside the diagram to measure reliance on visual input.
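For orientation, a minimal sketch of the sample structure this design implies. The six version names are the ones the paper defines; the field layout and names are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Version(Enum):
    # Gradient from text-carried to diagram-carried information.
    TEXT_DOMINANT = "text dominant"
    TEXT_LITE = "text lite"
    TEXT_ONLY = "text only"            # diagram removed entirely
    VISION_INTENSIVE = "vision intensive"
    VISION_DOMINANT = "vision dominant"
    VISION_ONLY = "vision only"        # question rendered into the image

@dataclass
class Sample:
    problem_id: str                    # shared across the six versions of one problem
    version: Version
    question_text: str                 # shrinks as information migrates into the diagram
    diagram_path: Optional[str]        # None for the text-only version
    answer: str
```

Each source problem thus contributes six `Sample` rows that are meant to be mathematically interchangeable, which is exactly the premise examined below.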
Load-bearing premise
The six versions of each problem preserve the original mathematical intent and difficulty while only changing the distribution of information between text and diagram.
What would settle it
If MLLM accuracy stayed consistent even on the versions with the least textual information and the greatest dependence on the diagram, the benchmark would fail to show that models bypass the visuals; a sharp accuracy drop across that text-to-vision gradient is what would confirm the core claim.
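One way to operationalize this settling condition, as a hedged sketch: compute per-version accuracy and test whether stripping text away actually costs the model anything. The version names and the 0.05 gap threshold are illustrative:

```python
from collections import defaultdict

def version_accuracies(results):
    """results: iterable of (version_name, is_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for version, correct in results:
        totals[version] += 1
        hits[version] += int(correct)
    return {v: hits[v] / totals[v] for v in totals}

def shows_text_reliance(acc, text_version="text dominant",
                        vision_version="vision only", min_gap=0.05):
    """True if accuracy drops materially once the text is gone.

    A flat profile (gap below the illustrative `min_gap`) would mean the
    benchmark cannot demonstrate that models lean on text over diagrams.
    """
    return acc[text_version] - acc[vision_version] >= min_gap
```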
read the original abstract
The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing visual math benchmarks embed excessive textual information in questions, enabling MLLMs to deduce answers without genuinely interpreting diagrams. To enable equitable evaluation, the authors collect 2,612 high-quality multi-subject math problems with diagrams from public sources and have human annotators transform each into six versions that vary the amount of multi-modal information (textual vs. visual), yielding 15K test samples total. They further propose a Chain-of-Thought evaluation that uses GPT-4(V) to extract key reasoning steps and score them with error analysis rather than binary correctness.
Significance. If the version transformations are shown to preserve mathematical semantics and difficulty, MathVerse would provide a valuable, fine-grained benchmark for isolating and measuring visual diagram understanding in MLLMs. The multi-version design and step-level GPT-4 scoring could become a standard protocol for diagnosing whether models truly integrate visual and textual cues in mathematical reasoning, directly informing future architecture and training improvements.
major comments (2)
- [§3.2] Human Annotation and Version Transformation: The manuscript describes how each problem is transformed into six versions with differing textual/visual content, but it reports no quantitative validation, such as inter-annotator agreement scores, expert equivalence ratings, or solve-rate consistency on a held-out set, to confirm that core problem semantics, difficulty, and mathematical equivalence are preserved. Without these checks, performance gaps across versions could reflect annotation artifacts rather than genuine differences in visual understanding; such checks are sketched in code after the minor comments below.
- [§4.3] CoT Evaluation with GPT-4(V): The adaptive step-extraction and scoring procedure is presented as enabling fine-grained assessment, yet the paper provides insufficient detail on prompt templates, exact scoring rubrics, or any human-GPT agreement study. This weakens the claim that the method reliably reveals intermediate reasoning quality, since GPT-4(V) errors could systematically bias the reported insights.
minor comments (2)
- [§3.1] The selection criteria and filtering steps used to arrive at the final 2,612 problems from public sources are only briefly summarized; a table or paragraph detailing subject distribution, diagram complexity, and exclusion reasons would improve reproducibility.
- In the example figures illustrating the six versions, the visual differences between versions could be highlighted with explicit callouts or color coding to make the information-content gradient immediately clear to readers.
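For concreteness, here is a minimal sketch of the validation statistics requested in the first major comment, assuming per-item binary equivalence labels from two annotators and expert solve rates per version; the helper names are hypothetical and `statistics.correlation` needs Python 3.10+:

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on binary
    'this version preserves the original problem' judgments."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

def solve_rate_consistency(original_rates, transformed_rates):
    """Pearson correlation of expert solve rates on the original problems vs.
    one transformed version; a high value suggests difficulty was preserved."""
    return correlation(original_rates, transformed_rates)
```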
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights key areas for strengthening the validation and transparency of MathVerse. We address each major comment below and will incorporate the necessary additions in the revised manuscript.
read point-by-point responses
-
Referee: [§3.2] Human Annotation and Version Transformation: The manuscript describes how each problem is transformed into six versions with differing textual/visual content, but it reports no quantitative validation, such as inter-annotator agreement scores, expert equivalence ratings, or solve-rate consistency on a held-out set, to confirm that core problem semantics, difficulty, and mathematical equivalence are preserved. Without these checks, performance gaps across versions could reflect annotation artifacts rather than genuine differences in visual understanding.
Authors: We agree that quantitative validation would strengthen confidence in the version transformations. The manuscript details the annotation guidelines and process used by human annotators to derive the six versions through controlled, incremental removal of textual or visual information while aiming to preserve mathematical semantics and difficulty. However, we did not report inter-annotator agreement or equivalence metrics. In the revision, we will add inter-annotator agreement scores on a multi-annotated subset, expert equivalence ratings, and solve-rate consistency analysis on a held-out set to empirically confirm preservation of core problem properties. revision: yes
-
Referee: [§4.3] CoT Evaluation with GPT-4(V): The adaptive step-extraction and scoring procedure is presented as enabling fine-grained assessment, yet the paper provides insufficient detail on prompt templates, exact scoring rubrics, or any human-GPT agreement study. This weakens the claim that the method reliably reveals intermediate reasoning quality, since GPT-4(V) errors could systematically bias the reported insights.
Authors: We concur that greater detail on the evaluation procedure is warranted to support its reliability. The manuscript describes the adaptive use of GPT-4(V) for step extraction and error-annotated scoring as an alternative to binary correctness. In the revised version, we will include the full prompt templates, the precise scoring rubrics with error categories, and a human-GPT agreement study on a sampled set of responses to quantify alignment and address potential biases in the automated assessment. revision: yes
Circularity Check
No circularity: benchmark construction is externally grounded
full rationale
The paper collects 2,612 math problems from publicly available sources and applies explicit human annotation to produce six modality variants, yielding 15K samples. No equations, fitted parameters, or model predictions are defined; the evaluation strategy (CoT extraction via GPT-4(V)) operates on the tested models' outputs rather than on quantities derived from the paper's own results. The central claim, that performance differences across versions isolate visual understanding, rests on the annotation process itself, which is an independent construction step with no self-referential reduction or load-bearing self-citations. The derivation chain is therefore grounded in external benchmarks and sources rather than in the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Human annotators can produce six versions of each problem that differ only in the amount of visual versus textual information while preserving mathematical equivalence.
- domain assumption GPT-4(V) can accurately extract and score individual reasoning steps from MLLM outputs for error analysis.
Forward citations
Cited by 18 Pith papers
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
Structured Role-Aware Policy Optimization for Multimodal Reasoning
SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.
-
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
-
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
-
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
VLMs bypass visual comparison by recovering semantic labels for nameable entities and hallucinate on unnamable ones, as shown by performance gaps and Logit Lens analysis.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
Reference graph
Works this paper leans on
- [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
- [2] Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., Hajishirzi, H.: MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319 (2019)
- [3] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)
- [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems. pp. 1877–1901 (2020)
- [5] Cao, J., Xiao, J.: An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In: Proceedings of the 29th International Conference on Computational Linguistics. pp. 1511–1520 (2022)
- [6] Chen, G., Zheng, Y.D., Wang, J., Xu, J., Huang, Y., Pan, J., Wang, Y., Wang, Y., Qiao, Y., Lu, T., et al.: VideoLLM: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292 (2023)
- [8] Chen, J., Li, T., Qin, J., Lu, P., Lin, L., Chen, C., Liang, X.: UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. arXiv preprint arXiv:2212.02746 (2022)
- [10] Chen, J., Tang, J., Qin, J., Liang, X., Liu, L., Xing, E.P., Lin, L.: GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv preprint arXiv:2105.14517 (2021)
- [11] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
- [12] Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793 (2023)
- [13] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/ (March 2023)
- [14] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
- [15] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning (2023)
- [16] Dong, X., Zhang, P., Zang, Y., Cao, Y., Wang, B., Ouyang, L., Wei, X., Zhang, S., Duan, H., Cao, M., et al.: InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420 (2024)
- [17] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R.: MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
- [18] Fu, C., Zhang, R., Lin, H., Wang, Z., Gao, T., Luo, Y., Huang, Y., Zhang, Z., Qiu, L., Ye, G., et al.: A challenger to GPT-4V? Early explorations of Gemini in visual expertise. arXiv preprint arXiv:2312.12436 (2023)
- [19] Gao, J., Pi, R., Zhang, J., Ye, J., Zhong, W., Wang, Y., Hong, L., Han, J., Xu, H., Li, Z., et al.: G-LLaVA: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370 (2023)
- [20] Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., Li, H., Qiao, Y.: LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)
- [21] Gao, P., Zhang, R., Liu, C., Qiu, L., Huang, S., Lin, W., Zhao, S., Geng, S., Lin, Z., Jin, P., et al.: SPHINX-X: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935 (2024)
- [22] Gemini Team, Google: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
- [23] Guo, Z., Zhang, R., Zhu, X., Tang, Y., Ma, X., Han, J., Chen, K., Gao, P., Li, X., Li, H., et al.: Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023)
- [24] Han, J., Zhang, R., Shao, W., Gao, P., Xu, P., Xiao, H., Zhang, K., Liu, C., Wen, S., Guo, Z., et al.: ImageBind-LLM: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905 (2023)
- [25] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. In: Proceedings of the International Conference on Learning Representations (ICLR) (2021)
- [26] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the MATH dataset. NeurIPS (2021)
- [27] Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., Gan, C.: 3D-LLM: Injecting the 3D world into large language models. Advances in Neural Information Processing Systems 36 (2024)
- [28] Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., de Las Casas, D., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint (2024)
- [29] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
- [30] Li, B., Zhang, K., Zhang, H., Guo, D., Zhang, R., Li, F., Zhang, Y., Liu, Z., Li, C.: LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild. https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/ (2024)
- [31] Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., Liu, Z.: MIMIC-IT: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425 (2023)
- [32] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y., Liu, Z., Li, C.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)
- [33] Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125 (2023)
- [34] Li, F., Zhang, R., Zhang, H., Zhang, Y., Li, B., Li, W., Ma, Z., Li, C.: LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895 (2024)
- [35] Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
- [36] Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023)
- [37] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2023)
- [38] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/ (January 2024)
- [39] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
- [40] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)
- [41] Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)
- [42] Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S., Liang, X., Zhu, S.C.: Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165 (2021)
- [43] Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S., Liang, X., Zhu, S.C.: Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In: Annual Meeting of the Association for Computational Linguistics (2021)
- [44] Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., Zhang, D.: WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583 (2023)
- [45]
- [46] OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [47] OpenAI: GPT-4V(ision) system card (2023). https://openai.com/research/gpt-4v-system-card
- [48] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Gray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (2022)
- [49] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)
- [50] Roy, S., Roth, D.: Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413 (2016)
- [51] Seo, M., Hajishirzi, H., Farhadi, A., Etzioni, O., Malcolm, C.: Solving geometry problems: Combining text and diagram interpretation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 1466–1476 (2015)
- [52] Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: PandaGPT: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355 (2023)
- [53] Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: JourneyDB: A benchmark for generative image understanding. Advances in Neural Information Processing Systems 36 (2024)
- [54] InternLM Team: InternLM: A multilingual language model with progressively enhanced capabilities (2023)
- [55] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
- [56] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
- [57] Wang, K., Ren, H., Zhou, A., Lu, Z., Luo, S., Shi, W., Zhang, R., Song, L., Zhan, M., Li, H.: MathCoder: Seamless code integration in LLMs for enhanced mathematical reasoning. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=z8TW0ttBPp
- [58] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
- [59] Xu, P., Shao, W., Zhang, K., Gao, P., Liu, S., Lei, M., Meng, F., Huang, S., Qiao, Y., Luo, P.: LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265 (2023)
- [60] Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: PointLLM: Empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911 (2023)
- [61] Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., Jiang, C., Li, C., Xu, Y., Chen, H., Tian, J., Qian, Q., Zhang, J., Huang, F.: mPLUG-Owl: Modularization empowers large language models with multimodality (2023)
- [62] Ye, Q., Xu, H., Ye, J., Yan, M., Hu, A., Liu, H., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration (2023)
- [63] Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502 (2023)
- [64] Yue, X., Qu, X., Zhang, G., Fu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653 (2023)
- [65] Zhang, H., Li, X., Bing, L.: Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)
- [66] Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., Qiao, Y.: LLaMA-Adapter: Efficient fine-tuning of large language models with zero-initialized attention. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=d4UiXAHN2W
- [67] Zhang, R., Hu, X., Li, B., Huang, S., Deng, H., Li, H., Qiao, Y., Gao, P.: Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. CVPR (2023)
- [68] Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Dong, H., Gao, P., Li, H.: Personalize segment anything model with one shot. ICLR (2024)
- [69] Zhang, R., Wang, L., Qiao, Y., Gao, P., Li, H.: Learning 3D representations from 2D pre-trained models via image-to-point masked autoencoders. CVPR (2023)
- [70] Zhang, R., Wei, X., Jiang, D., Zhang, Y., Guo, Z., Tong, C., Liu, J., Zhou, A., Wei, B., Zhang, S., et al.: MAVIS: Mathematical visual instruction tuning. arXiv preprint arXiv:2407.08739 (2024)
- [71] Zhou, A., Wang, K., Lu, Z., Shi, W., Luo, S., Qin, Z., Lu, S., Jia, A., Song, L., Zhan, M., et al.: Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921 (2023)
- [72] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
discussion (0)