R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Recognition: 2 theorem links
Pith reviewed 2026-05-16 00:14 UTC · model grok-4.3
The pith
Converting images to formal textual representations lets a new model reason more precisely about visual content and outperform GPT-4o on multimodal benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
R1-Onevision achieves state-of-the-art results by applying a cross-modal reasoning pipeline that converts images into formal textual representations, enabling precise language-based reasoning. The same pipeline supports construction of a large annotated dataset and training via supervised fine-tuning followed by reinforcement learning, yielding performance superior to GPT-4o and Qwen2.5-VL across multiple challenging multimodal benchmarks, including the new R1-Onevision-Bench, which is aligned with educational stages.
What carries the argument
The cross-modal reasoning pipeline that transforms images into formal textual representations for subsequent language-based reasoning.
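The paper does not spell out the transformation at this point, but the kind of formalization at stake can be sketched in a few lines of Python. This is a purely hypothetical illustration, not the paper's implementation: the `Element` record and `formalize` function are invented names, and the idea shown is only that detected visual elements get serialized into line-per-fact formal text that a language model can then reason over.

```python
from dataclasses import dataclass

@dataclass
class Element:
    """One detected visual element (hypothetical schema, not from the paper)."""
    kind: str        # e.g. "square", "label"
    attributes: dict  # e.g. {"side": 6, "shaded": False}

def formalize(elements: list[Element]) -> str:
    """Render detected elements as one formal-text fact per line."""
    lines = []
    for i, el in enumerate(elements):
        # Sort attributes so the rendering is deterministic.
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(el.attributes.items()))
        lines.append(f"object_{i}: {el.kind}({attrs})")
    return "\n".join(lines)

# Toy scene: a large unshaded square containing a smaller shaded one.
scene = [
    Element("square", {"side": 6, "shaded": False}),
    Element("square", {"side": 3, "shaded": True}),
]
print(formalize(scene))
# object_0: square(shaded=False, side=6)
# object_1: square(shaded=True, side=3)
```

Once the scene is in this text-only form, any language model can operate on it; whether such a rendering preserves everything reasoning needs is exactly the load-bearing premise discussed below.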
If this is right
- The model generalizes across domains from junior high school to university-level exam questions.
- Step-by-step textual reasoning traces produced by the pipeline improve both accuracy and interpretability compared with direct vision-language baselines.
- Reinforcement learning applied after supervised fine-tuning further strengthens robustness on out-of-distribution multimodal problems.
- The new R1-Onevision-Bench provides a graded test suite that measures reasoning capability by educational stage.
Where Pith is reading between the lines
- The same image-to-formal-text conversion could be applied to video sequences or 3D scenes to support temporal or spatial reasoning.
- If the formal representations prove lossless for most tasks, they could serve as a common intermediate language linking multiple input modalities.
- Educational tools might use the generated reasoning traces to produce transparent explanations for students at different grade levels.
Load-bearing premise
Converting an image into a formal textual representation preserves all critical visual information needed for accurate reasoning.
What would settle it
A direct comparison in which the same base model is run once with the formal text pipeline and once with raw image input on tasks that require fine-grained visual details, such as exact spatial counting or subtle pattern recognition, showing no accuracy gain or a loss for the pipeline version.
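Such a matched-condition comparison can be expressed as a small evaluation harness. The sketch below is hedged: the model, tasks, and field names are toy stand-ins invented for illustration; nothing here comes from the paper.

```python
def evaluate(model, tasks, use_pipeline: bool) -> float:
    """Accuracy of the same model on formal-text vs. raw-image inputs."""
    correct = 0
    for task in tasks:
        inp = task["formal_text"] if use_pipeline else task["raw_image"]
        if model(inp) == task["answer"]:
            correct += 1
    return correct / len(tasks)

# Toy fine-grained counting tasks with both representations available.
tasks = [
    {"formal_text": "count: 7", "raw_image": "IMG_7", "answer": 7},
    {"formal_text": "count: 3", "raw_image": "IMG_3", "answer": 3},
]

def toy_model(inp: str) -> int:
    # Stand-in model that reads the count out of either representation.
    if inp.startswith("IMG"):
        return int(inp.split("_")[-1])
    return int(inp.split(": ")[-1])

acc_pipeline = evaluate(toy_model, tasks, use_pipeline=True)
acc_raw = evaluate(toy_model, tasks, use_pipeline=False)
# No accuracy gain (or a loss) for the pipeline condition on such tasks
# would argue against the information-preservation premise.
print(acc_pipeline, acc_raw)
```

The point of the harness is the control: same base model, same tasks, only the input representation varies.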
read the original abstract
Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason about visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces R1-Onevision, a multimodal reasoning model that uses a cross-modal pipeline to transform images into formal textual representations, enabling precise language-based reasoning. It constructs the R1-Onevision dataset with detailed step-by-step multimodal annotations across domains, trains the model via supervised fine-tuning followed by reinforcement learning, and introduces R1-Onevision-Bench, a new benchmark aligned with human educational stages from junior high school through university level. The central claim is that this yields state-of-the-art performance, outperforming GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
Significance. If the performance claims and pipeline validity are substantiated with rigorous experiments, the cross-modal formalization approach could meaningfully advance multimodal reasoning by converting visual input into structured text that supports reliable step-by-step inference and better generalization. The education-stage benchmark is a constructive addition for evaluating reasoning progression. No machine-checked proofs, reproducible code releases, or parameter-free derivations are described, so no credit can be given for such strengths.
major comments (2)
- [Abstract] Abstract: the assertion that 'Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL' is unsupported because the manuscript contains no quantitative tables, ablation studies, error analysis, or matched-condition comparisons; this directly undermines the central performance claim.
- [Method] Cross-modal reasoning pipeline description: no implementation details, pseudocode, or validation experiments are provided for how images are transformed into formal textual representations or for confirming that critical visual information is preserved without loss; this is load-bearing for the claim that the pipeline enables precise reasoning.
minor comments (2)
- [Dataset] The description of the R1-Onevision dataset would benefit from explicit statistics on domain coverage, annotation length, and example instances to allow reproducibility assessment.
- [Method] Notation for the formal textual representation step is introduced without a clear diagram or formal definition, which could be clarified for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and methodological details.
read point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that 'Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL' is unsupported because the manuscript contains no quantitative tables, ablation studies, error analysis, or matched-condition comparisons; this directly undermines the central performance claim.
Authors: We acknowledge the referee's concern. While the manuscript includes an experiments section with performance comparisons, we agree that the current version lacks sufficient quantitative tables, ablation studies, error analysis, and explicit matched-condition comparisons to fully substantiate the SOTA claim in the abstract. In the revised manuscript, we will add detailed tables reporting exact metrics on R1-Onevision-Bench and additional multimodal reasoning benchmarks, include ablation studies isolating the cross-modal formalization and RL components, and provide error analysis with direct side-by-side comparisons to GPT-4o and Qwen2.5-VL. revision: yes
-
Referee: [Method] Cross-modal reasoning pipeline description: no implementation details, pseudocode, or validation experiments are provided for how images are transformed into formal textual representations or for confirming that critical visual information is preserved without loss; this is load-bearing for the claim that the pipeline enables precise reasoning.
Authors: We agree that additional details are required for reproducibility and to validate the pipeline's effectiveness. In the revised version, we will expand the method section with concrete implementation details on the image-to-formal-text transformation (including the structured representation format and extraction rules), provide pseudocode for the full cross-modal pipeline, and add validation experiments such as quantitative information-preservation metrics and human evaluations confirming that critical visual elements are retained without loss. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper's core argument consists of proposing a cross-modal pipeline to convert images to formal text, using that pipeline to annotate a new dataset, training via SFT+RL, and evaluating on a newly introduced educational-stage benchmark. These steps are constructive and empirical; the SOTA performance claims rest on experimental comparisons rather than any equation or claim that reduces by construction to fitted inputs, self-citations, or renamed prior results. No load-bearing derivation equates a prediction to its own training signal or invokes an unverified uniqueness theorem from the same authors. The work is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel (tagged unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
-
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
-
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
-
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution
CharTide decouples chart-to-code data into three perspectives and uses inquiry-driven RL with atomic QA verification to let smaller VLMs surpass GPT-4o on chart-to-code tasks.
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units
SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
-
Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
Longer textual reasoning chains degrade MLLM accuracy on fine-grained visual tasks; a new normalization and constrained-reward training framework mitigates the effect and sets new SOTA numbers.
-
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
-
Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation
A new CoVQD-guided retrieval-augmented generation framework improves multimodal LLMs on visual question answering by using structured reasoning to retrieve better external knowledge.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
-
[2]
Large language models for mathematical reasoning: Progresses and challenges
Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024.
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
-
[4]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
-
[5]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
-
[6]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
-
[7]
OpenCompass: A universal evaluation platform for foundation models
OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models, 2023.
-
[8]
CRUXEval: A benchmark for code reasoning, understanding and execution
Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. CRUXEval: A benchmark for code reasoning, understanding and execution. In International Conference on Machine Learning, 2024.
-
[9]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
-
[10]
Mammoth-VL: Eliciting multimodal reasoning with instruction tuning at scale
Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-VL: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237, 2024.
-
[11]
Measuring mathematical problem solving with the MATH dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
-
[12]
Towards reasoning in large language models: A survey
Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022.
-
[13]
GQA: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
-
[14]
CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, 2016.
-
[15]
FigureQA: An annotated figure dataset for visual reasoning
Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. In International Conference on Learning Representations Workshop Track, 2018.
-
[16]
Large language models are zero-shot reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
-
[17]
LLaVA-OneVision: Easy visual task transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. Transactions on Machine Learning Research.
-
[18]
MathVista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations, 2024.
-
[19]
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
AI Meta. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Meta AI Blog, 2024.
- [20]
-
[21]
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, et al. Humanity's last exam. arXiv preprint arXiv:2501.14249, 2025.
-
[22]
Reasoning with large language models, a survey
Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Back. Reasoning with large language models, a survey. arXiv preprint arXiv:2407.11511, 2024.
-
[23]
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024.
-
[24]
ZeroBench: An impossible visual benchmark for contemporary large multimodal models
Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, et al. ZeroBench: An impossible visual benchmark for contemporary large multimodal models. arXiv preprint arXiv:2502.09696, 2025.
-
[25]
LlamaV-o1: Rethinking step-by-step visual reasoning in LLMs
Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. LlamaV-o1: Rethinking step-by-step visual reasoning in LLMs. arXiv preprint arXiv:2501.06186, 2025.
-
[26]
Measuring multimodal mathematical reasoning with math-vision dataset
Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2025.
-
[27]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
-
[28]
MMLU-Pro: A more robust and challenging multi-task language understanding benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems, 2024.
-
[29]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
-
[30]
Large language models are better reasoners with self-verification
Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. arXiv preprint arXiv:2212.09561, 2022.
-
[31]
DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding
Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024.
-
[32]
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024.
-
[33]
LLaVA-CoT: Let vision language models reason step-by-step
Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step, 2025.
-
[34]
Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. arXiv preprint arXiv:2412.18319, 2024.
-
[35]
Tree of thoughts: Deliberate problem solving with large language models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023.
-
[36]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800, 2024.
-
[37]
RAVEN: A dataset for relational and analogical visual reasoning
Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. RAVEN: A dataset for relational and analogical visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5317–5327, 2019.
-
[38]
Fengji Zhang, Linquan Wu, Huiyu Bai, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, and Jacky Keung. HumanEval-V: Evaluating visual understanding and reasoning abilities of large multimodal models through coding tasks. arXiv preprint arXiv:2410.12381, 2024.
-
[39]
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, Peng Gao, and Hongsheng Li. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186, 2024.
-
[40]
Cumulative Reasoning with Large Language Models
Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi-Chih Yao. Cumulative reasoning with large language models. arXiv preprint arXiv:2308.04371, 2023.
-
[41]
Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024.