Recognition: 2 theorem links · Lean Theorem
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Pith reviewed 2026-05-16 15:11 UTC · model grok-4.3
The pith
Strengthening reasoning on text-only data first and then transferring it to images improves 3B multimodal models without requiring costly high-quality multimodal training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LMM-R1 first strengthens reasoning abilities using text-only data with rule-based RL in the Foundational Reasoning Enhancement stage, then generalizes these capabilities to multimodal domains in the Multimodal Generalization Training stage, achieving 4.83 percent and 4.5 percent average improvements over baselines in multimodal and text-only benchmarks respectively.
What carries the argument
Two-stage LMM-R1 framework with Foundational Reasoning Enhancement (FRE) on text-only rule-based RL followed by Multimodal Generalization Training (MGT) to transfer reasoning to visual-text inputs.
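To make the two-stage shape concrete, here is a minimal sketch of how FRE and MGT might compose around a generic rule-based-RL trainer. The `rl_train` callable, the dataset arguments, and the answer-extraction pattern are illustrative placeholders, not the paper's actual code or API.

```python
import re

def extract_final_answer(response: str) -> str:
    """Pull the final answer from a generated response; assumes an 'Answer: ...' line."""
    match = re.search(r"Answer:\s*(.+)", response)
    return match.group(1).strip() if match else response.strip()

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Verifiable binary reward: 1.0 iff the extracted answer exactly matches the reference."""
    return 1.0 if extract_final_answer(response) == gold_answer.strip() else 0.0

def train_lmm_r1(base_model, text_only_data, multimodal_data, rl_train):
    """Two-stage recipe: text-only FRE first, then MGT starting from the FRE checkpoint."""
    # Stage 1: Foundational Reasoning Enhancement (FRE) with rule-based RL on text-only data.
    fre_model = rl_train(base_model, text_only_data, reward_fn=rule_based_reward)
    # Stage 2: Multimodal Generalization Training (MGT) carries the strengthened
    # reasoning over to image-text inputs, keeping the same kind of verifiable reward.
    return rl_train(fre_model, multimodal_data, reward_fn=rule_based_reward)
```

The only structural point of the sketch is the ordering: MGT starts from the FRE checkpoint rather than from the base model, which is what allows the text-only reasoning gains to carry over.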
If this is right
- Small 3B LMMs reach higher accuracy on complex multimodal reasoning tasks such as Football Game.
- Rule-based RL developed in text domains can be adapted for multimodal use through sequential stages.
- Development of capable small multimodal models becomes feasible with less reliance on expensive multimodal datasets.
- Foundational text reasoning gains carry over to multimodal settings while preserving baseline text performance.
Where Pith is reading between the lines
- Abundant text corpora could serve as a scalable bootstrap for multimodal reasoning in resource-limited settings.
- The staged method might extend to other modalities such as video if the transfer property holds beyond static images.
- Similar separation of text strengthening and multimodal application could be tested with non-rule-based RL variants.
- If transfer works reliably, it lowers barriers for smaller research groups to build reasoning-focused LMMs.
Load-bearing premise
Reasoning skills strengthened on text data will transfer to multimodal tasks without being substantially disrupted by visual perception components.
What would settle it
Running multimodal training directly on the same base model without the preceding text-only FRE stage and obtaining equal or higher benchmark scores would show the two-stage separation is not required.
Original abstract
Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LMM-R1, a two-stage framework for improving reasoning in 3B-parameter LMMs: Foundational Reasoning Enhancement (FRE) applies rule-based RL on text-only data to strengthen core reasoning, followed by Multimodal Generalization Training (MGT) to transfer those skills to multimodal inputs. Experiments on Qwen2.5-VL-Instruct-3B report average gains of 4.83% on multimodal benchmarks and 4.5% on text-only benchmarks over baselines, plus 3.63% on complex Football Game tasks, positioning the method as a data-efficient alternative that avoids expensive high-quality multimodal training data.
Significance. If substantiated, the work offers a practical route to bootstrap multimodal reasoning in compact LMMs by leveraging abundant text-only RL signals, which could lower data costs and address capacity limits in small architectures. The empirical pipeline is straightforward and avoids circular parameter fitting, but the modest reported gains and absence of isolating controls currently limit the strength of the transfer claim.
major comments (3)
- [Abstract and Experiments] Abstract and experimental results: the claimed 4.83% multimodal and 4.5% text-only average improvements are presented without defining the baselines, reporting error bars, statistical significance tests, or details on task selection and data filtering. This prevents assessment of whether the gains are robust or sensitive to post-hoc choices.
- [Method and Experiments] Method and Experiments: the load-bearing claim that FRE (text-only rule-based RL) produces transferable reasoning skills that MGT then generalizes to multimodal inputs without degradation or visual interference is unsupported by ablations. No results compare MGT alone versus the full FRE+MGT pipeline, nor provide pre/post-FRE multimodal reasoning probes, so the contribution of the first stage cannot be isolated from MGT training itself.
- [Method] Method: although the paper notes answer ambiguity as a barrier for multimodal rule-based RL, it supplies no concrete definition or computation details for the reward function used in the MGT stage. This is essential for verifying that the RL process remains well-defined and stable under multimodal inputs.
minor comments (2)
- [Experiments] Clarify the exact set of multimodal and text-only benchmarks used, including any filtering criteria, so that the reported averages can be reproduced.
- [Method] Provide the precise formulation of the rule-based reward (e.g., exact matching criteria or partial-credit rules) for both FRE and MGT stages.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped us clarify key aspects of our work. We address each major comment point by point below and have revised the manuscript to incorporate additional details, ablations, and explanations as suggested.
Point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and experimental results: the claimed 4.83% multimodal and 4.5% text-only average improvements are presented without defining the baselines, reporting error bars, statistical significance tests, or details on task selection and data filtering. This prevents assessment of whether the gains are robust or sensitive to post-hoc choices.
Authors: We agree that these details are required for proper evaluation. In the revised manuscript, we explicitly define all baselines (the base Qwen2.5-VL-Instruct-3B and other compared models), report error bars as standard deviations from three random seeds, include paired t-test results with p-values for statistical significance, and add a dedicated paragraph in Section 4 detailing task selection criteria and data filtering procedures. A summary table of these settings has also been added. revision: yes
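As a rough illustration of the analysis this response commits to, the sketch below computes a mean and standard deviation over three seeds and a paired t-test across seeds; the per-seed scores are placeholder numbers, not results from the paper.

```python
import statistics
from scipy import stats

# Placeholder per-seed benchmark averages (three seeds per configuration).
baseline_scores = [52.1, 51.8, 52.4]
lmm_r1_scores = [56.7, 56.9, 57.3]

for name, scores in [("baseline", baseline_scores), ("LMM-R1", lmm_r1_scores)]:
    print(f"{name}: {statistics.mean(scores):.2f} +/- {statistics.stdev(scores):.2f}")

# Paired t-test across seeds: how unlikely is this gap under the null of no difference?
t_stat, p_value = stats.ttest_rel(lmm_r1_scores, baseline_scores)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```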
-
Referee: [Method and Experiments] Method and Experiments: the load-bearing claim that FRE (text-only rule-based RL) produces transferable reasoning skills that MGT then generalizes to multimodal inputs without degradation or visual interference is unsupported by ablations. No results compare MGT alone versus the full FRE+MGT pipeline, nor provide pre/post-FRE multimodal reasoning probes, so the contribution of the first stage cannot be isolated from MGT training itself.
Authors: We acknowledge the need to isolate the FRE contribution. We have run additional experiments comparing (i) MGT alone on the base model, (ii) multimodal performance immediately before and after FRE, and (iii) the full FRE+MGT pipeline. These results, now included in the revised Experiments section, show that FRE yields measurable multimodal gains and that the combined pipeline outperforms MGT alone with no degradation, thereby supporting the transfer claim. revision: yes
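A minimal sketch of the ablation grid described above, using hypothetical `rl_train` and `evaluate` callables in place of the actual training and benchmarking code; only the compared conditions come from the response, the helper names are assumptions.

```python
def run_fre_ablation(base_model, text_data, multimodal_data, rl_train, evaluate):
    """Compare MGT alone, FRE alone, and the full FRE+MGT pipeline against the base model."""
    results = {"base model": evaluate(base_model)}

    # (i) MGT alone: multimodal RL applied directly to the base model, no text-only stage.
    results["MGT only"] = evaluate(rl_train(base_model, multimodal_data))

    # (ii) FRE only: probe multimodal benchmarks immediately after the text-only stage.
    fre_model = rl_train(base_model, text_data)
    results["FRE only"] = evaluate(fre_model)

    # (iii) Full pipeline: MGT initialized from the FRE checkpoint.
    results["FRE + MGT"] = evaluate(rl_train(fre_model, multimodal_data))

    # The transfer claim needs FRE+MGT > MGT only, with FRE only showing
    # no multimodal degradation relative to the base model.
    return results
```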
-
Referee: [Method] Method: although the paper notes answer ambiguity as a barrier for multimodal rule-based RL, it supplies no concrete definition or computation details for the reward function used in the MGT stage. This is essential for verifying that the RL process remains well-defined and stable under multimodal inputs.
Authors: We apologize for the missing details. The revised manuscript now contains a new subsection (3.3) that fully specifies the MGT reward function: it applies rule-based verification to the extracted final answer, using exact string matching augmented by a confidence threshold inherited from the FRE stage to mitigate ambiguity. The exact formula, input preprocessing steps for multimodal cases, and stability safeguards are provided to allow verification of the RL process. revision: yes
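Since the revised reward formula is only summarized here, the following is a hedged sketch in the spirit of that summary: rule-based verification of an extracted final answer by exact string matching, gated by a confidence-style threshold for ambiguous items. The answer-format pattern, the externally supplied confidence score, and the threshold logic are assumptions for illustration, not the paper's exact definition.

```python
import re

def mgt_reward(response: str, gold_answer: str,
               answer_confidence: float = 1.0, threshold: float = 0.5) -> float:
    """Illustrative MGT reward: exact-match verification gated by a confidence threshold."""
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return 0.0  # format violation: no parsable final answer
    if answer_confidence < threshold:
        return 0.0  # ambiguous item: withheld from positive reward
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0
```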
Circularity Check
No circularity in empirical two-stage training pipeline
Full rationale
The paper presents a purely empirical two-stage training procedure (FRE on text-only data followed by MGT on multimodal data) with no equations, derivations, or fitted parameters that reduce the reported benchmark gains to quantities defined inside the same work. Performance is measured on external benchmarks after training, and the central claim of text-to-multimodal transfer rests on experimental outcomes rather than any self-definitional, self-citation load-bearing, or ansatz-smuggling step. No uniqueness theorems or renamings of known results are invoked in a circular manner.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Rule-based RL can strengthen foundational reasoning abilities when applied to text-only data.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: "two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT)"
-
IndisputableMonolith/Foundation/ArrowOfTime.lean · z_monotone_absolute · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: "response length trends ... FRE-Text demonstrates rapid growth in response length ... FRE-Multi shows a consistent downward trend"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...
-
Improving Vision-language Models with Perception-centric Process Reward Models
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
-
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.
-
Latent Visual Reasoning
Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
-
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
-
Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification
ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing int...
-
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
-
AdaTooler-V: Adaptive Tool-Use for Images and Videos
AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
-
SPHINX: A Synthetic Environment for Visual Perception and Reasoning
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
Quantifying and Understanding Uncertainty in Large Reasoning Models
Conformal prediction adapted to reasoning-answer pairs in LRMs yields distribution-free uncertainty sets with finite-sample guarantees, paired with a Shapley explanation method that isolates provably sufficient traini...
-
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
-
Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
Longer textual reasoning chains degrade MLLM accuracy on fine-grained visual tasks; a new normalization and constrained-reward training framework mitigates the effect and sets new SOTA numbers.
-
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.