Recognition: 2 theorem links · Lean Theorem
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Pith reviewed 2026-05-16 15:11 UTC · model grok-4.3
The pith
Strengthening reasoning on text-only data first and then transferring it to images improves 3B multimodal models without requiring costly high-quality multimodal training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LMM-R1 first strengthens reasoning abilities using text-only data with rule-based RL in the Foundational Reasoning Enhancement stage, then generalizes these capabilities to multimodal domains in the Multimodal Generalization Training stage, achieving 4.83 percent and 4.5 percent average improvements over baselines in multimodal and text-only benchmarks respectively.
What carries the argument
Two-stage LMM-R1 framework with Foundational Reasoning Enhancement (FRE) on text-only rule-based RL followed by Multimodal Generalization Training (MGT) to transfer reasoning to visual-text inputs.
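To make the two-stage shape concrete, here is a minimal sketch of how FRE and MGT might compose around a generic rule-based-RL trainer. The `rl_train` callable, the dataset arguments, and the answer-extraction pattern are illustrative placeholders, not the paper's actual code or API.

```python
import re

def extract_final_answer(response: str) -> str:
    """Pull the final answer from a generated response; assumes an 'Answer: ...' line."""
    match = re.search(r"Answer:\s*(.+)", response)
    return match.group(1).strip() if match else response.strip()

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Verifiable binary reward: 1.0 iff the extracted answer exactly matches the reference."""
    return 1.0 if extract_final_answer(response) == gold_answer.strip() else 0.0

def train_lmm_r1(base_model, text_only_data, multimodal_data, rl_train):
    """Two-stage recipe: text-only FRE first, then MGT starting from the FRE checkpoint."""
    # Stage 1: Foundational Reasoning Enhancement (FRE) with rule-based RL on text-only data.
    fre_model = rl_train(base_model, text_only_data, reward_fn=rule_based_reward)
    # Stage 2: Multimodal Generalization Training (MGT) carries the strengthened
    # reasoning over to image-text inputs, keeping the same kind of verifiable reward.
    return rl_train(fre_model, multimodal_data, reward_fn=rule_based_reward)
```

The only structural point of the sketch is the ordering: MGT starts from the FRE checkpoint rather than from the base model, which is what allows the text-only reasoning gains to carry over.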
If this is right
- Small 3B LMMs reach higher accuracy on complex multimodal reasoning tasks such as Football Game.
- Rule-based RL developed in text domains can be adapted for multimodal use through sequential stages.
- Development of capable small multimodal models becomes feasible with less reliance on expensive multimodal datasets.
- Foundational text reasoning gains carry over to multimodal settings while preserving baseline text performance.
Where Pith is reading between the lines
- Abundant text corpora could serve as a scalable bootstrap for multimodal reasoning in resource-limited settings.
- The staged method might extend to other modalities such as video if the transfer property holds beyond static images.
- Similar separation of text strengthening and multimodal application could be tested with non-rule-based RL variants.
- If transfer works reliably, it lowers barriers for smaller research groups to build reasoning-focused LMMs.
Load-bearing premise
Reasoning skills strengthened on text data will transfer to multimodal tasks without being substantially disrupted by visual perception components.
What would settle it
Running multimodal training directly on the same base model without the preceding text-only FRE stage and obtaining equal or higher benchmark scores would show the two-stage separation is not required.
Original abstract
Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LMM-R1, a two-stage framework for improving reasoning in 3B-parameter LMMs: Foundational Reasoning Enhancement (FRE) applies rule-based RL on text-only data to strengthen core reasoning, followed by Multimodal Generalization Training (MGT) to transfer those skills to multimodal inputs. Experiments on Qwen2.5-VL-Instruct-3B report average gains of 4.83% on multimodal benchmarks and 4.5% on text-only benchmarks over baselines, plus 3.63% on complex Football Game tasks, positioning the method as a data-efficient alternative that avoids expensive high-quality multimodal training data.
Significance. If substantiated, the work offers a practical route to bootstrap multimodal reasoning in compact LMMs by leveraging abundant text-only RL signals, which could lower data costs and address capacity limits in small architectures. The empirical pipeline is straightforward and avoids circular parameter fitting, but the modest reported gains and absence of isolating controls currently limit the strength of the transfer claim.
major comments (3)
- [Abstract and Experiments] Abstract and experimental results: the claimed 4.83% multimodal and 4.5% text-only average improvements are presented without defining the baselines, reporting error bars, statistical significance tests, or details on task selection and data filtering. This prevents assessment of whether the gains are robust or sensitive to post-hoc choices.
- [Method and Experiments] Method and Experiments: the load-bearing claim that FRE (text-only rule-based RL) produces transferable reasoning skills that MGT then generalizes to multimodal inputs without degradation or visual interference is unsupported by ablations. No results compare MGT alone versus the full FRE+MGT pipeline, nor provide pre/post-FRE multimodal reasoning probes, so the contribution of the first stage cannot be isolated from MGT training itself.
- [Method] Method: although the paper notes answer ambiguity as a barrier for multimodal rule-based RL, it supplies no concrete definition or computation details for the reward function used in the MGT stage. This is essential for verifying that the RL process remains well-defined and stable under multimodal inputs.
minor comments (2)
- [Experiments] Clarify the exact set of multimodal and text-only benchmarks used, including any filtering criteria, so that the reported averages can be reproduced.
- [Method] Provide the precise formulation of the rule-based reward (e.g., exact matching criteria or partial-credit rules) for both FRE and MGT stages.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped us clarify key aspects of our work. We address each major comment point by point below and have revised the manuscript to incorporate additional details, ablations, and explanations as suggested.
Point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and experimental results: the claimed 4.83% multimodal and 4.5% text-only average improvements are presented without defining the baselines, reporting error bars, statistical significance tests, or details on task selection and data filtering. This prevents assessment of whether the gains are robust or sensitive to post-hoc choices.
Authors: We agree that these details are required for proper evaluation. In the revised manuscript, we explicitly define all baselines (the base Qwen2.5-VL-Instruct-3B and other compared models), report error bars as standard deviations from three random seeds, include paired t-test results with p-values for statistical significance, and add a dedicated paragraph in Section 4 detailing task selection criteria and data filtering procedures. A summary table of these settings has also been added. revision: yes
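As a rough illustration of the analysis this response commits to, the sketch below computes a mean and standard deviation over three seeds and a paired t-test across seeds; the per-seed scores are placeholder numbers, not results from the paper.

```python
import statistics
from scipy import stats

# Placeholder per-seed benchmark averages (three seeds per configuration).
baseline_scores = [52.1, 51.8, 52.4]
lmm_r1_scores = [56.7, 56.9, 57.3]

for name, scores in [("baseline", baseline_scores), ("LMM-R1", lmm_r1_scores)]:
    print(f"{name}: {statistics.mean(scores):.2f} +/- {statistics.stdev(scores):.2f}")

# Paired t-test across seeds: how unlikely is this gap under the null of no difference?
t_stat, p_value = stats.ttest_rel(lmm_r1_scores, baseline_scores)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```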
-
Referee: [Method and Experiments] Method and Experiments: the load-bearing claim that FRE (text-only rule-based RL) produces transferable reasoning skills that MGT then generalizes to multimodal inputs without degradation or visual interference is unsupported by ablations. No results compare MGT alone versus the full FRE+MGT pipeline, nor provide pre/post-FRE multimodal reasoning probes, so the contribution of the first stage cannot be isolated from MGT training itself.
Authors: We acknowledge the need to isolate the FRE contribution. We have run additional experiments comparing (i) MGT alone on the base model, (ii) multimodal performance immediately before and after FRE, and (iii) the full FRE+MGT pipeline. These results, now included in the revised Experiments section, show that FRE yields measurable multimodal gains and that the combined pipeline outperforms MGT alone with no degradation, thereby supporting the transfer claim. revision: yes
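A minimal sketch of the ablation grid described above, using hypothetical `rl_train` and `evaluate` callables in place of the actual training and benchmarking code; only the compared conditions come from the response, the helper names are assumptions.

```python
def run_fre_ablation(base_model, text_data, multimodal_data, rl_train, evaluate):
    """Compare MGT alone, FRE alone, and the full FRE+MGT pipeline against the base model."""
    results = {"base model": evaluate(base_model)}

    # (i) MGT alone: multimodal RL applied directly to the base model, no text-only stage.
    results["MGT only"] = evaluate(rl_train(base_model, multimodal_data))

    # (ii) FRE only: probe multimodal benchmarks immediately after the text-only stage.
    fre_model = rl_train(base_model, text_data)
    results["FRE only"] = evaluate(fre_model)

    # (iii) Full pipeline: MGT initialized from the FRE checkpoint.
    results["FRE + MGT"] = evaluate(rl_train(fre_model, multimodal_data))

    # The transfer claim needs FRE+MGT > MGT only, with FRE only showing
    # no multimodal degradation relative to the base model.
    return results
```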
-
Referee: [Method] Method: although the paper notes answer ambiguity as a barrier for multimodal rule-based RL, it supplies no concrete definition or computation details for the reward function used in the MGT stage. This is essential for verifying that the RL process remains well-defined and stable under multimodal inputs.
Authors: We apologize for the missing details. The revised manuscript now contains a new subsection (3.3) that fully specifies the MGT reward function: it applies rule-based verification to the extracted final answer, using exact string matching augmented by a confidence threshold inherited from the FRE stage to mitigate ambiguity. The exact formula, input preprocessing steps for multimodal cases, and stability safeguards are provided to allow verification of the RL process. revision: yes
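Since the revised reward formula is only summarized here, the following is a hedged sketch in the spirit of that summary: rule-based verification of an extracted final answer by exact string matching, gated by a confidence-style threshold for ambiguous items. The answer-format pattern, the externally supplied confidence score, and the threshold logic are assumptions for illustration, not the paper's exact definition.

```python
import re

def mgt_reward(response: str, gold_answer: str,
               answer_confidence: float = 1.0, threshold: float = 0.5) -> float:
    """Illustrative MGT reward: exact-match verification gated by a confidence threshold."""
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return 0.0  # format violation: no parsable final answer
    if answer_confidence < threshold:
        return 0.0  # ambiguous item: withheld from positive reward
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0
```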
Circularity Check
No circularity in empirical two-stage training pipeline
Full rationale
The paper presents a purely empirical two-stage training procedure (FRE on text-only data followed by MGT on multimodal data) with no equations, derivations, or fitted parameters that reduce the reported benchmark gains to quantities defined inside the same work. Performance is measured on external benchmarks after training, and the central claim of text-to-multimodal transfer rests on experimental outcomes rather than any self-definitional, self-citation load-bearing, or ansatz-smuggling step. No uniqueness theorems or renamings of known results are invoked in a circular manner.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Rule-based RL can strengthen foundational reasoning abilities when applied to text-only data.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: "two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT)"
-
IndisputableMonolith/Foundation/ArrowOfTime.lean · z_monotone_absolute · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: "response length trends ... FRE-Text demonstrates rapid growth in response length ... FRE-Multi shows a consistent downward trend"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...
-
Improving Vision-language Models with Perception-centric Process Reward Models
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
-
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.
-
Latent Visual Reasoning
Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
-
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
-
Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification
ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing int...
-
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
-
AdaTooler-V: Adaptive Tool-Use for Images and Videos
AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
-
SPHINX: A Synthetic Environment for Visual Perception and Reasoning
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
Quantifying and Understanding Uncertainty in Large Reasoning Models
Conformal prediction adapted to reasoning-answer pairs in LRMs yields distribution-free uncertainty sets with finite-sample guarantees, paired with a Shapley explanation method that isolates provably sufficient traini...
-
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
-
Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
Longer textual reasoning chains degrade MLLM accuracy on fine-grained visual tasks; a new normalization and constrained-reward training framework mitigates the effect and sets new SOTA numbers.
-
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.