DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Pith reviewed 2026-05-11 14:37 UTC · model grok-4.3
The pith
Reinforcement learning lets vision-language models develop native image-based reasoning without pre-collected data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepEyes trains a vision-language model end-to-end with reinforcement learning so that it learns to think with images through active perception, using its intrinsic grounding capability rather than external tools or pre-collected reasoning data. A tailored data selection and reward strategy steers the model to strategically ground its reasoning in visual content. The outcome is significant gains on general perception and reasoning benchmarks together with better grounding, lower hallucination rates, and stronger mathematical reasoning. During training the model passes through distinct stages: initial exploratory perception gives way to efficient and accurate exploitation, accompanied by diverse thinking patterns that resemble human visual reasoning.
What carries the argument
Active perception, the learned strategy by which the model decides when and how to ground its ongoing reasoning directly in visual information.
If this is right
- Performance improves on perception and reasoning benchmarks without any pre-collected reasoning traces.
- Grounding accuracy rises while hallucination rates fall, including on mathematical reasoning tasks.
- The model exhibits an internal progression from exploratory to exploitative visual behavior.
- Diverse thinking patterns appear that parallel human visual reasoning sequences.
Where Pith is reading between the lines
- The same reinforcement-learning incentive structure could be tested on video or audio sequences to induce analogous active-perception loops.
- If the approach scales, training pipelines for multimodal models may require far less curated reasoning data than current supervised routes.
- Longer-horizon tasks could reveal whether the emergent perception strategies remain stable or require additional reward shaping.
- Real-world deployment in dynamic environments would test whether the learned visual-grounding habits transfer beyond static benchmark images.
Load-bearing premise
The custom reward and data selection rules will steer the model toward genuine, useful visual grounding rather than superficial patterns that merely maximize the reward signal.
What would settle it
Run the same reinforcement learning loop with the visual-grounding reward terms removed or replaced by generic accuracy rewards; if benchmark gains and the reported evolution of perception behavior remain unchanged, the claim that active perception drives the improvements is falsified.
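The ablation described above can be sketched as a reward function with a switchable grounding term. This is a minimal illustration, not the paper's reward: the field names (`correct`, `format_ok`, `used_zoom`), the weights, and the rule that the grounding bonus is paid only alongside a correct answer are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled trajectory; all fields are hypothetical log entries."""
    correct: bool      # final answer matched the reference
    format_ok: bool    # output followed the expected template
    used_zoom: bool    # at least one visual-grounding action was taken


def reward(r: Rollout, grounding_bonus: float = 1.0,
           format_weight: float = 0.5) -> float:
    """Composite reward; set grounding_bonus=0.0 for the ablated run."""
    score = 1.0 if r.correct else 0.0
    score += format_weight if r.format_ok else 0.0
    # Assumed rule: pay the bonus only when grounding accompanied a correct
    # answer, so zooming alone cannot raise the reward.
    if r.correct and r.used_zoom:
        score += grounding_bonus
    return score


full = reward(Rollout(correct=True, format_ok=True, used_zoom=True))
ablated = reward(Rollout(correct=True, format_ok=True, used_zoom=True),
                 grounding_bonus=0.0)
```

Training once with the default bonus and once with `grounding_bonus=0.0`, on identical data, gives the two arms of the comparison this test calls for.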
Original abstract
Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning without requiring pre-collected reasoning data for cold-start supervised fine-tuning (SFT). Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of active perception from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepEyes, a vision-language model trained end-to-end via reinforcement learning to develop native 'thinking with images' capability through active perception. It claims this emerges without any cold-start supervised fine-tuning on pre-collected reasoning data, relying instead on tailored data selection and a custom reward strategy that leverages the model's intrinsic grounding. The approach reportedly yields significant gains on general perception and reasoning benchmarks, plus improvements in grounding, hallucination reduction, and mathematical reasoning, with observed behavioral evolution from exploration to exploitation and diverse human-like thinking patterns.
Significance. If the central claims hold under rigorous verification, the work would be moderately significant for multimodal AI research. It offers an empirical demonstration that RL can elicit integrated visual reasoning in VLMs without heavy reliance on SFT or external tools, potentially reducing data curation costs and enabling more autonomous active perception. The public code release is a clear strength for reproducibility.
major comments (3)
- [Results] Results section (and any associated tables/figures reporting benchmark scores): The manuscript claims 'significant performance gains' on perception and reasoning benchmarks but provides no quantitative deltas, baseline comparisons, statistical significance tests, or error bars. Without these, it is impossible to evaluate whether the gains exceed what data curation alone would produce, which is load-bearing for the claim that the RL mechanism (rather than selection) drives the result.
- [Methods] Methods section on reward design and data selection: The reward strategy is described at a high level as 'tailored' to encourage active perception, but no explicit formulation (e.g., components for grounding accuracy, reasoning utility, or format compliance) or weighting is given. This prevents assessment of whether the policy converges to integrative visual thinking or to superficial high-reward patterns such as periodic token emission, directly undermining the 'natively emerges' and 'causal integration' claims.
- [Analysis] Analysis or ablation subsection (if present): There are no reported ablations that isolate the contribution of the RL reward versus data selection, nor any causal intervention (e.g., forcing or removing visual thought steps and measuring downstream accuracy change). The observed 'evolution from exploration to exploitation' is presented observationally; without metrics tracking grounding utility over training or controlled experiments, the mechanism remains unverified.
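One way to supply the uncertainty estimates requested in the first major comment is a paired bootstrap over benchmark items. The sketch below assumes per-item 0/1 correctness vectors for a baseline and an RL-trained model evaluated on the same items. The function name and the toy data are hypothetical, and nothing here reproduces numbers from the paper.

```python
import random


def paired_bootstrap_delta(baseline, treated, n_resamples=2000, seed=0):
    """Return (mean delta, 95% CI low, 95% CI high) for treated - baseline.

    Both inputs are equal-length lists of 0/1 correctness on the same items.
    """
    if len(baseline) != len(treated):
        raise ValueError("inputs must cover the same benchmark items")
    n = len(baseline)
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, low, high


# Toy data, invented for illustration only.
base = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
rl = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
delta, ci_low, ci_high = paired_bootstrap_delta(base, rl)
```

Pairing matters because both models see the same items: resampling the per-item differences removes item difficulty as a source of variance. An interval that excludes zero would support a real gain, though it says nothing about whether RL or data selection produced it.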
minor comments (2)
- [Abstract] The abstract and introduction use the phrase 'significant performance gains' without defining the term or providing supporting numbers; this should be replaced with concrete metrics or removed.
- [Methods] Notation for the active perception loop (e.g., how visual grounding actions are interleaved with text reasoning) is introduced informally; a clear algorithmic pseudocode or diagram would improve clarity.
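The interleaving that the second minor comment asks about can be sketched as a loop in which the model either requests a closer look at a region or commits to an answer. This is an illustrative sketch under assumptions: the `policy` callable, the action dictionaries, and the turn limit are hypothetical stand-ins, and the image is a plain nested list so the example runs without an imaging library. It is not the paper's algorithm.

```python
from typing import Callable, List

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image


def crop(image: Image, box: tuple) -> Image:
    """Return the region (x1, y1, x2, y2), end-exclusive."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def active_perception(image: Image, question: str,
                      policy: Callable[[list], dict],
                      max_turns: int = 4) -> str:
    """Alternate between reasoning and zooming until the policy answers."""
    context = [("image", image), ("text", question)]
    for _ in range(max_turns):
        action = policy(context)
        if action["type"] == "answer":
            return action["text"]
        # Otherwise the policy asked to look closer: append the crop so the
        # next reasoning step can condition on it.
        context.append(("text", action.get("thought", "")))
        context.append(("image", crop(image, action["box"])))
    return "no answer within turn limit"


def toy_policy(context: list) -> dict:
    """Hypothetical policy: zoom once, then report the brightest pixel."""
    images = [item for kind, item in context if kind == "image"]
    if len(images) == 1:
        return {"type": "zoom", "thought": "look at top-left",
                "box": (0, 0, 2, 2)}
    return {"type": "answer", "text": str(max(max(r) for r in images[-1]))}


grid = [[1, 2, 3], [4, 9, 6], [7, 8, 5]]
result = active_perception(grid, "brightest pixel in the top-left?", toy_policy)
```

In a trained model the policy would be the vision-language model itself, and the cropped region would be re-encoded as new visual tokens. The sketch only fixes the control flow: each turn is either a grounding action or a final answer.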
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We appreciate the opportunity to clarify the presentation of our results, methods, and analyses. We address each major comment below and commit to revisions that will strengthen the manuscript.
-
Referee: [Results] Results section (and any associated tables/figures reporting benchmark scores): The manuscript claims 'significant performance gains' on perception and reasoning benchmarks but provides no quantitative deltas, baseline comparisons, statistical significance tests, or error bars. Without these, it is impossible to evaluate whether the gains exceed what data curation alone would produce, which is load-bearing for the claim that the RL mechanism (rather than selection) drives the result.
Authors: We agree that explicit quantitative comparisons are necessary to substantiate the claims. In the revised manuscript we will add tables reporting baseline scores, absolute and relative performance deltas, error bars from multiple runs, and statistical significance tests. We will also include a discussion comparing the observed gains against what data curation alone can achieve, thereby clarifying the contribution of the RL objective.
Revision: yes
-
Referee: [Methods] Methods section on reward design and data selection: The reward strategy is described at a high level as 'tailored' to encourage active perception, but no explicit formulation (e.g., components for grounding accuracy, reasoning utility, or format compliance) or weighting is given. This prevents assessment of whether the policy converges to integrative visual thinking or to superficial high-reward patterns such as periodic token emission, directly undermining the 'natively emerges' and 'causal integration' claims.
Authors: We acknowledge that the reward formulation was presented at too high a level. The revised Methods section will contain the complete mathematical definition of the reward, explicitly listing each component (grounding accuracy, reasoning utility, format compliance) together with the weighting coefficients. This will enable readers to evaluate convergence behavior and rule out superficial reward hacking.
Revision: yes
-
Referee: [Analysis] Analysis or ablation subsection (if present): There are no reported ablations that isolate the contribution of the RL reward versus data selection, nor any causal intervention (e.g., forcing or removing visual thought steps and measuring downstream accuracy change). The observed 'evolution from exploration to exploitation' is presented observationally; without metrics tracking grounding utility over training or controlled experiments, the mechanism remains unverified.
Authors: We agree that additional ablations and quantitative tracking would strengthen the mechanistic claims. The revision will include ablation experiments that compare full RL training against data-selection-only baselines, as well as plots of grounding utility and exploration/exploitation metrics across training steps. Full causal interventions (forcing or ablating visual thought steps) would require new controlled runs; we will therefore provide enhanced observational analysis and discuss the limits of the current evidence.
Revision: partial
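Plots of exploration and exploitation behavior across training steps need per-step aggregates of rollout logs. A minimal sketch follows, assuming each logged rollout records the training step, the number of zoom calls, and whether the answer was correct. The field names and the toy log are hypothetical.

```python
from collections import defaultdict


def perception_stats(rollouts):
    """Aggregate rollouts per training step.

    Each rollout is a dict with keys 'step', 'zoom_calls', 'correct'.
    Returns {step: {'mean_zoom_calls', 'acc_with_zoom', 'acc_without_zoom'}};
    an accuracy is None when no rollout falls in that group.
    """
    by_step = defaultdict(list)
    for r in rollouts:
        by_step[r["step"]].append(r)

    def accuracy(group):
        return sum(r["correct"] for r in group) / len(group) if group else None

    stats = {}
    for step, group in sorted(by_step.items()):
        with_zoom = [r for r in group if r["zoom_calls"] > 0]
        without_zoom = [r for r in group if r["zoom_calls"] == 0]
        stats[step] = {
            "mean_zoom_calls": sum(r["zoom_calls"] for r in group) / len(group),
            "acc_with_zoom": accuracy(with_zoom),
            "acc_without_zoom": accuracy(without_zoom),
        }
    return stats


# Toy log, invented for illustration only.
log = [
    {"step": 0, "zoom_calls": 3, "correct": False},
    {"step": 0, "zoom_calls": 0, "correct": True},
    {"step": 100, "zoom_calls": 1, "correct": True},
    {"step": 100, "zoom_calls": 1, "correct": True},
]
stats = perception_stats(log)
```

A gap between `acc_with_zoom` and `acc_without_zoom` is only correlational, since the model chooses when to zoom. It complements, but does not replace, the causal intervention the referee requested.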
Circularity Check
No circularity: empirical RL training with external benchmarks
The paper presents an empirical end-to-end RL method for training VLMs to perform active perception and 'think with images' without cold-start SFT. Claims rest on performance gains measured against external perception/reasoning benchmarks and observed behavioral evolution during training. No mathematical derivations, equations, or self-referential definitions are present that would reduce any result to its inputs by construction. The approach is self-contained against independent evaluation data.
Forward citations
Cited by 47 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...
-
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
-
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
-
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
-
Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...
-
Act2See: Emergent Active Visual Perception for Video Reasoning
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
-
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
-
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
-
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
-
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
-
Motion-o: Trajectory-Grounded Video Reasoning
Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.
-
Venus-DeFakerOne: Unified Fake Image Detection & Localization
DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
-
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
-
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
-
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
-
Agentic AI for Remote Sensing: Technical Challenges and Research Directions
Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...
-
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
-
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
-
Visual Reasoning through Tool-supervised Reinforcement Learning
ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.
-
Towards Long-horizon Agentic Multimodal Search
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
-
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
-
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
-
LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
-
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
-
Agentic AI for Remote Sensing: Technical Challenges and Research Directions
Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.
-
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
-
Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...
-
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
-
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
-
Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolutio...
-
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
-
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.
-
MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
Published as a conference paper at ICLR 2026.
Appendix A.1: system prompt (excerpt, truncated in the source)
You are a helpful assistant.
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{ "type": "function", "function": { "name": "image_zoom_in_tool", "description": "Zoom in on a specific region of an image by cro...