Recognition: 2 theorem links · Lean Theorem
Latent Visual Reasoning
Pith reviewed 2026-05-15 18:38 UTC · model grok-4.3
The pith
Multimodal models can perform reasoning steps by autoregressively generating latent visual states that reconstruct key image tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that training the language model to generate latent states in the joint visual embedding space, with the explicit objective of reconstructing critical visual tokens from the encoder, constitutes autoregressive visual reasoning that improves perception-intensive visual question answering when interleaved with standard text output.
What carries the argument
Autoregressive generation of latent visual states whose training target is reconstruction of selected visual tokens from the shared embedding space.
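Read as a training objective, this is a reconstruction loss applied to the latent states the language model emits at designated positions. A minimal sketch follows, assuming a cosine-style distance and pre-selected target tokens; the abstract does not specify the distance function, how query-critical tokens are chosen, or the tensor shapes, so every name below is illustrative.

```python
# Minimal sketch of a latent-visual reconstruction objective, not the authors' code.
# Assumptions: `lm_latents` are the LM's generated latent states at the reserved
# latent-reasoning positions, already projected into the joint embedding space;
# `target_visual_tokens` are the encoder outputs for the query-critical patches.
import torch
import torch.nn.functional as F

def latent_reconstruction_loss(lm_latents: torch.Tensor,
                               target_visual_tokens: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between generated latents and the visual tokens
    they are supposed to recover, averaged over latent steps and the batch.

    lm_latents:            (batch, n_latent_steps, d_joint)
    target_visual_tokens:  (batch, n_latent_steps, d_joint)
    """
    pred = F.normalize(lm_latents, dim=-1)
    target = F.normalize(target_visual_tokens, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()

# Toy usage with random tensors standing in for real activations.
loss = latent_reconstruction_loss(torch.randn(2, 4, 768), torch.randn(2, 4, 768))
print(loss)
```

An L2 or contrastive objective would be an equally plausible reading; the load-bearing point is only that the latent steps are supervised against encoder outputs rather than text tokens.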
If this is right
- Interleaving visual latent steps with text output produces measurable gains on fine-grained visual question answering.
- Reinforcement learning via adapted GRPO can be used to balance the frequency of latent visual generation against text generation (see the reward-shaping sketch after this list).
- The approach reduces dependence on external visual-editing tools by keeping reasoning inside the model's own embedding space.
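On the GRPO point flagged above, here is a hedged sketch of how the adaptation might balance the two generation modes. The abstract says only that GRPO is adapted to balance LVR and textual generation, so the composite reward, latent-step budget, and penalty weight below are illustrative assumptions.

```python
# Hypothetical reward shaping for a GRPO-style update over rollouts that mix
# latent visual steps and text: reward correctness, lightly penalize overlong
# latent trajectories, then compute group-relative advantages.
from statistics import mean, pstdev

def composite_reward(answer_correct: bool, n_latent_steps: int,
                     latent_budget: int = 8, penalty_weight: float = 0.05) -> float:
    """Task reward minus a penalty for exceeding a latent-step budget."""
    task_reward = 1.0 if answer_correct else 0.0
    overflow = max(0, n_latent_steps - latent_budget)
    return task_reward - penalty_weight * overflow

def group_relative_advantages(rewards: list) -> list:
    """GRPO-style advantages: rewards normalized within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Four rollouts for the same query: (answer correct?, latent steps used).
rollouts = [(True, 3), (True, 12), (False, 2), (True, 6)]
advantages = group_relative_advantages([composite_reward(ok, n) for ok, n in rollouts])
print(advantages)
```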
Where Pith is reading between the lines
- The method might extend naturally to video by treating successive frames as sequences of visual tokens to reconstruct.
- If the latent states truly carry visual reasoning, similar reconstruction objectives could be added to pure language models for tasks that benefit from imagined visual simulation.
- A practical test would measure whether the same gains appear when the visual tokens are supplied through text captions instead of direct reconstruction.
Load-bearing premise
That forcing the model to reconstruct chosen visual tokens through latent states produces actual reasoning steps that outperform language-only chain-of-thought or external editing tools.
What would settle it
A controlled experiment showing that removing the visual-token reconstruction loss while keeping the same model size, data, and interleaving schedule yields no gain on MMVP or similar perception benchmarks.
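A minimal sketch of that ablation grid, assuming an identical backbone, data mixture, and interleaving schedule with only the reconstruction term toggled; the field names and defaults are illustrative, not taken from the paper.

```python
# Hypothetical configuration grid for the controlled experiment described above.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    backbone: str = "Qwen2.5-VL"           # same base model in both arms
    data_mixture: str = "same-mixture"     # same training data
    interleave_schedule: str = "latent-then-text"
    use_reconstruction_loss: bool = True
    use_grpo: bool = True

full_lvr = TrainConfig()
no_reconstruction = replace(full_lvr, use_reconstruction_loss=False)

# If `no_reconstruction` matches `full_lvr` on MMVP, the reconstruction objective
# carries no weight; any remaining gap is attributable to it.
for arm in (full_lvr, no_reconstruction):
    print(arm)
```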
read the original abstract
Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code base and model weights will be released later.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Latent Visual Reasoning (LVR), a paradigm for autoregressive reasoning in the visual embedding space of MLLMs. A visual encoder maps images to tokens in a joint space; the LM is trained to generate latent states that reconstruct query-critical visual tokens. These latents are interleaved with text generation, and the GRPO RL algorithm is adapted to balance LVR with textual output. The work reports gains on perception-intensive VQA, including 71.67% on MMVP versus 66.67% for Qwen2.5-VL.
Significance. If the reported gains can be isolated to the latent reconstruction mechanism rather than the RL stage or capacity increases, the approach could meaningfully extend reasoning beyond language-space CoT or external visual tools by operating directly in shared visual embeddings. The GRPO adaptation for latent trajectories is a concrete technical contribution worth exploring further.
major comments (2)
- [Abstract] The single reported benchmark improvement (MMVP 71.67% vs 66.67%) provides no baseline details, ablations, or controls to separate the contribution of the latent reconstruction objective from the GRPO training stage, additional capacity, or data differences. This is load-bearing for the central claim that LVR constitutes a new reasoning paradigm.
- [Abstract] The mechanism description does not specify how latent states are produced step-by-step as an autoregressive trajectory independent of the final answer, nor whether reconstruction is enforced only during training or also at inference. Without these details the process risks being equivalent to a conditional visual predictor rather than genuine multi-step visual reasoning.
minor comments (1)
- [Abstract] The statement that code and model weights will be released later lacks a timeline or repository link.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, providing clarifications from the full manuscript and committing to revisions that strengthen the presentation of LVR as a distinct reasoning paradigm.
read point-by-point responses
- Referee: [Abstract] The single reported benchmark improvement (MMVP 71.67% vs 66.67%) provides no baseline details, ablations, or controls to separate the contribution of the latent reconstruction objective from the GRPO training stage, additional capacity, or data differences. This is load-bearing for the central claim that LVR constitutes a new reasoning paradigm.
  Authors: We agree that the abstract alone does not isolate contributions. The full manuscript includes direct comparisons to the Qwen2.5-VL base model, an ablation removing the latent reconstruction loss while retaining GRPO, and controls for data and capacity. These show that the largest gains on MMVP and similar perception benchmarks arise specifically from the latent reconstruction objective. We will revise the abstract to reference these controls and point readers to the relevant experimental sections. Revision: yes.
- Referee: [Abstract] The mechanism description does not specify how latent states are produced step-by-step as an autoregressive trajectory independent of the final answer, nor whether reconstruction is enforced only during training or also at inference. Without these details the process risks being equivalent to a conditional visual predictor rather than genuine multi-step visual reasoning.
  Authors: The manuscript (Section 3) specifies that the LM autoregressively generates a variable-length sequence of latent visual states conditioned on prior latents and text tokens, before emitting the final textual answer. Each latent is trained to reconstruct query-critical visual tokens via an auxiliary loss applied only during training; at inference the reconstruction loss is disabled and the model generates the latent trajectory freely as part of its reasoning process. This interleaving produces multi-step visual reasoning trajectories distinct from a single conditional prediction. We will add a concise description of this training-versus-inference distinction to the abstract. Revision: yes.
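A minimal, self-contained sketch of the training-versus-inference split described in this response; the GRUCell stand-in, the MSE loss, and every name below are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: latent states are rolled out autoregressively in both
# regimes, but the auxiliary reconstruction loss is attached only when visual
# targets are supplied (training); at inference the trajectory is generated freely.
from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLatentReasoner(nn.Module):
    """Toy stand-in for the LM component that emits latent visual states."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.step = nn.GRUCell(dim, dim)  # one autoregressive latent step

    def rollout(self, prefix: torch.Tensor, n_steps: int,
                visual_targets: Optional[torch.Tensor] = None):
        state, latents, losses = prefix, [], []
        for t in range(n_steps):
            state = self.step(state, state)            # condition on previous latent
            latents.append(state)
            if visual_targets is not None:             # training regime only
                losses.append(F.mse_loss(state, visual_targets[t]))
        aux_loss = torch.stack(losses).mean() if losses else None
        return latents, aux_loss

model = TinyLatentReasoner()
prefix = torch.randn(2, 64)                            # batch of 2, joint dim 64
targets = torch.randn(4, 2, 64)                        # targets for 4 latent steps
_, train_loss = model.rollout(prefix, 4, targets)      # training: tensor loss
_, infer_loss = model.rollout(prefix, 4)               # inference: None
print(train_loss, infer_loss)
```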
Circularity Check
No significant circularity; empirical paradigm introduction without derivations
full rationale
The paper introduces LVR as a training paradigm in which the LM generates latent states to reconstruct selected visual tokens in a shared embedding space, then interleaves this with text generation and applies GRPO RL. No equations, derivations, or first-principles results are presented. Claims rest entirely on empirical benchmark gains (e.g., MMVP improvement over Qwen2.5-VL). No self-definitional loops, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear. The method is self-contained as a proposed architecture evaluated on external tasks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A visual encoder projects images into visual tokens within a joint semantic space shared with the language model.
invented entities (1)
- latent states: no independent evidence
Forward citations
Cited by 22 Pith papers
- ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
  ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
- UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
  UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
- Hybrid Latent Reasoning with Decoupled Policy Optimization
  HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
- V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
  V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
  RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, an...
- 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
  4DThinker enables VLMs to perform dynamic spatial reasoning by internally simulating 4D imagery in latent space, outperforming prior text-based and modular approaches.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
  HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
- GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification
  GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
  MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
  GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...
- Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
  TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolutio...
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.
- MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
  MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
- Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
  HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
- Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs
  DLR is a new reinforced latent reasoning method for VLMs that decomposes queries, uses continuous visual latents, and outperforms text-only and multimodal CoT baselines on vision-centric benchmarks with better interpr...
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. ar...
- [2] Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, and Zhe Zhao. Diagnosing and mitigating modality interference in multimodal large language models. arXiv preprint arXiv:2505.19616.
- [3] Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. SFT or RL? An early investigation into training R1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468, 2025a.
- [4] Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171.
- [5] Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, and Youngjae Yu. Don't look only once: Towards multimodal interactive reasoning with selective visual revisitation (v1: Learning to point visual tokens for multimodal grounded reasoning). arXiv preprint arXiv:2505.18842.
- [6] Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text?, 2025a. Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.1468...
- [7] Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding. arXiv preprint arXiv:2501.05452.
- [8] Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. WebWatcher: Breaking new frontiers of vision-language deep research agent. arXiv preprint arXiv:2508.05748.
- [9] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
- [10] Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, and Zhou Zhao. APO: Enhancing reasoning ability of MLLMs via asymmetric policy optimization. arXiv preprint arXiv:2506.21655.
- [11] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403.
- [12] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models, 2025a.
- [13] Bo Li, Peiyuan Zhang, Kaichen Zhang, Fanyi Pu, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. LMMs-Eval: Accelerating the development of large multimodal models, 2024a. URL https://github.com/EvolvingLMMs-Lab/lmms-eval. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, ... LLaVA-OneVision: Easy visual task transfer.
- [14] Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, and Jiacheng Zhu. Modomodo: Multi-domain data mixtures for multimodal LLM reinforcement learning. arXiv preprint arXiv:2505.24871.
- [15] Dairu Liu, Ziyue Wang, Minyuan Ruan, Fuwen Luo, Chi Chen, Peng Li, and Yang Liu. Visual abstract thinking empowers multimodal reasoning, May 2025a. URL http://arxiv.org/abs/2505.20164. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS.
- [16] Zeyu Liu, Yuhang Liu, Guanghao Zhu, Congkai Xie, Zhen Li, Jianbo Yuan, Xinyao Wang, Qing Li, Shing-Chi Cheung, Shengyu Zhang, et al. Infi-MMR: Curriculum-based unlocking multimodal reasoning via phased reinforcement learning in multimodal small language models. arXiv preprint arXiv:2505.23091, 2025b.
- [17] Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, and Lijuan Wang. Point-RFT: Improving multimodal reasoning with visually grounded reinforcement finetuning. arXiv preprint arXiv:2505.19702.
- [18] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3B LMMs with strong reasoning abilities through two-stage rule-based RL. arXiv preprint arXiv:2503.07536.
- [19] Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Unleashing chain-of-thought reasoning in multi-modal language models, 2024a.
- [20] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-RFT: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025.
- [21] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818. doi: 10.48550/arXiv.2405.09818. URL https://github.com/facebookresearch/chameleon.
- [22] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs.
- [23] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025a.
- [24] Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564.
- [25] Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440.
- [26] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025a.
- [27] Guanghao Zhang, Tao Zhong, Yan Xia, Zhelun Yu, Haoyuan Li, Wanggui He, Fangxun Shu, Mushui Liu, Dong She, Yi Wang, et al. CMMCoT: Enhancing complex multi-image comprehension via multi-modal chain-of-thought and memory augmentation. arXiv preprint arXiv:2503.05255, 2025a.
- [28] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923.
- [29] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362.