pith. sign in

arxiv: 2601.06803 · v2 · submitted 2026-01-11 · 💻 cs.CL · cs.CV

Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

Pith reviewed 2026-05-16 15:55 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords latent reasoningvisual reasoningsuperpositionvision-language modelsefficient inferencechain-of-thought alternative
0
0 comments X p. Extension

The pith

Laser lets vision-language models reason in latent space by holding global features in superposition before committing to local details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Laser as a way for large vision-language models to perform multi-step visual reasoning without generating explicit text rationales. It replaces rigid next-token prediction with Dynamic Windowed Alignment Learning, which aligns each latent state to a sliding window of upcoming semantics. This produces a forest-before-trees hierarchy in which the model sustains a probabilistic superposition of global visual features rather than collapsing early to specifics. Self-Refined Superposition is added to keep the unconstrained process stable and interpretable. The approach yields higher benchmark accuracy than prior latent methods while slashing the number of inference tokens required.

Core claim

Reformulating visual deduction through Dynamic Windowed Alignment Learning aligns latent states to a dynamic validity window of future semantics, enforcing maintenance of a probabilistic superposition of global features before narrowing to local details and thereby improving both accuracy and token efficiency over point-wise latent baselines.

What carries the argument

Dynamic Windowed Alignment Learning (DWAL), which replaces point-wise next-token targets with alignment to a sliding window of future semantic states to sustain feature superposition.

If this is right

  • Visual reasoning can preserve continuous image details that text tokenization normally discards.
  • Inference cost drops by more than 97 percent while accuracy on standard benchmarks rises.
  • Latent trajectories remain decodable, preserving a degree of interpretability.
  • Generalization improves on out-of-distribution visual reasoning problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same windowed superposition idea could be tested in pure language models to reduce token budgets during long reasoning chains.
  • Models might learn to modulate window size dynamically according to the granularity demanded by the current task.
  • Hybrid systems could switch between latent superposition and explicit text steps depending on whether global or local structure is required.

Load-bearing premise

That aligning latent states to a dynamic window of future semantics will keep global features in stable probabilistic superposition without premature collapse into local predictions.

What would settle it

Measure whether accuracy on tasks that require early fine-grained visual distinctions drops sharply when the alignment window size is reduced to force immediate local commitment.

Figures

Figures reproduced from arXiv: 2601.06803 by Juntian Zhang, Nils Lukas, Yankai Lin, Yichen Wu, Yubo Wang, Yuhan Liu.

Figure 1
Figure 1. Figure 1: Laser replaces verbose textual rationales [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Laser. Laser employs DWAL. At each step t, a dynamic validity window Wt is defined over future semantic tokens to construct a Reference Superposition Distribution. The latent state is then optimized to align with this distribution via LDW AL. The final answer is generated explicitly after the reasoning using LCE. 3.2 Synthesizing Cognitive Scanpaths For the Laser method, we require a datase… view at source ↗
Figure 3
Figure 3. Figure 3: Fine-grained comparison across 14 distinct categories. Laser outperforms Qwen2.5-VL-7B [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of the latent cognitive trajectory. The decoded tokens reveal a structured [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablation study. We contrast the full Laser model with variants lacking the DWAL (w/o [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: A test case from MMStar showcases the efficacy and efficiency of our Laser. [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: This multi-image reasoning test case from MMStar illustrates the effectiveness and [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
read the original abstract

While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a "Forest-before-Trees" cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Laser, a latent reasoning paradigm for large vision-language models that replaces explicit chain-of-thought with Dynamic Windowed Alignment Learning (DWAL). DWAL aligns latent states to a dynamic validity window of future semantics to enforce a 'Forest-before-Trees' hierarchy, maintaining probabilistic superposition of global features before local refinement. Self-Refined Superposition is introduced to stabilize unconstrained learning while preserving interpretability via decodable trajectories. On six benchmarks the method reports state-of-the-art results among latent-reasoning approaches, outperforming the Monet baseline by 5.03% on average and reducing inference tokens by more than 97%.

Significance. If the empirical gains and token-reduction figures are confirmed by detailed ablations and training protocols, the work would constitute a meaningful advance in efficient multimodal reasoning. By explicitly targeting premature semantic collapse through windowed alignment rather than rigid autoregressive objectives, Laser offers a concrete mechanism for preserving high-bandwidth visual information in latent space. The reported efficiency improvement is particularly noteworthy for deployment scenarios where token budget is constrained.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method): The central performance claims (+5.03% over Monet, >97% token reduction) are presented without the DWAL loss function, the precise definition of the dynamic validity window, or any ablation isolating the contribution of Self-Refined Superposition. Without these equations or controls it is impossible to determine whether the reported gains are produced by the proposed alignment process or by other unstated factors.
  2. [§4] §4 (Experiments): No training details (optimizer, learning-rate schedule, number of epochs, or data mixture) or statistical significance tests are supplied for the six-benchmark results. The absence of these elements makes it impossible to assess reproducibility or to rule out that the observed margin over Monet arises from hyper-parameter differences rather than the DWAL mechanism.
minor comments (2)
  1. [§2] The term 'probabilistic superposition' is used repeatedly but never given a formal definition or distance metric; a short paragraph clarifying its operational meaning would improve clarity.
  2. [Figures] Figure captions should explicitly state which baseline corresponds to each bar and whether error bars represent standard deviation across seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and positive evaluation of our paper's contributions to efficient visual reasoning in large vision-language models. We address each of the major comments below and will make the necessary revisions to improve the clarity and completeness of the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Method): The central performance claims (+5.03% over Monet, >97% token reduction) are presented without the DWAL loss function, the precise definition of the dynamic validity window, or any ablation isolating the contribution of Self-Refined Superposition. Without these equations or controls it is impossible to determine whether the reported gains are produced by the proposed alignment process or by other unstated factors.

    Authors: We acknowledge that the current version of the manuscript does not present the explicit DWAL loss function or the precise mathematical definition of the dynamic validity window in Section 3, nor does it include a dedicated ablation isolating Self-Refined Superposition. In the revision we will insert the full DWAL objective equation, which aligns each latent state to a validity window whose size is determined by remaining sequence length and estimated semantic entropy. We will also add a controlled ablation in Section 4 that compares the full Laser model against a variant without the self-refinement step, thereby isolating its contribution and confirming that the reported gains derive from the windowed alignment mechanism. revision: yes

  2. Referee: [§4] §4 (Experiments): No training details (optimizer, learning-rate schedule, number of epochs, or data mixture) or statistical significance tests are supplied for the six-benchmark results. The absence of these elements makes it impossible to assess reproducibility or to rule out that the observed margin over Monet arises from hyper-parameter differences rather than the DWAL mechanism.

    Authors: We agree that the experimental section currently omits the required training protocol and statistical analysis. In the revised manuscript we will expand Section 4 to report the optimizer (AdamW), the learning-rate schedule with warm-up and decay parameters, the total number of epochs, and the exact data-mixture ratios. We will additionally include standard deviations computed over multiple random seeds together with paired statistical significance tests (e.g., Wilcoxon signed-rank) that compare Laser against Monet, thereby demonstrating that the observed margins are attributable to the DWAL mechanism rather than hyper-parameter variation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces Laser as an empirical method relying on Dynamic Windowed Alignment Learning (DWAL) to align latent states with a dynamic validity window and Self-Refined Superposition for stabilization. Central claims consist of benchmark performance gains (SOTA among latent methods, +5.03% over Monet, >97% token reduction) demonstrated via experiments on 6 benchmarks. No equations, fitted parameters, or derivation steps are presented that reduce predictions or uniqueness claims to self-definitions, self-citations, or inputs by construction. The mechanism descriptions are internally consistent with stated goals and do not invoke load-bearing self-citations or ansatzes that collapse the argument. This is a standard empirical proposal whose validity rests on external experimental falsifiability rather than internal re-expression of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The abstract introduces two named mechanisms (DWAL and Self-Refined Superposition) without equations or stated assumptions, so the ledger records only the explicitly named invented components.

invented entities (2)
  • Dynamic Windowed Alignment Learning (DWAL) no independent evidence
    purpose: Align latent state with a dynamic validity window of future semantics to enforce forest-before-trees hierarchy
    Core proposed mechanism for maintaining probabilistic superposition
  • Self-Refined Superposition no independent evidence
    purpose: Stabilize unconstrained learning during the alignment process
    Stabilizing component mentioned to prevent collapse

pith-pipeline@v0.9.0 · 5510 in / 1242 out tokens · 48229 ms · 2026-05-16T15:55:26.862848+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hybrid Latent Reasoning with Decoupled Policy Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

  2. Geometric Decoupling: Diagnosing the Structural Instability of Latent

    cs.CV 2026-04 unverdicted novelty 6.0

    Latent diffusion models exhibit geometric decoupling where curvature in out-of-distribution generation is misallocated to unstable semantic boundaries instead of image details, identifying geometric hotspots as the st...

  3. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  4. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  5. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 3 Pith papers · 13 internal anchors

  1. [1]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Llava-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2024

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2024

  3. [3]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  4. [5]

    Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

    Georges Zelikman, Eric andovor et al. Quiet-star: Language models can teach themselves to think before speaking.arXiv preprint arXiv:2403.09629, 2024

  5. [6]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, et al. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

  6. [7]

    Forest before trees: The precedence of global features in visual perception

    David Navon. Forest before trees: The precedence of global features in visual perception. Cognitive psychology, 9(3):353–383, 1977

  7. [8]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

  8. [9]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  9. [10]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  10. [11]

    Minigpt-4: Enhanc- ing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhanc- ing vision-language understanding with advanced large language models. InICLR, 2024

  11. [12]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Pipu Wang, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  12. [15]

    Qwen3-vl technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  13. [16]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025. 11

  14. [17]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

  15. [18]

    Weaving context across images: Improving vision-language models through focus-centric visual chains

    Juntian Zhang, Chuanqi Cheng, Yuhan Liu, Wei Liu, Jian Luan, and Rui Yan. Weaving context across images: Improving vision-language models through focus-centric visual chains. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27782–27798, 2025

  16. [19]

    Viper: Empowering the self-evolution of visual perception abilities in vision-language model.arXiv preprint arXiv:2510.24285, 2025

    Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, et al. Viper: Empowering the self-evolution of visual perception abilities in vision-language model.arXiv preprint arXiv:2510.24285, 2025

  17. [20]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing" thinking with images" via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

  18. [22]

    Quiet-star: Language models can teach themselves to think before speaking

    Eric Zelikman, Georges Raif Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah Good- man. Quiet-star: Language models can teach themselves to think before speaking. InFirst Conference on Language Modeling

  19. [23]

    Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, and Heng Ji

    Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. Softcot: Soft chain-of-thought for efficient reasoning with llms.arXiv preprint arXiv:2502.12134, 2025

  20. [24]

    Cocova: Chain of continuous vision- language thought for latent space reasoning.arXiv preprint arXiv:2511.02360, 2025

    Jizheng Ma, Xiaofei Zhou, Yanlong Song, and Han Yan. Cocova: Chain of continuous vision- language thought for latent space reasoning.arXiv preprint arXiv:2511.02360, 2025

  21. [25]

    Multimodal chain of continuous thought for latent-space reasoning in vision-language models, 2025

    Tan-Hanh Pham and Chris Ngo. Multimodal chain of continuous thought for latent-space reasoning in vision-language models.arXiv preprint arXiv:2508.12587, 2025

  22. [26]

    Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

    Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine men- tal imagery: Empower multimodal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025

  23. [27]

    Reasoning in the dark: Interleaved vision-text reasoning in latent space, 2025

    Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, and Liqiang Nie. Reasoning in the dark: Interleaved vision-text reasoning in latent space.arXiv preprint arXiv:2510.12603, 2025

  24. [28]

    Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

    Shuai Dong, Siyuan Wang, Xingyu Liu, and Zhongyu Wei. Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

  25. [29]

    arXiv preprint arXiv:2510.24514 , year=

    Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, et al. Latent sketchpad: Sketching visual thoughts to elicit multimodal reasoning in mllms.arXiv preprint arXiv:2510.24514, 2025

  26. [30]

    Sketch-in-latents: Eliciting unified reasoning in mllms.arXiv preprint arXiv:2512.16584, 2025

    Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou, Yue Wu, Jieping Ye, and Ruixuan Li. Sketch-in-latents: Eliciting unified reasoning in mllms.arXiv preprint arXiv:2512.16584, 2025

  27. [31]

    arXiv preprint arXiv:2511.21395 , year=

    Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language.arXiv preprint arXiv:2511.21395, 2025

  28. [32]

    Latent chain-of-thought for visual reasoning

    Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, MAJID RABBANI, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  29. [33]

    Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu, and Xin Eric Wang. Rea- soning within the mind: Dynamic multimodal interleaving in latent space.arXiv preprint arXiv:2512.12623, 2025. 12

  30. [34]

    Mull-Tokens: Modality-Agnostic Latent Thinking

    Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, and Wen-Sheng Chu. Mull-tokens: Modality-agnostic latent thinking.arXiv preprint arXiv:2512.10941, 2025

  31. [35]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2025

  32. [36]

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024

  33. [37]

    Perception-Aware Policy Optimization for Multimodal Reasoning

    Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning.arXiv preprint arXiv:2507.06448, 2025

  34. [38]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InEuropean Conference on Computer Vision, pages 148–166. Springer, 2024

  35. [39]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024

  36. [40]

    Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

  37. [41]

    Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7907–7915, 2025

  38. [42]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pag...

  39. [43]

    arXiv preprint arXiv:2404.16790 , year=

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. SEED-Bench-2-Plus: Benchmarking multimodal large language models with text-rich visual comprehension.arXiv preprint arXiv:2404.16790, 2024

  40. [44]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  41. [45]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  42. [46]

    Latent Visual Reasoning

    Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025. 13 Appendix Table of Contents A Implementation Details 14 B Baseline Details 14 C Benchmark Details 15 D RL Analysis 15 D.1 Optimization Objective . . . . . . . . . . . ....

  43. [47]

    Kitchen",

    Global Anchor (Step 1-2):Start with the broadest visible context (e.g., "Kitchen", "Blue Sky")

  44. [48]

    Subject Localization (Step 3-X):Locate the main subject relevant to the question

  45. [49]

    Visual Evidence (Step X-Y):List visible attributes or actions supporting the answer

  46. [50]

    Fruit"→"Apple

    Critical Resolution (Final Step):The specific concept answering the query must appear at the very end. ### Negative Constraints -NO Premature Reveals:Do not output the answer early. -NO Artificial Hierarchy:No "Fruit"→"Apple", just "Apple". -NO Sentences:Raw concepts only. ### Output Format Output strictly valid JSON: { "reasoning_chain": [ "String1", "St...