pith. the verified trust layer for science.

arxiv: 2512.12623 · v3 · submitted 2025-12-14 · 💻 cs.CV · cs.CL

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Pith reviewed 2026-05-16 22:59 UTC · model grok-4.3

keywords: multimodal reasoning · latent space · dynamic interleaving · policy gradient optimization · visual injection · chain-of-thought · multimodal large language models · test-time reasoning

The pith

Multimodal models interleave reasoning and visual perception dynamically inside latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that effective multimodal reasoning does not require explicit linear steps or external tools to handle visual information. Instead, it claims that models can achieve deeper reasoning by dynamically interleaving perception and thought directly within their internal latent representations. To realize this, DMLR refines sequences of latent think tokens using confidence-guided policy gradient optimization. A complementary Dynamic Visual Injection Strategy retrieves and updates the most relevant visual patches at each step before injecting them into the token stream. If correct, this would let models perform more fluid and efficient cross-modal reasoning than prior chain-of-thought extensions.

Core claim

We posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into latent think token to achieve dynamic visual-textual interleaving.

What carries the argument

The DMLR framework, which applies confidence-guided latent policy gradient optimization to think tokens, together with a Dynamic Visual Injection Strategy that retrieves relevant visual patches and injects them into the latent stream.
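
To make the mechanism concrete, here is a minimal Python sketch of a test-time policy-gradient loop over a single latent vector. It is a toy stand-in and not the authors' implementation: the frozen linear readout `weights`, the Gaussian perturbation policy, the use of the top softmax probability as the confidence reward, and every hyperparameter (`steps`, `samples`, `sigma`, `lr`) are assumptions introduced here, since the summary above does not specify them.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(latent, weights):
    """Toy reward (assumption): top softmax probability of a frozen linear readout."""
    logits = [sum(w * z for w, z in zip(row, latent)) for row in weights]
    return max(softmax(logits))


def refine_latent(latent, weights, steps=50, samples=16, sigma=0.1, lr=0.5, seed=0):
    """REINFORCE-style ascent on the mean of a Gaussian policy centred on the latent.

    Only the latent vector changes; the readout weights stay frozen, which mirrors
    the idea of test-time optimization without retraining.
    """
    rng = random.Random(seed)
    latent = list(latent)
    for _ in range(steps):
        noises, rewards = [], []
        for _ in range(samples):
            eps = [rng.gauss(0.0, 1.0) for _ in latent]
            candidate = [z + sigma * e for z, e in zip(latent, eps)]
            noises.append(eps)
            rewards.append(confidence(candidate, weights))
        # Mean reward as a baseline; it lowers variance without changing the target.
        baseline = sum(rewards) / samples
        grad = [0.0] * len(latent)
        for eps, reward in zip(noises, rewards):
            for i, e in enumerate(eps):
                # grad of log N(candidate; latent, sigma^2) w.r.t. latent is eps / sigma
                grad[i] += (reward - baseline) * e / sigma
        latent = [z + lr * g / samples for z, g in zip(latent, grad)]
    return latent


if __name__ == "__main__":
    toy_weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    start = [0.2, 0.1, 0.0, 0.0]
    end = refine_latent(start, toy_weights)
    print("confidence before:", confidence(start, toy_weights))
    print("confidence after: ", confidence(end, toy_weights))
```

In this sketch the model is never updated; only the latent vector moves, and the only learning signal is the model's own confidence. Whether the paper's reward has this form is exactly what the referee report below asks the authors to specify.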

If this is right

  • DMLR improves both reasoning accuracy and perception quality across seven multimodal benchmarks.
  • The gains hold across different model architectures while preserving high inference efficiency.
  • The method operates entirely at test time without retraining or external modules.
  • Visual-textual interleaving occurs inside the latent space rather than through generated text or images.
  • Reliance on explicit chain-of-thought steps or tool calls is reduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent optimization approach could be adapted to interleave audio or video features without changing the core policy-gradient loop.
  • If the mechanism scales, models might handle longer-horizon tasks by maintaining a running set of best perceptual patches across many latent steps.
  • Removing the need for external tools suggests potential for fully internal reasoning loops that are harder to inspect or steer from outside.

Load-bearing premise

Dynamic interleaving of reasoning and perception can be realized effectively in latent space through confidence-guided policy gradient optimization on think tokens without explicit step-by-step processes or external tools.

What would settle it

Running the same seven-benchmark suite but replacing the dynamic visual injection with fixed initial patches and disabling the policy gradient updates, then checking whether the reported performance gains disappear.
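
One way to organize such a check is a two-by-two ablation grid, sketched below in Python. The grid goes slightly beyond the test described above by also toggling each component on its own, which is an editorial assumption; the function names and the `scores` layout are hypothetical, and the scores must come from real benchmark runs rather than from this sketch.

```python
from itertools import product


def ablation_grid():
    """Enumerate the four on/off combinations of the two DMLR components."""
    grid = []
    for dynamic_injection, policy_gradient in product((True, False), repeat=2):
        grid.append(
            {
                # False means: keep the fixed initial patches throughout.
                "dynamic_injection": dynamic_injection,
                # False means: skip the latent policy-gradient updates.
                "policy_gradient": policy_gradient,
            }
        )
    return grid


def attribute_gains(scores):
    """Compare each configuration with the fully ablated baseline.

    `scores` maps (dynamic_injection, policy_gradient) to a mean accuracy
    measured by the caller on the benchmark suite.
    """
    base = scores[(False, False)]
    return {
        "total_gain": scores[(True, True)] - base,
        "injection_only_gain": scores[(True, False)] - base,
        "policy_gradient_only_gain": scores[(False, True)] - base,
    }
```

If `total_gain` stays close to zero, the reported improvement does not depend on either component; if one of the single-component gains accounts for nearly all of it, the other component is not carrying the argument.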

Original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning, unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into latent think token to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes DMLR, a test-time Dynamic Multimodal Latent Reasoning framework for MLLMs. It refines latent think tokens via confidence-guided policy gradient optimization and introduces a Dynamic Visual Injection Strategy that retrieves relevant visual patches and injects them into the tokens to achieve dynamic visual-textual interleaving. The central claim is that this approach yields significant gains in reasoning and perception performance across seven multimodal benchmarks and multiple model architectures while preserving high inference efficiency, without relying on external tools or explicit step-by-step processes.

Significance. If the empirical claims hold after proper validation, the work could advance test-time reasoning in MLLMs by realizing human-inspired dynamic interleaving directly in latent space. The avoidance of external tools and explicit generation steps is a potential strength, but the absence of any quantitative results, error bars, ablation studies, or implementation details for the policy gradient makes it impossible to evaluate the magnitude, robustness, or efficiency of the reported gains.

major comments (3)
  1. [Abstract and §4, Experiments] The abstract asserts that DMLR 'significantly improves reasoning and perception performance' across seven benchmarks, yet the manuscript supplies no quantitative results, tables, figures, error bars, or statistical tests. This absence is load-bearing for the central empirical claim and prevents assessment of whether the method outperforms baselines.
  2. [§3.2, Policy Gradient Formulation] The reward signal for the confidence-guided policy gradient on latent think tokens is not explicitly defined or derived. If confidence reduces to a simple scalar (e.g., token probability) rather than a structured alignment signal, the updates are likely to exhibit high variance, undermining the claim of stable dynamic interleaving; the stress-test concern on reward shaping is not addressed.
  3. [§3.3, Dynamic Visual Injection Strategy] The mechanism for retrieving and updating the set of best visual patches at each latent think token lacks a precise description of the retrieval criterion, update rule, and integration into the latent token representation. Without these details, it is unclear whether the injection achieves true latent-space interleaving or reduces to heuristic patch swapping.
minor comments (2)
  1. [§3] Notation for latent think tokens and the policy gradient objective should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
  2. [§3.2] The manuscript should include a clear statement of the number of parameters updated during test-time optimization and the stopping criterion for the policy gradient steps.
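
The variance concern in major comment 2 can be illustrated with a small Python experiment. The sketch below measures the spread of single-sample score-function gradient estimates for a scalar Gaussian policy, with and without a mean-reward baseline. The sigmoid reward is a toy proxy for a bounded confidence score and is an assumption made here, as are the function names and default settings; nothing in it is taken from the manuscript.

```python
import math
import random
import statistics


def sigmoid(x):
    """Toy bounded reward in (0, 1), standing in for a confidence score."""
    return 1.0 / (1.0 + math.exp(-x))


def gradient_samples(reward_fn, mean, sigma, n, use_baseline, seed=0):
    """Per-sample REINFORCE gradient estimates for the mean of N(mean, sigma^2)."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    rewards = [reward_fn(mean + sigma * e) for e in draws]
    # Subtracting the batch-mean reward removes the large constant offset
    # that a bounded, always-positive confidence reward carries.
    baseline = statistics.fmean(rewards) if use_baseline else 0.0
    return [(r - baseline) * e / sigma for r, e in zip(rewards, draws)]


def estimator_variance(reward_fn, use_baseline, mean=0.0, sigma=0.1, n=2000, seed=0):
    """Sample variance of the per-sample gradient estimates."""
    return statistics.variance(
        gradient_samples(reward_fn, mean, sigma, n, use_baseline, seed)
    )


if __name__ == "__main__":
    print("variance without baseline:", estimator_variance(sigmoid, False))
    print("variance with baseline:   ", estimator_variance(sigmoid, True))
```

Because a confidence score is always positive, the raw estimator multiplies a large, nearly constant reward by random noise, which inflates its variance. A variance analysis of this kind, run on the actual reward, is one way the authors could answer the referee's point.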

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript draft requires additional empirical results and methodological clarifications to fully support the central claims. We will revise the paper accordingly and address each point below.

Point-by-point responses
  1. Referee: [Abstract and §4, Experiments] The abstract asserts that DMLR 'significantly improves reasoning and perception performance' across seven benchmarks, yet the manuscript supplies no quantitative results, tables, figures, error bars, or statistical tests. This absence is load-bearing for the central empirical claim and prevents assessment of whether the method outperforms baselines.

    Authors: We acknowledge that the submitted draft omits the full quantitative results. In the revised version we will expand §4 with complete performance tables across all seven benchmarks, baseline comparisons, error bars from multiple random seeds, and statistical significance tests. These results will be summarized in the abstract to substantiate the reported gains.

    revision: yes

  2. Referee: [§3.2, Policy Gradient Formulation] The reward signal for the confidence-guided policy gradient on latent think tokens is not explicitly defined or derived. If confidence reduces to a simple scalar (e.g., token probability) rather than a structured alignment signal, the updates are likely to exhibit high variance, undermining the claim of stable dynamic interleaving; the stress-test concern on reward shaping is not addressed.

    Authors: We will revise §3.2 to explicitly define the reward as a composite signal: token-level log-probability combined with a structured multimodal alignment term obtained from cross-attention between latent think tokens and visual patch embeddings. We will also add the full policy-gradient derivation, variance analysis, and ablation studies on reward shaping to demonstrate stability.

    revision: yes

  3. Referee: [§3.3, Dynamic Visual Injection Strategy] The mechanism for retrieving and updating the set of best visual patches at each latent think token lacks a precise description of the retrieval criterion, update rule, and integration into the latent token representation. Without these details, it is unclear whether the injection achieves true latent-space interleaving or reduces to heuristic patch swapping.

    Authors: We will expand §3.3 with the exact retrieval criterion (latent-space cosine similarity with confidence gating), the update rule (priority-queue replacement under a dynamic threshold), and the integration operation (additive injection of selected patch features into the latent token). Pseudocode and a diagram will be added to show that the process performs true latent-space interleaving.

    revision: yes
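
The three ingredients named in the third response (cosine-similarity retrieval, a thresholded best-patch set, and additive injection) can be sketched in a few lines of Python. This is one possible reading of the simulated rebuttal and not the paper's algorithm: the function names, the stateless top-k selection via `heapq.nlargest` in place of a persistent priority queue, the fixed `threshold`, and the `scale` factor are all assumptions made for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_patches(latent, patches, k=2, threshold=0.0):
    """Return indices of up to k patches most similar to the latent.

    Patches scoring below `threshold` are never selected (a simple gate).
    """
    scored = [(cosine(latent, patch), index) for index, patch in enumerate(patches)]
    eligible = [pair for pair in scored if pair[0] >= threshold]
    return [index for _, index in heapq.nlargest(k, eligible)]


def inject(latent, patches, indices, scale=0.5):
    """Additive injection: add the scaled mean of the selected patches to the latent."""
    if not indices:
        return list(latent)
    mean_patch = [
        sum(patches[i][d] for i in indices) / len(indices) for d in range(len(latent))
    ]
    return [z + scale * m for z, m in zip(latent, mean_patch)]


if __name__ == "__main__":
    toy_patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    toy_latent = [1.0, 0.1]
    chosen = select_patches(toy_latent, toy_patches, k=2, threshold=0.5)
    print("selected patch indices:", chosen)
    print("latent after injection:", inject(toy_latent, toy_patches, chosen))
```

Calling `select_patches` again after each latent update is what would make the selection dynamic in this sketch. A version that reduces to heuristic patch swapping, as the referee worries, would look almost identical, which is why the promised pseudocode matters.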

Circularity Check

0 steps flagged

No significant circularity; method is a novel test-time optimization procedure

Full rationale

The paper introduces DMLR as a new framework that applies confidence-guided policy gradient optimization to latent think tokens and a dynamic visual injection strategy. This builds on existing MLLM and CoT concepts but does not reduce any claimed prediction or result to its own inputs by definition, fitting, or self-citation chains. The derivation chain consists of standard optimization techniques applied at test time, with performance claims supported by experiments on external benchmarks rather than tautological re-derivations. No load-bearing step equates outputs to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that human-like dynamic interleaving can be approximated by latent-space optimization; no new physical entities are introduced, but the latent think tokens and injection strategy are novel constructs whose effectiveness is asserted empirically.

free parameters (1)
  • confidence threshold for policy updates
    The optimization is described as confidence-guided; the exact threshold or weighting is not specified in the abstract and would require fitting or tuning.
axioms (1)
  • domain assumption: Human cognition unfolds through dynamic interleaving of reasoning and perception rather than linear steps
    Explicitly stated as the inspirational premise in the abstract.

pith-pipeline@v0.9.0 · 5519 in / 1231 out tokens · 24990 ms · 2026-05-16T22:59:56.868425+00:00 · methodology


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

  2. Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

    cs.CL 2026-01 unverdicted novelty 7.0

    Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.

  3. Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

  4. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...

  5. Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning

    cs.CV 2026-02 unverdicted novelty 5.0

    Reason-IAD improves explainable industrial anomaly detection by combining retrieval-augmented category knowledge with entropy-guided latent reasoning and dynamic visual patch injection in MLLMs.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 5 Pith papers · 24 internal anchors

  1. [2]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, ShenglongYe, JieShao, etal. Internvl3.5: Advancingopen-sourcemultimodalmodelsinversatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  2. [3]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, J...

  3. [4]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, 12 Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  4. [5]

    Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning

    Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 18798–18806, 2024

  5. [6]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

  6. [7]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  7. [8]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

  8. [9]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing" thinking with images" via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

  9. [10]

    Refocus: Visual editing as a chain of thought for structured image understanding

    Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Flo- rencio, andChaZhang. Refocus: Visualeditingasachainofthoughtforstructuredimageunderstanding. arXiv preprint arXiv:2501.05452, 2025

  10. [11]

    GRIT: Teaching MLLMs to Think with Images

    Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Naraya- naraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images, 2025. URL https://arxiv.org/abs/2505.15879

  11. [12]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought.arXiv preprint arXiv:2501.07542, 2025

  12. [13]

    Latent sketchpad: Sketching visual thoughts to elicit multimodal reasoning in mllms, 2025

    Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, and Furu Wei. Latent sketchpad: Sketching visual thoughts to elicit multimodal reasoning in mllms, 2025. URLhttps://arxiv.org/abs/2510.24514

  13. [14]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space, 2025. URLhttps://arxiv. org/abs/2412.06769

  14. [15]

    Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308,

    Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, and Zilong Zheng. Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space, 2025. URLhttps://arxiv.org/abs/2505.13308

  15. [16]

    Latent visual reasoning, 2025

    Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning, 2025. URLhttps://arxiv.org/abs/2509. 24251. 13 Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

  16. [17]

    Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

    Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens, 2025. URLhttps://arxiv.org/abs/ 2506.172182

  17. [18]

    Multimodal chain of continuous thought for latent-space reasoning in vision-language models, 2025

    Tan-Hanh Pham and Chris Ngo. Multimodal chain of continuous thought for latent-space reasoning in vision-language models, 2025. URLhttps://arxiv.org/abs/2508.12587

  18. [19]

    Memgen: Weaving generative latent memory for self-evolving agents, 2025

    Guibin Zhang, Muxin Fu, and Shuicheng Yan. Memgen: Weaving generative latent memory for self-evolving agents, 2025. URLhttps://arxiv.org/abs/2509.24704

  19. [21]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning, 2025. URLhttps: //arxiv.org/abs/2504.08837

  20. [22]

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang

    Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models, 2025. URLhttps://arxiv.org/abs/2505.21523

  21. [23]

    Reducing hallucinations in large vision-language models via latent space steering

    Sheng Liu, Haotian Ye, and James Zou. Reducing hallucinations in large vision-language models via latent space steering. InThe Thirteenth International Conference on Learning Representations, 2025

  22. [24]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024. URLhttps://arxiv.org/abs/2401. 06209

  23. [25]

    Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decoding

    Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zile Huang, Haochen Xue, Ziyang Chen, Zelin Peng, Zhiwei Yang, Sijin Zhou, Wenxue Li, Yulong Li, Wenxuan Song, Shiyan Su, Wei Feng, Jionglong Su, Mingquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak, and Zongyuan Ge. Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decodin...

  24. [26]

    OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

    Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, and Yu Cheng. Openthinkimg: Learning to think with images via visual tool reinforcement learning, 2025. URLhttps://arxiv.org/abs/2505.08617

  25. [27]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.arXiv preprint arXiv:2406.09403, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URLhttps://arxiv.org/abs/2406.09403

  26. [28]

    DeepEyesV2: Toward Agentic Multimodal Model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model, 2025. URLhttps://arxiv.org/abs/2511.05271

  27. [29]

    Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

    Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025. 14 Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

  28. [30]

    Look twice before you answer: Memory-space visual retracing for hallucination mitiga- tion in multimodal large language models.arXiv preprint arXiv:2410.03577, 2024

    Xin Zou, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Kening Zheng, Sirui Huang, Junkai Chen, Peijie Jiang, Jia Liu, Chang Tang, and Xuming Hu. Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models, 2025. URLhttps: //arxiv.org/abs/2410.03577

  29. [31]

    Fractional reasoning via latent steering vectors improves inference time compute.arXiv preprint arXiv:2506.15882, 2025

    Sheng Liu, Tianlang Chen, Pan Lu, Haotian Ye, Yizheng Chen, Lei Xing, and James Zou. Fractional reasoning via latent steering vectors improves inference time compute.arXiv preprint arXiv:2506.15882, 2025

  30. [32]

    ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025. URLhttps://arxiv. org/abs/2507.16815

  31. [33]

    Milr: Improving multimodal image generation via test-time latent reasoning, 2025

    Yapeng Mi, Hengli Li, Yanpeng Zhao, Chenxi Li, Huimin Wu, Xiaojian Ma, Song-Chun Zhu, Ying Nian Wu, and Qing Li. Milr: Improving multimodal image generation via test-time latent reasoning, 2025. URLhttps://arxiv.org/abs/2509.22761

  32. [34]

    Latent reasoning in llms as a vocabulary-space superposition, 2025

    Jingcheng Deng, Liang Pang, Zihao Wei, Shichen Xu, Zenghao Duan, Kun Xu, Yang Song, Huawei Shen, and Xueqi Cheng. Latent reasoning in llms as a vocabulary-space superposition, 2025. URL https://arxiv.org/abs/2510.15522

  33. [35]

    Spotlight on token perception for multimodal reinforcement learning, 2025

    Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, and Yu Cheng. Spotlight on token perception for multimodal reinforcement learning, 2025. URLhttps://arxiv.org/abs/ 2510.09285

  34. [36]

    Soft thinking: Unlocking the reasoning potential of llms in continuous concept space,

    Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space,

  35. [37]

    URLhttps://arxiv.org/abs/2505.15778

  36. [38]

    Soft tokens, hard truths, 2025

    Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens, hard truths, 2025. URLhttps://arxiv.org/abs/2509.19170

  37. [39]

    Thinking on the fly: Test-time reasoning enhancement via latent thought policy optimization, 2025

    Wengao Ye, Yan Liang, and Lianlei Shan. Thinking on the fly: Test-time reasoning enhancement via latent thought policy optimization, 2025. URLhttps://arxiv.org/abs/2510.04182

  38. [40]

    Feature extraction and steering for enhanced chain-of-thought reasoning in language models

    Zihao Li, Xu Wang, Yuzhe Yang, Ziyu Yao, Haoyi Xiong, and Mengnan Du. Feature extraction and steering for enhanced chain-of-thought reasoning in language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10904–1092...

  39. [41]

    ResearchAgent: Iterative research idea generation over scientific literature with large language models

    Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025. emnlp-main.552. URLhttps://aclanthology.org/2025.emnlp-main.552/

  40. [42]

    La- tent chain-of-thought for visual reasoning.arXiv preprint arXiv:2510.23925, 2025

    Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. Latent chain-of-thought for visual reasoning, 2025. URLhttps://arxiv.org/abs/ 2510.23925

  41. [43]

    Interleaved-modal chain-of-thought, 2025

    Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought, 2025. URL https://arxiv.org/abs/2411.19488

  42. [44]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, 15 Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang ...

  43. [45]

    OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alex...

  44. [46]

    Reasoning in the dark: Interleaved vision-text reasoning in latent space, 2025

    Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, and Liqiang Nie. Reasoning in the dark: Interleaved vision-text reasoning in latent space, 2025. URLhttps://arxiv.org/abs/ 2510.12603

  45. [47]

    Zhuo Zhi, Chen Feng, Adam Daneshmend, Mine Orlu, Andreas Demosthenous, Lu Yin, Da Li, Ziquan Liu, and Miguel R. D. Rodrigues. Seeing and reasoning with confidence: Supercharging multimodal llms with an uncertainty-aware agentic framework, 2025. URLhttps://arxiv.org/abs/2503.08308

  46. [48]

    Latentevolve: Self-evolving test-time scaling in latent space, 2025

    Guibin Zhang, Fanci Meng, Guancheng Wan, Zherui Li, Kun Wang, Zhenfei Yin, Lei Bai, and Shuicheng Yan. Latentevolve: Self-evolving test-time scaling in latent space, 2025. URLhttps://arxiv.org/ abs/2509.24771

  47. [49]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine learning, 8(3):229–256, 1992

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine learning, 8(3):229–256, 1992

  48. [50]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization, 2025. URLhttps://arxiv.org/abs/2503.10615

  49. [51]

    SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

    Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models, 2025. URLhttps://arxiv.org/abs/2504.11468

  50. [52]

    Qwen3-vl: Sharper vision, deeper thought, broader action

    Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action. Blog post,https://qwen. ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef, Sept 2025

  51. [53]

    Compositional chain-of-thought prompting for large multimodal models

    Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14420–14431, June 2024. 17 Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

  52. [54]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain- of-thought reasoning in language models, 2024. URLhttps://arxiv.org/abs/2302.00923

  53. [55]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/2310.02255

  54. [56]

    Measuring multimodal mathematical reasoning with math-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset, 2024. URL https://arxiv.org/abs/2402.14804

  55. [57]

    Mm-math: Advancing multimodal math evaluation with process evaluation and fine-grained classification

    Kai Sun, Yushi Bai, Ji Qi, Lei Hou, and Juanzi Li. Mm-math: Advancing multimodal math evaluation with process evaluation and fine-grained classification, 2024. URL https://arxiv.org/abs/2404.05091

  56. [58]

    HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models, 2024. URL https://arxiv.org/abs/2310.14566

  57. [59]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024. URL https://arxiv.org/abs/2403.20330

  58. [60]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022. URL https://arxiv.org/abs/2209.09513

  59. [61]

    Visualizing data using t-SNE

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

    Appendix A. More Details about Evaluation

    A.1 Datasets

    ▶ MathVistamini is a benchmark for mathematical reasoning in visual contexts, aggregatin...

  60. [62]

    Observe the side of the truck where the door would be located

  61. [63]

    Look for any visible gap or opening that would indicate the door is not closed

  62. [64]

    Check the overall appearance of the door; if it is open, there should be a noticeable gap or difference in the alignment compared to when it is closed. Upon examining the image: ▶ The door appears to be aligned with the rest of the truck’s body. ▶ There is no visible gap or difference in the alignment that would suggest the door is open. Based on this analy...

  63. [65]

    The beetle starts at P

  64. [66]

    It moves right to an adjacent vertex

  65. [67]

    It turns left to another adjacent vertex

  66. [68]

    It turns right again to reach a third adjacent vertex

  67. [69]

    It continues this pattern until it returns to P. By examining the cube’s structure, we can see that after 6 edges, the beetle will return to point P because each face of the cube has 4 edges, and the beetle covers all edges of one face before repeating the pattern on the next face. Thus, the beetle will walk along 6 edges before returning to point P. Qwen...

  68. [70]

    1. Initial Position and Direction: The beetle starts at point P and moves in the direction indicated by the arrow, which is along one edge of the cube. 2. First Edge: The beetle moves along the first edge and changes direction to the right. 3. Second Edge: Moving right, the beetle now faces a new direction and continues along the next edge. 4. Third Edge: After ...

  69. [71]

    7. Sixth Edge: Moving right, the beetle now faces yet another new direction and continues along the next edge. 8. Seventh Edge: After moving along the fourth edge, the beetle changes direction to the left once more. 9. Eighth Edge: Moving left, the beetle now faces a new direction and continues along the next edge. At this point, the beetle has completed a ful...